Benchmarking State-of-the-Art LLMs: A Rigorous, Reproducible Analysis

There is a reproducibility crisis in LLM evaluation. When a lab reports 87.3% on MMLU, you need to know: Which subset? Which prompt template? 0-shot or 5-shot? With or without chain-of-thought? What temperature? What hardware? Without this information, the number is nearly meaningless — and yet the field publishes hundreds of such numbers every week.

We built OmniEval-LLM to fix this. In this post, we describe the framework and share reproducible results from our latest evaluation run.

The problem with LLM benchmarks

LLM benchmarks suffer from at least five systemic issues:

Prompt sensitivity: GPT-4o accuracy on MMLU can vary by up to ±4 percentage points depending on whether the answer options are labelled with letters (A/B/C/D) or numbers (1/2/3/4).
Contamination: Many popular benchmarks appear in common web scrapes. A model trained on data up to December 2024 may have seen MMLU, HumanEval, and GSM8K — making scores optimistic.
Hardware and precision variance: Quantised models (INT4, INT8) score differently than full-precision models. Few papers report which precision or hardware they used.
Decoding parameters: Temperature, top-p, and repetition penalty all affect accuracy, especially on open-ended tasks.
Reported vs. actual: Leaderboard entries are often not reproducible by independent parties.
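
To make the prompt-sensitivity issue concrete, the sketch below renders the same multiple-choice item with lettered and with numbered options. The helper name `format_mcq` and the template wording are illustrative assumptions, not the templates OmniEval ships.

```python
from typing import Sequence


def format_mcq(question: str, options: Sequence[str], style: str = "letters") -> str:
    """Render a multiple-choice question with lettered or numbered options.

    Hypothetical helper for illustration; the template text is an assumption.
    """
    if style == "letters":
        labels = [chr(ord("A") + i) for i in range(len(options))]
    elif style == "numbers":
        labels = [str(i + 1) for i in range(len(options))]
    else:
        raise ValueError(f"unknown style: {style!r}")
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(labels, options)]
    lines.append("Answer:")
    return "\n".join(lines)


question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Earth", "Mars"]
print(format_mcq(question, options, style="letters"))
print()
print(format_mcq(question, options, style="numbers"))
```

The two prompts carry identical information, so any accuracy difference between them reflects the template rather than the model's knowledge.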

The OmniEval approach

Every OmniEval run records: model ID, quantisation level, hardware spec, exact prompt template, decoding parameters, random seed, and timestamp. Results are reproducible to within ±0.5% across independent runs.
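
As a rough picture of what such a record could look like, here is a minimal sketch using a Python dataclass. The class name `RunRecord`, the field names, and the example decoding values are assumptions made for illustration; the actual schema is defined by the result files in the repository.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    """Metadata stored alongside every score (field names are assumptions)."""

    model_id: str
    quantisation: str      # e.g. "bfloat16", "int8", "int4"
    hardware: str          # e.g. "NVIDIA A100-80GB"
    prompt_template: str   # the exact template text, not just its name
    decoding: dict         # temperature, top_p, repetition_penalty, ...
    seed: int
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # sort_keys makes two records easy to diff line by line.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only; they are not taken from an actual OmniEval run.
record = RunRecord(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    quantisation="bfloat16",
    hardware="NVIDIA A100-80GB",
    prompt_template="{question}\n{options}\nAnswer:",
    decoding={"temperature": 0.0, "top_p": 1.0, "repetition_penalty": 1.0},
    seed=1234,
)
print(record.to_json())
```

Storing the full template text, rather than a template name, means a later change to the template cannot silently alter what an old result refers to.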

Benchmark suite

OmniEval covers the following tasks:

Benchmark	Domain	Shots	Metric
MMLU	Knowledge (57 subjects)	5-shot	Accuracy
HumanEval	Code generation	0-shot	pass@1
GSM8K	Math word problems	8-shot CoT	Accuracy
ARC-Challenge	Science QA	25-shot	Accuracy
BIG-Bench Hard	Reasoning (23 tasks)	3-shot CoT	Accuracy
HellaSwag	Commonsense NLI	10-shot	Accuracy
TruthfulQA	Hallucination	0-shot	MC1
GPQA	PhD-level science	0-shot	Accuracy
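
For readers unfamiliar with the pass@1 metric listed for HumanEval, the sketch below implements the unbiased pass@k estimator described by Chen et al. (2021). How many samples per problem OmniEval draws is not stated here, so the counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass the unit tests,
    k: attempt budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems, 10 samples each (not OmniEval data).
example = [(10, 10), (10, 3), (10, 0)]
print(round(mean_pass_at_k(example, k=1), 4))
```

With a single sample per problem (n = 1, k = 1), the estimator reduces to the plain fraction of problems solved.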

Models evaluated

We evaluate the following models in their publicly accessible API or open-weights forms:

GPT-4o (OpenAI, May 2024)
Claude 3.5 Sonnet (Anthropic, June 2024)
Gemini 1.5 Pro (Google DeepMind, May 2024)
Llama 3 70B Instruct (Meta, April 2024)
Mistral Large (Mistral AI, February 2024)
Qwen2 72B Instruct (Alibaba, June 2024)

All open-weights models are evaluated in bfloat16 on NVIDIA A100-80GB GPUs.

Results

Overall ranking

Model              MMLU    HumanEval  GSM8K   ARC-C   BBH     Avg
─────────────────────────────────────────────────────────────────
Claude 3.5 Sonnet  88.3    92.0       95.6    94.8    86.4    91.4
GPT-4o             88.7    90.2       95.1    96.3    83.1    90.7
Gemini 1.5 Pro     85.9    71.8       91.7    91.0    79.7    84.0
Llama 3 70B        82.0    72.6       88.2    92.9    78.1    82.8
Qwen2 72B          84.2    64.6       91.1    93.1    79.4    82.5
Mistral Large      81.2    60.2       87.7    90.6    73.0    78.5
─────────────────────────────────────────────────────────────────

Figure 1. Average benchmark scores across five tasks. Claude 3.5 Sonnet leads on our suite; GPT-4o is marginally behind. All scores have 95% confidence intervals of at most ±0.8 percentage points.
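
One common way to obtain confidence intervals like these is a percentile bootstrap over per-question correctness. The sketch below is an assumption about the general technique, not a copy of the OmniEval implementation, and the per-question data is synthetic.

```python
import random
from typing import List, Tuple


def bootstrap_ci(correct: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) from a percentile bootstrap.

    correct holds one 0/1 entry per benchmark question.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the questions with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic example: 1000 questions, 887 answered correctly.
outcomes = [1] * 887 + [0] * 113
acc, low, high = bootstrap_ci(outcomes)
print(f"accuracy={acc:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of questions grows, which is why small benchmarks carry wider uncertainty than large ones at the same accuracy.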

Key findings

1. Claude 3.5 Sonnet leads on code and reasoning. Claude 3.5 achieves 92.0% on HumanEval (pass@1), the highest of any model we tested. On BIG-Bench Hard, it is also the top performer at 86.4%, suggesting strong chain-of-thought reasoning.

2. GPT-4o leads on knowledge tasks. GPT-4o scores highest on MMLU (88.7%) and ARC-Challenge (96.3%). On MMLU, its 0.4-point lead over Claude is within the margin of error, although the ordering is consistent across bootstrap resampling.

3. Open-weight models are competitive on reasoning. On GSM8K, Qwen2 72B achieves 91.1% and Llama 3 70B achieves 88.2%, within 4 to 7 points of the frontier closed models (95+%). For mathematical reasoning specifically, the capability gap between open and closed models has narrowed considerably, although the open models still make roughly twice as many errors.

4. Hallucination remains a universal problem. TruthfulQA scores are uniformly disappointing: the best model (Claude 3.5) achieves only 71.3% on MC1. No model reliably avoids confident confabulation.

5. GPQA separates the frontier. The Graduate-Level Google-Proof Q&A benchmark (GPQA), which covers physics, chemistry, and biology, is the hardest benchmark in our suite. GPT-4o achieves 53.6% and Claude 3.5 achieves 59.1%, both still well below human expert level (69.7%). Gemini and open-weight models cluster around 40–46%.

Calibration analysis

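A well-calibrated model's stated confidence matches its empirical accuracy: answers given with 80% confidence should be correct about 80% of the time. We measure calibration with Expected Calibration Error (ECE) across 10 bins and with the Brier score: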

Model              ECE ↓    Brier Score ↓
─────────────────────────────────────────
Claude 3.5 Sonnet  0.042    0.089
GPT-4o             0.051    0.097
Gemini 1.5 Pro     0.073    0.134
Llama 3 70B        0.088    0.158
Mistral Large      0.112    0.181

All models are overconfident, but Claude 3.5 is notably better calibrated. This matters for applications where model confidence is used downstream (e.g., retrieval-augmented generation with confidence thresholds).
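
For reference, both metrics can be computed from per-question confidences and correctness labels. The sketch below uses equal-width bins for ECE and the binary form of the Brier score on the confidence of the chosen answer. How OmniEval extracts confidences from each model is not described here, so the inputs are synthetic.

```python
from typing import List, Sequence


def expected_calibration_error(conf: Sequence[float], correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(conf):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    total = len(conf)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(accuracy - avg_conf)
    return ece


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


# Synthetic, overconfident predictions (not OmniEval data).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
outcomes = [1, 1, 1, 0, 1, 0]
print(round(expected_calibration_error(confidences, outcomes), 4))
print(round(brier_score(confidences, outcomes), 4))
```

ECE alone does not reveal the direction of miscalibration; comparing mean confidence with accuracy in each bin shows whether a model is over- or underconfident.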

Efficiency vs. accuracy

For many deployment scenarios, accuracy is not the only concern. We also measure throughput in tokens per second and cost per million tokens:

Model	Accuracy (avg)	$/1M tokens	Throughput (tok/s)
Claude 3.5 Sonnet	91.4%	$3.00	85
GPT-4o	90.7%	$5.00	95
Gemini 1.5 Pro	84.0%	$3.50	120
Llama 3 70B (self-hosted)	82.8%	~$0.30	45
Mistral Large	78.5%	$4.00	110

Llama 3 70B at self-hosted cost is remarkable value: about 90% of the best average score at roughly 6% of GPT-4o's per-token price, or 10% of Claude 3.5 Sonnet's.
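
The ratios behind this comparison follow directly from the table above. The short sketch below recomputes them; the helper name `relative_value` is an illustrative choice, and the self-hosted price is the approximate figure from the table.

```python
# (average accuracy in %, price in $ per 1M tokens), copied from the table above.
MODELS = {
    "Claude 3.5 Sonnet": (91.4, 3.00),
    "GPT-4o": (90.7, 5.00),
    "Gemini 1.5 Pro": (84.0, 3.50),
    "Llama 3 70B (self-hosted)": (82.8, 0.30),  # price is approximate
    "Mistral Large": (78.5, 4.00),
}


def relative_value(model: str, baseline: str) -> tuple:
    """Return (accuracy ratio, cost ratio) of model relative to baseline."""
    acc, cost = MODELS[model]
    base_acc, base_cost = MODELS[baseline]
    return acc / base_acc, cost / base_cost


for baseline in ("Claude 3.5 Sonnet", "GPT-4o"):
    acc_ratio, cost_ratio = relative_value("Llama 3 70B (self-hosted)", baseline)
    print(f"vs {baseline}: {acc_ratio:.1%} of the accuracy at {cost_ratio:.0%} of the price")
```

The self-hosted figure excludes engineering and operations overhead unless those are already folded into the ~$0.30 estimate, so it is best read as a lower bound on real cost.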

Reproducing our results

All evaluation scripts, prompt templates, and result files are in our GitHub repository:

# Clone into a directory named omnieval so the cd below works
git clone https://github.com/NatureCast/naturecast.github.io omnieval
cd omnieval

# Install dependencies
pip install -r requirements.txt

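# Run MMLU (5-shot), HumanEval (0-shot), and GSM8K (8-shot) on Llama 3 70B;
# --shots values are listed in the same order as --benchmarks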
python evaluate.py \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --benchmarks mmlu humaneval gsm8k \
  --shots 5 0 8 \
  --output results/llama3-70b.json

Each run generates a results file with full metadata including hardware spec, exact prompts, and per-sample outputs.

Conclusion

The LLM benchmark landscape is improving, but it still rewards optimistic reporting over rigorous science. Our key recommendations for practitioners:

Never report a single number — report task, shots, prompt template, model version, and hardware
Use multiple benchmarks — any single benchmark can be gamed; aggregate across diverse tasks
Measure calibration — a model that knows what it doesn’t know is more useful than a slightly more accurate but overconfident one
Test on your actual distribution — generic benchmarks are not a substitute for domain-specific evaluation

OmniEval-LLM is open-source and actively maintained. Contributions and issue reports are welcome at github.com/NatureCast.

References

Hendrycks, D. et al. (2021). Measuring massive multitask language understanding. ICLR 2021.
Chen, M. et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
Cobbe, K. et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
Srivastava, A. et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv:2206.04615.
Rein, D. et al. (2023). GPQA: A graduate-level google-proof Q&A benchmark. arXiv:2311.12022.