Research Post  ·  April 07, 2025

Benchmarking State-of-the-Art LLMs: A Rigorous, Reproducible Analysis

NatureCast Research · llm · 14 min read

There is a reproducibility crisis in LLM evaluation. When a lab reports 87.3% on MMLU, you need to know: Which subset? Which prompt template? 0-shot or 5-shot? With or without chain-of-thought? What temperature? What hardware? Without this information, the number is nearly meaningless — and yet the field publishes hundreds of such numbers every week.

We built OmniEval-LLM to fix this. In this post, we describe the framework and share reproducible results from our latest evaluation run.


The problem with LLM benchmarks

LLM benchmarks suffer from at least five systemic issues:

  1. Prompt sensitivity: GPT-4o accuracy on MMLU can vary by ±4% depending on whether the question is formatted as multiple choice with letters (A/B/C/D) vs. numbers (1/2/3/4).

  2. Contamination: Many popular benchmarks appear in common web scrapes. A model trained on data up to December 2024 may have seen MMLU, HumanEval, and GSM8K — making scores optimistic.

  3. Hardware variance: Quantised models (INT4, INT8) score differently than full-precision models. Few papers report this.

  4. Decoding parameters: Temperature, top-p, and repetition penalty all affect accuracy, especially on open-ended tasks.

  5. Reported vs. actual: Leaderboard entries are often not reproducible by independent parties.
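
To make the prompt-sensitivity point concrete, here is a minimal sketch of how the same question can be rendered with letter or number labels. The helper name format_question and the "Answer:" suffix are illustrative assumptions, not OmniEval's actual template.

```python
# Hypothetical helper (not part of OmniEval): renders one multiple-choice
# question with either letter (A/B/C/D) or number (1/2/3/4) option labels.
def format_question(question: str, options: list[str], style: str = "letters") -> str:
    if style == "letters":
        labels = [chr(ord("A") + i) for i in range(len(options))]
    elif style == "numbers":
        labels = [str(i + 1) for i in range(len(options))]
    else:
        raise ValueError(f"unknown label style: {style!r}")
    lines = [question]
    lines += [f"{label}. {option}" for label, option in zip(labels, options)]
    lines.append("Answer:")  # assumed answer cue; real templates vary
    return "\n".join(lines)


# Illustrative question, not taken from MMLU.
question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Earth", "Mars"]
print(format_question(question, options, style="letters"))
print()
print(format_question(question, options, style="numbers"))
```

The two prompts differ only in their labels, which is exactly the kind of difference a results file needs to record.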

The OmniEval approach

Every OmniEval run records: model ID, quantisation level, hardware spec, exact prompt template, decoding parameters, random seed, and timestamp. Results are reproducible to within ±0.5% across independent runs.
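
As a rough illustration, the recorded fields could be captured in a structure like the one below. This is a sketch only: the class name RunRecord, the field names, and the example values (including the decoding settings and seed) are assumptions, not OmniEval's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


# Hypothetical record type; field names mirror the list above but are
# not taken from the OmniEval codebase.
@dataclass(frozen=True)
class RunRecord:
    model_id: str
    quantisation: str
    hardware: str
    prompt_template: str
    decoding: dict
    seed: int
    # UTC timestamp in ISO 8601 format, filled in when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep the output stable, so two runs can be diffed.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Example values are illustrative only.
record = RunRecord(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    quantisation="bfloat16",
    hardware="NVIDIA A100-80GB",
    prompt_template="{question}\nAnswer:",
    decoding={"temperature": 0.0, "top_p": 1.0},
    seed=1234,
)
print(record.to_json())
```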

Benchmark suite

OmniEval covers the following tasks:

Benchmark        Domain                    Shots        Metric
────────────────────────────────────────────────────────────────
MMLU             Knowledge (57 subjects)   5-shot       Accuracy
HumanEval        Code generation           0-shot       pass@1
GSM8K            Math word problems        8-shot CoT   Accuracy
ARC-Challenge    Science QA                25-shot      Accuracy
BIG-Bench Hard   Reasoning (23 tasks)      3-shot CoT   Accuracy
HellaSwag        Commonsense NLI           10-shot      Accuracy
TruthfulQA       Hallucination             0-shot       MC1
GPQA             PhD-level science         0-shot       Accuracy
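
For readers unfamiliar with the pass@1 metric in the HumanEval row: with one generated sample per problem it is simply the fraction of problems whose sample passes the unit tests. When n samples are drawn per problem, a commonly used unbiased estimator is pass@k = 1 - C(n-c, k) / C(n, k), where c is the number of passing samples. The sketch below implements that estimator; whether OmniEval draws one sample or several per problem is not stated here, so treat the multi-sample case as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts, not OmniEval data: two problems, 10 samples each.
print(mean_pass_at_k([(10, 3), (10, 5)], k=1))
```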

Models evaluated

We evaluate six models in their publicly accessible API or open-weights forms: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Qwen2 72B, Llama 3 70B, and Mistral Large.

All open-weights models are evaluated in bfloat16 on NVIDIA A100-80GB GPUs.

Results

Overall ranking

Model              MMLU    HumanEval  GSM8K   ARC-C   BBH     Avg
─────────────────────────────────────────────────────────────────
GPT-4o             88.7    90.2       95.1    96.3    83.1    90.7
Claude 3.5 Sonnet  88.3    92.0       95.6    94.8    86.4    91.4
Gemini 1.5 Pro     85.9    71.8       91.7    91.0    79.7    84.0
Qwen2 72B          84.2    64.6       91.1    93.1    79.4    82.5
Llama 3 70B        82.0    72.6       88.2    92.9    78.1    82.8
Mistral Large      81.2    60.2       87.7    90.6    73.0    78.5
─────────────────────────────────────────────────────────────────
[Figure: bar chart "Average Benchmark Score by Model", showing each model's average across MMLU, HumanEval, GSM8K, ARC-C, and BIG-Bench Hard.]
Figure 1. Average benchmark scores across five tasks. Claude 3.5 Sonnet leads on our suite; GPT-4o is marginally behind. The 95% confidence interval for every score is at most ±0.8 percentage points.
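
Confidence intervals like the ones quoted for Figure 1 can be estimated by bootstrap resampling of per-sample correctness. The following is a minimal percentile-bootstrap sketch; the function name, the number of resamples, and the synthetic data are assumptions for illustration and may differ from what OmniEval does internally.

```python
import random


def bootstrap_ci(
    correct: list[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for accuracy.

    correct holds 1 for a right answer and 0 for a wrong one.
    """
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic example (not an OmniEval result): 850 correct out of 1,000.
outcomes = [1] * 850 + [0] * 150
low, high = bootstrap_ci(outcomes)
print(f"accuracy 95% CI: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the test set grows, which is one reason small benchmarks produce noisier rankings.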

Key findings

1. Claude 3.5 Sonnet leads on code and reasoning. Claude 3.5 Sonnet achieves 92.0% on HumanEval (pass@1), the highest of any model we tested. On BIG-Bench Hard, it is also the top performer at 86.4%, suggesting strong chain-of-thought reasoning.

2. GPT-4o leads on knowledge tasks. GPT-4o tops both MMLU (88.7%) and ARC-Challenge (96.3%). Its 0.4-point lead over Claude 3.5 Sonnet on MMLU is within the margin of error, but it is consistent across bootstrap resampling.

3. Open-weight models are competitive on reasoning. On GSM8K, Qwen2 72B achieves 91.1% and Llama 3 70B achieves 88.2%, roughly 4 to 7 points behind the frontier closed models (95.1–95.6%). For mathematical reasoning specifically, the capability gap between open and closed models has narrowed substantially, though it has not closed.

4. Hallucination remains a universal problem. TruthfulQA scores are uniformly disappointing: the best model (Claude 3.5 Sonnet) achieves only 71.3% on MC1. No model reliably avoids confident confabulation.

5. GPQA separates the frontier. GPQA (Graduate-Level Google-Proof Q&A), a set of graduate-level physics, chemistry, and biology questions, is the hardest benchmark in our suite. GPT-4o achieves 53.6% and Claude 3.5 Sonnet achieves 59.1%, both still well below human expert level (69.7%). Gemini and the open-weight models cluster around 40–46%.

Calibration analysis

A well-calibrated model's confidence matches its accuracy: answers given with 80% confidence should be correct about 80% of the time. We measure calibration with Expected Calibration Error (ECE) across 10 bins, and also report the Brier score:

Model              ECE ↓    Brier Score ↓
─────────────────────────────────────────
Claude 3.5 Sonnet  0.042    0.089
GPT-4o             0.051    0.097
Gemini 1.5 Pro     0.073    0.134
Llama 3 70B        0.088    0.158
Mistral Large      0.112    0.181

All models are overconfident, but Claude 3.5 is notably better calibrated. This matters for applications where model confidence is used downstream (e.g., retrieval-augmented generation with confidence thresholds).
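
For reference, both calibration metrics can be computed from per-question confidences and correctness flags. The sketch below assumes equal-width confidence bins and a binary Brier score on the confidence of the chosen answer; the binning scheme and Brier variant OmniEval actually uses are not specified here, and the function names are illustrative.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over [0, 1].

    Each bin contributes |accuracy - mean confidence|, weighted by the
    share of samples that fall into it.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences: list[float], correct: list[int]) -> float:
    """Mean squared gap between confidence and the 0/1 outcome."""
    return sum((c - h) ** 2 for c, h in zip(confidences, correct)) / len(confidences)


# Toy data (not OmniEval output): an overconfident model that is right
# 3 times out of 4 while always reporting 90% confidence.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 1, 0]
print(expected_calibration_error(confs, hits), brier_score(confs, hits))
```

Lower is better for both metrics; an overconfident model shows up as bins where mean confidence exceeds accuracy.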

Efficiency vs. accuracy

For many deployment scenarios, accuracy is not the only concern. We also measure throughput in tokens per second and cost per million tokens:

Model                       Accuracy (avg)   $/1M tokens   Throughput (tok/s)
──────────────────────────────────────────────────────────────────────────────
Claude 3.5 Sonnet           91.4%            $3.00         85
GPT-4o                      90.7%            $5.00         95
Gemini 1.5 Pro              84.0%            $3.50         120
Llama 3 70B (self-hosted)   82.8%            ~$0.30        45
Mistral Large               78.5%            $4.00         110

Llama 3 70B at self-hosted cost is remarkable value: about 90% of frontier average accuracy at roughly 6% of GPT-4o's API price (10% of Claude 3.5 Sonnet's).
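
The value comparison is simple arithmetic on the table above. A short sketch, using only the accuracy and price figures from that table (the dictionary layout and function name are illustrative):

```python
# (average accuracy in %, price in $ per 1M tokens), copied from the table.
rows = {
    "Claude 3.5 Sonnet": (91.4, 3.00),
    "GPT-4o": (90.7, 5.00),
    "Llama 3 70B (self-hosted)": (82.8, 0.30),
}


def relative(model: str, baseline: str) -> tuple[float, float]:
    """Return (accuracy ratio, cost ratio) of model versus baseline."""
    acc_m, cost_m = rows[model]
    acc_b, cost_b = rows[baseline]
    return acc_m / acc_b, cost_m / cost_b


for baseline in ("GPT-4o", "Claude 3.5 Sonnet"):
    acc_ratio, cost_ratio = relative("Llama 3 70B (self-hosted)", baseline)
    print(f"vs {baseline}: {acc_ratio:.0%} of the accuracy at {cost_ratio:.0%} of the price")
```

Note that the self-hosted price is approximate and excludes engineering and operations overhead, so the cost ratio is best read as an order-of-magnitude guide.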

Reproducing our results

All evaluation scripts, prompt templates, and result files are in our GitHub repository:

# Clone into a directory named omnieval so that the cd below succeeds
git clone https://github.com/NatureCast/naturecast.github.io omnieval
cd omnieval

# Install dependencies
pip install -r requirements.txt

# Run MMLU (5-shot), HumanEval (0-shot), and GSM8K (8-shot) on Llama 3 70B
python evaluate.py \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --benchmarks mmlu humaneval gsm8k \
  --shots 5 0 8 \
  --output results/llama3-70b.json

Each run generates a results file with full metadata including hardware spec, exact prompts, and per-sample outputs.

Conclusion

The LLM benchmark landscape is improving, but it still rewards optimistic reporting over rigorous science. Our key recommendations for practitioners:

  1. Never report a single number — report task, shots, prompt template, model version, and hardware
  2. Use multiple benchmarks — any single benchmark can be gamed; aggregate across diverse tasks
  3. Measure calibration — a model that knows what it doesn’t know is more useful than a slightly more accurate but overconfident one
  4. Test on your actual distribution — generic benchmarks are not a substitute for domain-specific evaluation

OmniEval-LLM is open-source and actively maintained. Contributions and issue reports are welcome at github.com/NatureCast.


Tags

LLM benchmarks evaluation reproducibility MMLU HumanEval