There is a reproducibility crisis in LLM evaluation. When a lab reports 87.3% on MMLU, you need to know: Which subset? Which prompt template? 0-shot or 5-shot? With or without chain-of-thought? What temperature? What hardware? Without this information, the number is nearly meaningless — and yet the field publishes hundreds of such numbers every week.
We built OmniEval-LLM to fix this. In this post, we describe the framework and share reproducible results from our latest evaluation run.
The problem with LLM benchmarks
LLM benchmarks suffer from at least five systemic issues:
- Prompt sensitivity: GPT-4o accuracy on MMLU can vary by ±4% depending on whether the question is formatted as multiple choice with letters (A/B/C/D) vs. numbers (1/2/3/4).
- Contamination: Many popular benchmarks appear in common web scrapes. A model trained on data up to December 2024 may have seen MMLU, HumanEval, and GSM8K, which makes its scores optimistic.
- Hardware variance: Quantised models (INT4, INT8) score differently than full-precision models. Few papers report this.
- Decoding parameters: Temperature, top-p, and repetition penalty all affect accuracy, especially on open-ended tasks.
- Reported vs. actual: Leaderboard entries are often not reproducible by independent parties.
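To make the prompt-sensitivity point concrete, here is a minimal sketch of how the same multiple-choice item can be rendered with letter or number labels. The function name `format_question` and the exact layout (one option per line, trailing `Answer:`) are illustrative assumptions, not the templates OmniEval ships.

```python
from string import ascii_uppercase


def format_question(question, choices, style="letters"):
    """Render one multiple-choice item with letter or number labels.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    if style == "letters":
        labels = list(ascii_uppercase[: len(choices)])
    elif style == "numbers":
        labels = [str(i + 1) for i in range(len(choices))]
    else:
        raise ValueError(f"unknown style: {style!r}")

    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(labels, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    item = ("Which planet is closest to the Sun?", ["Venus", "Mercury", "Mars", "Earth"])
    print(format_question(*item, style="letters"))
    print()
    print(format_question(*item, style="numbers"))
```

Scoring the same model on both renderings of every item is one simple way to quantify how much of a reported score depends on formatting alone.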
The OmniEval approach
Every OmniEval run records: model ID, quantisation level, hardware spec, exact prompt template, decoding parameters, random seed, and timestamp. Scores are reproducible to within ±0.5 percentage points across independent runs.
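As a rough illustration of what such a record could look like, the sketch below stores the seven fields listed above in a dataclass and serialises them to JSON. The class name `RunRecord`, the field names, and the example values are assumptions made for this example; the actual OmniEval schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-run metadata record (field names are assumptions)."""

    model_id: str
    quantisation: str
    hardware: str
    prompt_template: str
    decoding: dict
    seed: int
    # Recorded in UTC so runs from different machines can be compared.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # sort_keys gives a stable layout that diffs cleanly between runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


if __name__ == "__main__":
    record = RunRecord(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        quantisation="bfloat16",
        hardware="NVIDIA A100-80GB",
        prompt_template="{question}\n{choices}\nAnswer:",
        decoding={"temperature": 0.0, "top_p": 1.0},
        seed=1234,
    )
    print(record.to_json())
```

Keeping the prompt template and decoding parameters inside the same record as the score means a number can never be separated from the conditions that produced it.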
Benchmark suite
OmniEval covers the following tasks:
| Benchmark | Domain | Shots | Metric |
|---|---|---|---|
| MMLU | Knowledge (57 subjects) | 5-shot | Accuracy |
| HumanEval | Code generation | 0-shot | pass@1 |
| GSM8K | Math word problems | 8-shot CoT | Accuracy |
| ARC-Challenge | Science QA | 25-shot | Accuracy |
| BIG-Bench Hard | Reasoning (23 tasks) | 3-shot CoT | Accuracy |
| HellaSwag | Commonsense NLI | 10-shot | Accuracy |
| TruthfulQA | Hallucination | 0-shot | MC1 |
| GPQA | PhD-level science | 0-shot | Accuracy |
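The pass@1 metric listed for HumanEval comes from Chen et al. (2021), who define an unbiased pass@k estimator over n sampled completions per problem, c of which pass the unit tests. The sketch below implements that estimator; whether OmniEval draws one sample per problem or several is not stated here, so the `n` used in the example is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (Chen et al., 2021).

    n: completions sampled, c: completions that pass, k: attempt budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k completions fail,
    # in which case every size-k subset contains a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Assumed setup: 10 samples per problem, 3 of them pass.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to c / n, the fraction of sampled completions that pass; the benchmark score is the mean of this value over all problems.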
Models evaluated
We evaluate the following models in their publicly accessible API or open-weights forms:
- GPT-4o (OpenAI, May 2024)
- Claude 3.5 Sonnet (Anthropic, June 2024)
- Gemini 1.5 Pro (Google DeepMind, May 2024)
- Llama 3 70B Instruct (Meta, April 2024)
- Mistral Large (Mistral AI, February 2024)
- Qwen2 72B Instruct (Alibaba, June 2024)
All open-weights models are evaluated in bfloat16 on NVIDIA A100-80GB GPUs.
Results
Overall ranking
| Model | MMLU | HumanEval | GSM8K | ARC-C | BBH | Avg |
|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 88.3 | 92.0 | 95.6 | 94.8 | 86.4 | 91.4 |
| GPT-4o | 88.7 | 90.2 | 95.1 | 96.3 | 83.1 | 90.7 |
| Gemini 1.5 Pro | 85.9 | 71.8 | 91.7 | 91.0 | 79.7 | 84.0 |
| Llama 3 70B | 82.0 | 72.6 | 88.2 | 92.9 | 78.1 | 82.8 |
| Qwen2 72B | 84.2 | 64.6 | 91.1 | 93.1 | 79.4 | 82.5 |
| Mistral Large | 81.2 | 60.2 | 87.7 | 90.6 | 73.0 | 78.5 |

All scores are percentages. Avg is the unweighted mean of the five benchmarks shown, and rows are sorted by Avg.
Key findings
1. Claude 3.5 Sonnet leads on code and reasoning: Claude 3.5 Sonnet achieves 92.0% on HumanEval (pass@1), the highest of any model we tested. On BIG-Bench Hard, it is also the top performer at 86.4%, suggesting strong chain-of-thought reasoning.
2. GPT-4o leads on knowledge tasks: GPT-4o tops MMLU (88.7%) and ARC-Challenge (96.3%). Its 0.4-point MMLU lead over Claude 3.5 Sonnet is within the margin of error, although the direction of the difference is consistent across bootstrap resamples. The 1.5-point lead on ARC-Challenge is larger.
3. Open-weight models are competitive on math reasoning: Qwen2 72B reaches 91.1% and Llama 3 70B 88.2% on GSM8K, roughly 4 to 7 points behind the frontier closed models (95+%). For mathematical reasoning specifically, the capability gap between open and closed models has narrowed to single digits.
4. Hallucination remains a universal problem: TruthfulQA scores are uniformly disappointing. The best model (Claude 3.5 Sonnet) achieves only 71.3% on MC1, and no model reliably avoids confident confabulation.
5. GPQA separates the frontier: GPQA (Graduate-Level Google-Proof Q&A) is the hardest benchmark in our suite. GPT-4o achieves 53.6% and Claude 3.5 Sonnet 59.1%, both well below the human expert level (69.7%). Gemini and open-weight models cluster around 40–46%.
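Small gaps such as the MMLU difference between GPT-4o and Claude 3.5 Sonnet are usually checked with a paired bootstrap over per-item correctness. The following is a minimal sketch of that technique, assuming per-sample 0/1 correctness lists are available for both models on the same items; the function name, the 2,000-resample default, and the toy data are illustrative assumptions rather than the exact procedure used in this evaluation.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Percentile 95% CI for accuracy(A) - accuracy(B).

    correct_a / correct_b are 0/1 lists aligned on the same items, so each
    resample draws the same item indices for both models (paired design).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Toy data: model A is right on 9 of 10 items, model B on 8 of 10.
    a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
    print(paired_bootstrap_diff(a, b))
```

If the interval contains zero, the ranking between the two models on that benchmark should be treated as unresolved.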
Calibration analysis
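A well-calibrated model's stated confidence matches its empirical accuracy: answers given with 80% confidence should be correct about 80% of the time. We measure calibration with Expected Calibration Error (ECE) across 10 bins, alongside the Brier score: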
| Model | ECE ↓ | Brier Score ↓ |
|---|---|---|
| Claude 3.5 Sonnet | 0.042 | 0.089 |
| GPT-4o | 0.051 | 0.097 |
| Gemini 1.5 Pro | 0.073 | 0.134 |
| Llama 3 70B | 0.088 | 0.158 |
| Mistral Large | 0.112 | 0.181 |
All models are overconfident, but Claude 3.5 is notably better calibrated. This matters for applications where model confidence is used downstream (e.g., retrieval-augmented generation with confidence thresholds).
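For readers who want to compute these metrics on their own outputs, here is a minimal sketch of ECE with equal-width bins and of the Brier score. It assumes each prediction comes with a confidence in [0, 1] for the chosen answer and a 0/1 correctness flag; how confidences are extracted from each model is not covered here, and the binary form of the Brier score used below is an assumption about the setup.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean |avg confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - h) ** 2 for c, h in zip(confidences, correct)) / len(confidences)


if __name__ == "__main__":
    # Toy example: the model claims 90% confidence but is right 3 times out of 4.
    confs = [0.9, 0.9, 0.9, 0.9]
    hits = [1, 1, 1, 0]
    print(expected_calibration_error(confs, hits))
    print(brier_score(confs, hits))
```

In the toy example the average confidence (0.9) exceeds the accuracy (0.75), which is the overconfidence pattern described above.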
Efficiency vs. accuracy
For many deployment scenarios, accuracy is not the only concern. We also measure generation throughput (tokens per second) and cost per million tokens:
| Model | Accuracy (avg) | $/1M tokens | Throughput (tok/s) |
|---|---|---|---|
| Claude 3.5 Sonnet | 91.4% | $3.00 | 85 |
| GPT-4o | 90.7% | $5.00 | 95 |
| Gemini 1.5 Pro | 84.0% | $3.50 | 120 |
| Llama 3 70B (self-hosted) | 82.8% | ~$0.30 | 45 |
| Mistral Large | 78.5% | $4.00 | 110 |
Self-hosted Llama 3 70B is remarkable value: about 91% of the frontier models' average accuracy at roughly 6% of GPT-4o's listed price (10% of Claude 3.5 Sonnet's).
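The value comparison is simple arithmetic over the table above, sketched below so it can be re-run against other baselines. The numbers are copied from the table; the helper name `relative_to` is an assumption made for this example.

```python
# (average accuracy in %, price in $ per 1M tokens), copied from the table above.
TABLE = {
    "Claude 3.5 Sonnet": (91.4, 3.00),
    "GPT-4o": (90.7, 5.00),
    "Gemini 1.5 Pro": (84.0, 3.50),
    "Llama 3 70B (self-hosted)": (82.8, 0.30),
    "Mistral Large": (78.5, 4.00),
}


def relative_to(model, baseline, table=TABLE):
    """Return (accuracy ratio, price ratio) of `model` against `baseline`."""
    acc, cost = table[model]
    base_acc, base_cost = table[baseline]
    return acc / base_acc, cost / base_cost


if __name__ == "__main__":
    for baseline in ("Claude 3.5 Sonnet", "GPT-4o"):
        acc_ratio, cost_ratio = relative_to("Llama 3 70B (self-hosted)", baseline)
        print(f"vs {baseline}: {acc_ratio:.0%} of the accuracy at {cost_ratio:.0%} of the price")
```

The price ratio depends heavily on which frontier model is used as the baseline, so it is worth stating the baseline whenever a figure like this is quoted.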
Reproducing our results
All evaluation scripts, prompt templates, and result files are in our GitHub repository:
```shell
git clone https://github.com/NatureCast/naturecast.github.io
cd omnieval

# Install dependencies
pip install -r requirements.txt

# Run MMLU (5-shot), HumanEval (0-shot), and GSM8K (8-shot) on Llama 3 70B
python evaluate.py \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --benchmarks mmlu humaneval gsm8k \
  --shots 5 0 8 \
  --output results/llama3-70b.json
```
Each run generates a results file with full metadata including hardware spec, exact prompts, and per-sample outputs.
Conclusion
The LLM benchmark landscape is improving, but it still rewards optimistic reporting over rigorous science. Our key recommendations for practitioners:
- Never report a single number — report task, shots, prompt template, model version, and hardware
- Use multiple benchmarks — any single benchmark can be gamed; aggregate across diverse tasks
- Measure calibration — a model that knows what it doesn’t know is more useful than a slightly more accurate but overconfident one
- Test on your actual distribution — generic benchmarks are not a substitute for domain-specific evaluation
OmniEval-LLM is open-source and actively maintained. Contributions and issue reports are welcome at github.com/NatureCast.
References
- Hendrycks, D. et al. (2021). Measuring massive multitask language understanding. ICLR 2021.
- Chen, M. et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
- Cobbe, K. et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
- Srivastava, A. et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv:2206.04615.
- Rein, D. et al. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv:2311.12022.