Latest AI Model Benchmarks
Compare the performance of state-of-the-art language models across industry-standard benchmarks. Updated with the latest results from leading AI research labs.
Note: Benchmark scores are based on publicly available results, and some models may have updated scores that are not yet reflected here. See "Understanding the Benchmarks" below for what each benchmark measures.
Recent Update: Performance gaps between leading models have narrowed significantly, from double digits in 2023 to just 0.3-8.1 percentage points across major benchmarks by the end of 2024.
| Model | Organization | Parameters | Release | MMLU | HumanEval | GSM8K | HellaSwag | TruthfulQA | WinoGrande | ARC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | ? | 2025-07 | 94.8 | 98.2 | 99.1 | 92.7 | 78.3 | 93.5 | 97.4 | 93.4 |
| Claude 4.1 | Anthropic | ? | 2025-08 | 93.6 | 97.5 | 98.7 | 91.8 | 76.9 | 92.4 | 96.8 | 92.5 |
| Claude Opus 4 | Anthropic | ? | 2025-05 | 91.2 | 95.8 | 97.8 | 90.1 | 72.5 | 90.3 | 95.2 | 90.4 |
| o1-preview | OpenAI | ? | 2024-09 | 91.2 | 95.6 | 98.1 | 88.5 | 68.9 | 88.7 | 95.2 | 89.5 |
| GPT-4o-2024-11-20 | OpenAI | ? | 2024-11 | 90.1 | 94.3 | 97.2 | 89.4 | 70.2 | 89.8 | 94.6 | 89.4 |
| Gemini 2.0 Ultra | Google | ? | 2025-01 | 90.3 | 94.2 | 96.5 | 89.8 | 69.8 | 89.9 | 94.8 | 89.3 |
| Claude 3.5 Sonnet (New) | Anthropic | ? | 2024-10 | 89.5 | 93.7 | 96.8 | 89.7 | 71.1 | 89.3 | 93.8 | 89.1 |
| Claude 3.5 Sonnet | Anthropic | ? | 2024-06 | 88.3 | 92.0 | 96.4 | 89.0 | 68.0 | 88.6 | 92.0 | 87.8 |
| DeepSeek-V3 | DeepSeek | 671B | 2024-12 | 88.1 | 91.8 | 94.9 | 87.8 | 65.4 | 88.2 | 92.3 | 87.0 |
| Llama 3.1 405B | Meta | 405B | 2024-07 | 88.6 | 89.0 | 95.1 | 88.0 | 64.2 | 86.7 | 93.0 | 86.4 |
| Gemini 2.0 Flash | Google | ? | 2024-12 | 87.8 | 88.5 | 93.4 | 87.9 | 63.5 | 87.9 | 92.7 | 86.0 |
| GPT-4o | OpenAI | ? | 2024-05 | 88.7 | 90.2 | 92.0 | 87.1 | 61.1 | 87.5 | 93.3 | 85.7 |
| Llama 3.3 70B | Meta | 70B | 2024-12 | 86.4 | 88.2 | 93.7 | 87.2 | 63.8 | 87.1 | 91.5 | 85.4 |
| Qwen 2.5 72B | Alibaba | 72B | 2024-11 | 86.2 | 87.9 | 91.8 | 86.7 | 60.2 | 86.1 | 90.4 | 84.2 |
| Gemini 1.5 Pro | Google | ? | 2024-05 | 85.9 | 84.1 | 91.7 | 86.5 | 59.2 | 86.7 | 91.8 | 83.7 |
| Mistral Large 2 | Mistral AI | 123B | 2024-07 | 84.0 | 83.5 | 89.2 | 85.7 | 60.3 | 84.8 | 88.5 | 82.3 |
| Command R+ | Cohere | 104B | 2024-04 | 79.5 | 75.6 | 82.1 | 83.2 | 56.1 | 82.9 | 85.7 | 77.9 |
| Phi-3-mini | Microsoft | 3.8B | 2025-06 | 68.8 | 61.2 | 73.5 | 75.2 | 55.3 | 73.8 | 77.5 | 69.3 |
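The "narrowing gap" note at the top of the page can be made concrete directly from the table. The Python sketch below compares the top two rows, with values copied verbatim from the GPT-5 and Claude 4.1 rows; the 0.3-8.1 point range quoted earlier presumably reflects comparisons across a broader set of model pairs.

```python
# Per-benchmark gap between two rows of the table above.
# Scores are copied verbatim from the GPT-5 and Claude 4.1 rows;
# any other pair of rows can be compared the same way.
BENCHMARKS = ["MMLU", "HumanEval", "GSM8K", "HellaSwag",
              "TruthfulQA", "WinoGrande", "ARC"]

gpt5      = [94.8, 98.2, 99.1, 92.7, 78.3, 93.5, 97.4]
claude_41 = [93.6, 97.5, 98.7, 91.8, 76.9, 92.4, 96.8]

for name, a, b in zip(BENCHMARKS, gpt5, claude_41):
    print(f"{name:<11} gap = {a - b:.1f} points")
```

For this particular pair the gaps run from 0.4 points (GSM8K) to 1.4 points (TruthfulQA), consistent with the single-digit spread described above.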
Top Performer
GPT-5 by OpenAI leads with an overall score of 93.4
Model Statistics
Total Models: 18
Average Score: 86.1
Latest Update: August 2025
Key AI Insights (2025)
Breaking Performance Barriers
GPT-5 and Claude 4.1 have pushed AI capabilities to new heights: GPT-5 achieves 98.2% on HumanEval and 99.1% on GSM8K, while Claude 4.1 follows closely with 97.5% and 98.7%, respectively. These models are approaching near-perfect performance on standard benchmarks.
Model Efficiency Revolution
Dramatic improvements in AI efficiency: Microsoft's Phi-3-mini (3.8B parameters) now scores above 60% on MMLU, a threshold that in 2022 required PaLM's 540B parameters - a 142-fold reduction in model size for the same level of performance.
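The 142-fold figure follows directly from the two parameter counts quoted above; a quick, self-contained check:

```python
# Parameter counts quoted above for models clearing the ~60% MMLU threshold.
palm_params = 540e9   # PaLM (2022)
phi3_params = 3.8e9   # Phi-3-mini (2024)

print(f"Size reduction: {palm_params / phi3_params:.0f}x")  # -> 142x
```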
Cost Reduction
AI inference costs have plummeted: the price of querying a model at GPT-3.5 level dropped from $20.00 per million tokens (Nov 2022) to just $0.07 per million tokens (Oct 2024) - roughly a 280-fold reduction.
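The ratio can likewise be checked from the two prices quoted above. Note that $20.00 / $0.07 works out to about 286; the roughly 280-fold figure likely reflects rounding or slightly different underlying price points.

```python
# Price per million tokens for GPT-3.5-level inference, as quoted above.
price_nov_2022 = 20.00   # USD per million tokens
price_oct_2024 = 0.07    # USD per million tokens

print(f"Cost reduction: {price_nov_2022 / price_oct_2024:.0f}x")  # -> 286x
```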
Global Competition
The US led with 40 notable models to China's 15 in 2024, but Chinese models have rapidly closed the quality gap: performance differences narrowed from double digits in 2023 to near parity in 2024.
AI Agents Show Promise
On short time-horizon tasks (a 2-hour budget), AI systems score roughly 4x higher than human experts. When the budget is extended to 32 hours, however, humans still outperform AI by about 2-to-1.
Understanding the Benchmarks
These benchmarks represent standardized tests used to evaluate language model capabilities across different domains:
- MMLU: Tests broad knowledge across humanities, sciences, and more
- HumanEval: Measures ability to write correct Python code from descriptions
- GSM8K: Evaluates mathematical reasoning with word problems
- HellaSwag: Tests commonsense reasoning about everyday physical situations through sentence completion
- TruthfulQA: Assesses truthfulness and resistance to common misconceptions
- WinoGrande: Evaluates common sense reasoning through pronoun resolution
- ARC: Tests scientific reasoning with grade-school science questions
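Most of these benchmarks (MMLU, HellaSwag, WinoGrande, ARC, and the multiple-choice form of TruthfulQA) are scored as plain accuracy over multiple-choice items, while HumanEval runs generated code against unit tests and GSM8K checks the final numeric answer. The sketch below covers only the multiple-choice case; the item format and the `model_answer` callable are illustrative placeholders, not any lab's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCItem:
    question: str
    choices: List[str]   # e.g. ["A ...", "B ...", "C ...", "D ..."]
    answer: int          # index of the correct choice

def accuracy(items: List[MCItem], model_answer: Callable[[MCItem], int]) -> float:
    """Percentage of items where the model picks the correct choice index."""
    correct = sum(1 for item in items if model_answer(item) == item.answer)
    return 100.0 * correct / len(items)

# Toy usage with a single hypothetical item and a trivial "model".
items = [MCItem("2 + 2 = ?", ["3", "4", "5", "22"], answer=1)]
print(accuracy(items, model_answer=lambda item: 1))  # 100.0
```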
Methodology
Overall scores are calculated as a weighted average of individual benchmark performances. Scores are updated based on publicly available results from official sources and research papers. Note that different evaluation methodologies (zero-shot vs few-shot) may affect comparability.
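The benchmark weights are not published on this page, but an unweighted mean of the seven benchmark columns reproduces the listed Overall scores (for GPT-5: (94.8 + 98.2 + 99.1 + 92.7 + 78.3 + 93.5 + 97.4) / 7 ≈ 93.4), so the sketch below defaults to equal weights; substitute a weights dictionary if a different scheme is intended.

```python
from typing import Dict, Optional

def overall_score(scores: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-benchmark scores; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

gpt5 = {"MMLU": 94.8, "HumanEval": 98.2, "GSM8K": 99.1, "HellaSwag": 92.7,
        "TruthfulQA": 78.3, "WinoGrande": 93.5, "ARC": 97.4}
print(round(overall_score(gpt5), 1))  # -> 93.4, matching the table's Overall column
```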