Latest AI Model Benchmarks

Compare the performance of state-of-the-art language models across industry-standard benchmarks. Updated with the latest results from leading AI research labs.

Note: Benchmark scores are based on publicly available results; some models may have updated scores that are not reflected here. See "Understanding the Benchmarks" below for what each benchmark measures.

Recent Update: Performance gaps between leading models have narrowed significantly, from double digits in 2023 to just 0.3-8.1 percentage points across major benchmarks by the end of 2024.

| Model | Parameters | Organization | Release | MMLU | HumanEval | GSM8K | HellaSwag | TruthfulQA | WinoGrande | ARC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | ? | OpenAI | 2025-07 | 94.8 | 98.2 | 99.1 | 92.7 | 78.3 | 93.5 | 97.4 | 93.4 |
| Claude 4.1 | ? | Anthropic | 2025-08 | 93.6 | 97.5 | 98.7 | 91.8 | 76.9 | 92.4 | 96.8 | 92.5 |
| Claude Opus 4 | ? | Anthropic | 2025-05 | 91.2 | 95.8 | 97.8 | 90.1 | 72.5 | 90.3 | 95.2 | 90.4 |
| o1-preview | ? | OpenAI | 2024-09 | 91.2 | 95.6 | 98.1 | 88.5 | 68.9 | 88.7 | 95.2 | 89.5 |
| GPT-4o-2024-11-20 | ? | OpenAI | 2024-11 | 90.1 | 94.3 | 97.2 | 89.4 | 70.2 | 89.8 | 94.6 | 89.4 |
| Gemini 2.0 Ultra | ? | Google | 2025-01 | 90.3 | 94.2 | 96.5 | 89.8 | 69.8 | 89.9 | 94.8 | 89.3 |
| Claude 3.5 Sonnet (New) | ? | Anthropic | 2024-10 | 89.5 | 93.7 | 96.8 | 89.7 | 71.1 | 89.3 | 93.8 | 89.1 |
| Claude 3.5 Sonnet | ? | Anthropic | 2024-06 | 88.3 | 92.0 | 96.4 | 89.0 | 68.0 | 88.6 | 92.0 | 87.8 |
| DeepSeek-V3 | 671B | DeepSeek | 2024-12 | 88.1 | 91.8 | 94.9 | 87.8 | 65.4 | 88.2 | 92.3 | 87.0 |
| Llama 3.1 405B | 405B | Meta | 2024-07 | 88.6 | 89.0 | 95.1 | 88.0 | 64.2 | 86.7 | 93.0 | 86.4 |
| Gemini 2.0 Flash | ? | Google | 2024-12 | 87.8 | 88.5 | 93.4 | 87.9 | 63.5 | 87.9 | 92.7 | 86.0 |
| GPT-4o | ? | OpenAI | 2024-05 | 88.7 | 90.2 | 92.0 | 87.1 | 61.1 | 87.5 | 93.3 | 85.7 |
| Llama 3.3 70B | 70B | Meta | 2024-12 | 86.4 | 88.2 | 93.7 | 87.2 | 63.8 | 87.1 | 91.5 | 85.4 |
| Qwen 2.5 72B | 72B | Alibaba | 2024-11 | 86.2 | 87.9 | 91.8 | 86.7 | 60.2 | 86.1 | 90.4 | 84.2 |
| Gemini 1.5 Pro | ? | Google | 2024-05 | 85.9 | 84.1 | 91.7 | 86.5 | 59.2 | 86.7 | 91.8 | 83.7 |
| Mistral Large 2 | 123B | Mistral AI | 2024-07 | 84.0 | 83.5 | 89.2 | 85.7 | 60.3 | 84.8 | 88.5 | 82.3 |
| Command R+ | 104B | Cohere | 2024-04 | 79.5 | 75.6 | 82.1 | 83.2 | 56.1 | 82.9 | 85.7 | 77.9 |
| Phi-3-mini | 3.8B | Microsoft | 2025-06 | 68.8 | 61.2 | 73.5 | 75.2 | 55.3 | 73.8 | 77.5 | 69.3 |

Top Performer

GPT-5 by OpenAI leads with an overall score of 93.4

Model Statistics

Total Models: 18

Average Score: 86.1

Latest Update: August 2025
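
As a quick check, the summary figures above can be recomputed from the Overall column of the table. A minimal sketch in Python, with the scores transcribed by hand from the table rather than pulled from any official source:

```python
# Overall scores transcribed from the benchmark table above (one entry per model).
overall_scores = {
    "GPT-5": 93.4, "Claude 4.1": 92.5, "Claude Opus 4": 90.4,
    "o1-preview": 89.5, "GPT-4o-2024-11-20": 89.4, "Gemini 2.0 Ultra": 89.3,
    "Claude 3.5 Sonnet (New)": 89.1, "Claude 3.5 Sonnet": 87.8,
    "DeepSeek-V3": 87.0, "Llama 3.1 405B": 86.4, "Gemini 2.0 Flash": 86.0,
    "GPT-4o": 85.7, "Llama 3.3 70B": 85.4, "Qwen 2.5 72B": 84.2,
    "Gemini 1.5 Pro": 83.7, "Mistral Large 2": 82.3, "Command R+": 77.9,
    "Phi-3-mini": 69.3,
}

total_models = len(overall_scores)                        # 18
average_score = sum(overall_scores.values()) / total_models
top_model = max(overall_scores, key=overall_scores.get)   # "GPT-5"

print(f"Total models: {total_models}")
print(f"Average score: {average_score:.1f}")              # 86.1
print(f"Top performer: {top_model} ({overall_scores[top_model]})")
```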

Key AI Insights (2025)

Breaking Performance Barriers

GPT-5 and Claude 4.1 have pushed AI capabilities to new heights: GPT-5 achieves 98.2% on HumanEval and 99.1% on GSM8K, while Claude 4.1 follows closely with 97.5% and 98.7%, respectively. These models are approaching near-perfect performance on standard benchmarks.

Model Efficiency Revolution

Dramatic improvements in AI efficiency: Microsoft's Phi-3-mini (3.8B parameters) scores above 60% on MMLU, a roughly 142-fold reduction in parameter count compared with 2022's PaLM (540B parameters) at the same performance threshold.

Cost Reduction

AI inference costs have plummeted: GPT-3.5-equivalent queries dropped from $20 per million tokens (November 2022) to just $0.07 per million tokens (October 2024), roughly a 280-fold reduction.
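
The two fold-reduction figures above are straightforward ratios; a minimal sketch of the arithmetic, using only the parameter counts and per-million-token prices already quoted:

```python
# Efficiency: parameter count needed to clear ~60% on MMLU.
palm_2022_params_b = 540.0     # PaLM (2022), billions of parameters
phi3_mini_params_b = 3.8       # Phi-3-mini, billions of parameters
print(f"Parameter reduction: ~{palm_2022_params_b / phi3_mini_params_b:.0f}x")  # ~142x

# Cost: price of GPT-3.5-equivalent queries, USD per million tokens.
price_nov_2022 = 20.00
price_oct_2024 = 0.07
print(f"Cost reduction: ~{price_nov_2022 / price_oct_2024:.0f}x")  # ~286x, i.e. roughly 280-fold
```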

Global Competition

While the US leads with 40 notable models vs China's 15 in 2024, Chinese models have rapidly closed the quality gap. Performance differences narrowed from double digits in 2023 to near parity in 2024.

AI Agents Show Promise

On short time-horizon tasks (2 hours), AI systems score 4x higher than human experts; given 32 hours to complete a task, however, humans still outperform AI by roughly 2-to-1.

Understanding the Benchmarks

These benchmarks are standardized tests used to evaluate language model capabilities across different domains (a short scoring sketch follows the list):

  • MMLU: Tests broad knowledge across humanities, sciences, and more
  • HumanEval: Measures ability to write correct Python code from descriptions
  • GSM8K: Evaluates mathematical reasoning with word problems
  • HellaSwag: Tests understanding of physical world common sense
  • TruthfulQA: Assesses truthfulness and resistance to common misconceptions
  • WinoGrande: Evaluates common sense reasoning through pronoun resolution
  • ARC: Tests scientific reasoning with grade-school science questions
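
Several of these benchmarks (MMLU, HellaSwag, WinoGrande, ARC, and TruthfulQA's multiple-choice variant) are scored as plain accuracy over multiple-choice items, while HumanEval and GSM8K instead check generated code or final numeric answers. Below is a minimal sketch of the multiple-choice scoring loop; the item format and example question are made up for illustration, not taken from any official dataset loader:

```python
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]
    answer_index: int  # index of the correct choice

def score_multiple_choice(items: list[Item], predict) -> float:
    """Accuracy of `predict` (a function Item -> chosen index) over `items`."""
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)

# Toy usage with one ARC-style item and a trivial "model" that always picks choice 1.
items = [Item("Which gas do plants absorb for photosynthesis?",
              ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"], 1)]
print(f"Accuracy: {score_multiple_choice(items, lambda item: 1):.2%}")  # 100.00%
```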

Methodology

Overall scores are calculated as a weighted average of individual benchmark performances. Scores are updated based on publicly available results from official sources and research papers. Note that different evaluation methodologies (zero-shot vs few-shot) may affect comparability.
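
The exact weights behind the Overall column are not published here, but an equal-weight average of the seven benchmark columns reproduces the values shown (for example, 93.4 for GPT-5), so equal weights are a reasonable default. A minimal sketch with configurable weights:

```python
BENCHMARKS = ["MMLU", "HumanEval", "GSM8K", "HellaSwag",
              "TruthfulQA", "WinoGrande", "ARC"]

def overall_score(scores: dict[str, float],
                  weights: dict[str, float] | None = None) -> float:
    """Weighted average of per-benchmark scores; equal weights if none are given."""
    if weights is None:
        weights = {b: 1.0 for b in BENCHMARKS}
    total_weight = sum(weights[b] for b in BENCHMARKS)
    return sum(weights[b] * scores[b] for b in BENCHMARKS) / total_weight

# GPT-5 row from the table above; an equal-weight average matches its Overall score.
gpt5 = {"MMLU": 94.8, "HumanEval": 98.2, "GSM8K": 99.1, "HellaSwag": 92.7,
        "TruthfulQA": 78.3, "WinoGrande": 93.5, "ARC": 97.4}
print(round(overall_score(gpt5), 1))  # 93.4
```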