Latest AI Model Benchmarks

Compare the performance of state-of-the-art language models across industry-standard benchmarks. Updated with the latest results from leading AI research labs.

Note: Benchmark scores are based on publicly available results; some models may have newer scores not yet reflected here. See "Understanding the Benchmarks" below for what each test measures.

Recent Update (Jan 2026): GPT-5.2 Pro achieves 100% on GSM8K and 98.9% on HumanEval. Claude Opus 4.5 leads SWE-bench Verified at 80.9%. MMLU is now considered saturated, with frontier models scoring above 93%.

All scores are percentages (higher is better).

| Model | Parameters | Organization | Release | MMLU | HumanEval | GSM8K | HellaSwag | TruthfulQA | WinoGrande | ARC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 Pro | ? | OpenAI | 2025-12 | 95.8 | 98.9 | 100.0 | 94.2 | 82.1 | 95.2 | 98.1 | 94.9 |
| Claude Opus 4.5 | ? | Anthropic | 2025-11 | 95.2 | 98.6 | 99.4 | 93.8 | 81.5 | 94.8 | 97.8 | 94.4 |
| Gemini 3 Pro | ? | Google | 2025-11 | 94.9 | 97.8 | 99.2 | 93.5 | 80.8 | 94.5 | 97.5 | 94.0 |
| GPT-5 | ? | OpenAI | 2025-07 | 94.8 | 98.2 | 99.1 | 92.7 | 78.3 | 93.5 | 97.4 | 93.4 |
| DeepSeek V3.2 | 685B MoE | DeepSeek | 2025-12 | 93.8 | 97.2 | 98.5 | 92.4 | 78.9 | 93.2 | 96.8 | 92.7 |
| Grok 4.1 | ? | xAI | 2025-11 | 93.2 | 96.8 | 98.2 | 92.1 | 79.5 | 93.0 | 96.5 | 92.5 |
| Claude Opus 4.1 | ? | Anthropic | 2025-08 | 93.6 | 97.5 | 98.7 | 91.8 | 76.9 | 92.4 | 96.8 | 92.5 |
| GPT-5.2-Codex | ? | OpenAI | 2025-12 | 92.1 | 99.2 | 98.8 | 91.5 | 76.2 | 92.8 | 96.4 | 92.4 |
| Llama 4 Maverick | 400B MoE | Meta | 2025-04 | 92.4 | 95.8 | 97.8 | 91.2 | 73.5 | 91.5 | 95.5 | 91.1 |
| Gemini 3 Flash | ? | Google | 2025-12 | 91.8 | 94.5 | 97.2 | 91.2 | 74.8 | 91.8 | 95.2 | 90.9 |
| Grok 4 | ? | xAI | 2025-07 | 91.5 | 95.2 | 97.5 | 90.8 | 74.2 | 91.2 | 95.8 | 90.9 |
| Claude Opus 4 | ? | Anthropic | 2025-05 | 91.2 | 95.8 | 97.8 | 90.1 | 72.5 | 90.3 | 95.2 | 90.4 |
| o1-preview | ? | OpenAI | 2024-09 | 91.2 | 95.6 | 98.1 | 88.5 | 68.9 | 88.7 | 95.2 | 89.5 |
| GPT-4o-2024-11-20 | ? | OpenAI | 2024-11 | 90.1 | 94.3 | 97.2 | 89.4 | 70.2 | 89.8 | 94.6 | 89.4 |
| Gemini 2.0 Ultra | ? | Google | 2025-01 | 90.3 | 94.2 | 96.5 | 89.8 | 69.8 | 89.9 | 94.8 | 89.3 |
| Claude 3.5 Sonnet (New) | ? | Anthropic | 2024-10 | 89.5 | 93.7 | 96.8 | 89.7 | 71.1 | 89.3 | 93.8 | 89.1 |
| Llama 4 Scout | 109B MoE | Meta | 2025-04 | 89.8 | 92.5 | 95.2 | 89.5 | 70.2 | 89.8 | 93.8 | 88.7 |
| DeepSeek-V3 | 671B MoE | DeepSeek | 2024-12 | 88.1 | 91.8 | 94.9 | 87.8 | 65.4 | 88.2 | 92.3 | 87.0 |
| Llama 3.1 405B | 405B | Meta | 2024-07 | 88.6 | 89.0 | 95.1 | 88.0 | 64.2 | 86.7 | 93.0 | 86.4 |
| Gemini 2.0 Flash | ? | Google | 2024-12 | 87.8 | 88.5 | 93.4 | 87.9 | 63.5 | 87.9 | 92.7 | 86.0 |
| Llama 3.3 70B | 70B | Meta | 2024-12 | 86.4 | 88.2 | 93.7 | 87.2 | 63.8 | 87.1 | 91.5 | 85.4 |
| Qwen 2.5 72B | 72B | Alibaba | 2024-11 | 86.2 | 87.9 | 91.8 | 86.7 | 60.2 | 86.1 | 90.4 | 84.2 |
| Mistral Large 2 | 123B | Mistral AI | 2024-07 | 84.0 | 83.5 | 89.2 | 85.7 | 60.3 | 84.8 | 88.5 | 82.3 |
| Phi-4 | 14B | Microsoft | 2024-12 | 84.8 | 82.6 | 89.5 | 84.2 | 62.8 | 83.5 | 88.2 | 82.2 |

Top Performer

GPT-5.2 Pro by OpenAI leads with an overall score of 94.9.

Model Statistics

Total Models: 24

Average Score: 89.6

Latest Update: January 2026

Key AI Insights (Jan 2026)

Benchmark Saturation Reached

GPT-5.2 Pro achieves 100% on GSM8K and 98.9% on HumanEval. Claude Opus 4.5 follows at 99.4% and 98.6%. MMLU is now considered "saturated" with frontier models above 93%. New benchmarks like HLE (Humanity's Last Exam) and ARC-AGI-2 are emerging as more meaningful differentiators.

Open-Weight Models Close the Gap

DeepSeek V3.2 (685B MoE) achieves 93.8% on MMLU and 97.2% on HumanEval, within two percentage points of the closed-source leaders at a fraction of the cost. Llama 4 Maverick is similarly competitive at 92.4% on MMLU.

Agentic Coding Benchmarks

Claude Opus 4.5 leads SWE-bench Verified at 80.9%, outperforming GPT-5.2 (74.9%) and Gemini 3 Pro (76.8%). GPT-5.2-Codex, optimized specifically for long-horizon coding, achieves 99.2% on HumanEval.

Context Window Race

Gemini 3 Pro leads with a 1M-token context window (2.5x GPT-5.2's 400K). Llama 4 Scout offers 10M tokens for specialized use cases. Long-context performance is now a key differentiator for enterprise adoption.

Hallucination Reduction

Grok 4.1 reduced its hallucination rate from 12.09% to 4.22%, a 65% relative reduction. GPT-5.2 reports 65% fewer hallucinations than GPT-5. TruthfulQA scores above 80% are now standard for frontier models.
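
For readers checking the arithmetic: the 65% figure is a relative reduction (a fraction of the original rate), not a percentage-point drop. A two-line sanity check:

```python
# Relative reduction in hallucination rate, using the figures quoted above.
before, after = 12.09, 4.22  # reported hallucination rates (%)
print(f"{(before - after) / before:.1%}")  # 65.1% -- matches the ~65% claim
```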

Understanding the Benchmarks

These benchmarks represent standardized tests used to evaluate language model capabilities across different domains:

  • MMLU: Tests broad knowledge across humanities, sciences, and more
  • HumanEval: Measures ability to write correct Python code from descriptions
  • GSM8K: Evaluates mathematical reasoning with grade-school word problems (see the scoring sketch after this list)
  • HellaSwag: Tests understanding of physical world common sense
  • TruthfulQA: Assesses truthfulness and resistance to common misconceptions
  • WinoGrande: Evaluates common sense reasoning through pronoun resolution
  • ARC: Tests scientific reasoning with grade-school science questions
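
Most of these benchmarks are scored as simple accuracy: the model's answer is extracted from its response and compared against a reference. Below is a minimal sketch of GSM8K-style exact-match grading, where only the final numeric answer counts. The responses and reference answers are placeholders for illustration, not real benchmark items or model output.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Pull the last number from a response; GSM8K-style grading
    conventionally compares only the final numeric answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of items whose extracted final answer equals the reference."""
    hits = sum(extract_final_number(p) == r for p, r in zip(predictions, references))
    return hits / len(references)

# Toy data for illustration only -- not real benchmark items or model output.
preds = ["... so the total is 42.", "She pays 18 dollars in all.", "Answer: 7"]
refs = ["42", "17", "7"]
print(f"accuracy: {exact_match_accuracy(preds, refs):.1%}")  # 66.7%
```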

Methodology

Overall scores are calculated as a weighted average of individual benchmark performances. Scores are updated based on publicly available results from official sources and research papers. Note that different evaluation methodologies (zero-shot vs few-shot) may affect comparability.
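
As a sketch of that calculation: with equal weights, the Overall column in the table above is reproduced exactly (e.g., GPT-5.2 Pro's seven scores average to 94.9). Equal weighting is an assumption inferred from the published numbers; the actual weights are not stated here.

```python
# Sketch of the overall-score calculation. Equal weights reproduce the
# table's Overall column; the weighting itself is an inferred assumption.
BENCHMARKS = ["MMLU", "HumanEval", "GSM8K", "HellaSwag",
              "TruthfulQA", "WinoGrande", "ARC"]

def overall_score(scores: dict[str, float],
                  weights: dict[str, float] | None = None) -> float:
    """Weighted average of benchmark scores; defaults to equal weights."""
    weights = weights or {b: 1.0 for b in BENCHMARKS}
    total_weight = sum(weights[b] for b in BENCHMARKS)
    return sum(scores[b] * weights[b] for b in BENCHMARKS) / total_weight

# GPT-5.2 Pro's row from the table above.
gpt_5_2_pro = {"MMLU": 95.8, "HumanEval": 98.9, "GSM8K": 100.0,
               "HellaSwag": 94.2, "TruthfulQA": 82.1,
               "WinoGrande": 95.2, "ARC": 98.1}
print(round(overall_score(gpt_5_2_pro), 1))  # 94.9
```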