# Latest AI Model Benchmarks
Compare the performance of state-of-the-art language models across industry-standard benchmarks. Updated with the latest results from leading AI research labs.
**Note:** Benchmark scores are based on publicly available results. Some models may have updated scores not reflected here. Click on any benchmark name to learn more about what it measures.

**Recent Update (Jan 2026):** GPT-5.2 Pro achieves 100% on GSM8K and 98.9% on HumanEval. Claude Opus 4.5 leads SWE-bench Verified at 80.9%. MMLU is now considered saturated, with frontier models above 93%.
| Model | Organization | Parameters | Release | MMLU | HumanEval | GSM8K | HellaSwag | TruthfulQA | WinoGrande | ARC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 Pro | OpenAI | ? | 2025-12 | 95.8 | 98.9 | 100.0 | 94.2 | 82.1 | 95.2 | 98.1 | 94.9 |
| Claude Opus 4.5 | Anthropic | ? | 2025-11 | 95.2 | 98.6 | 99.4 | 93.8 | 81.5 | 94.8 | 97.8 | 94.4 |
| Gemini 3 Pro | Google | ? | 2025-11 | 94.9 | 97.8 | 99.2 | 93.5 | 80.8 | 94.5 | 97.5 | 94.0 |
| GPT-5 | OpenAI | ? | 2025-07 | 94.8 | 98.2 | 99.1 | 92.7 | 78.3 | 93.5 | 97.4 | 93.4 |
| DeepSeek V3.2 | DeepSeek | 685B MoE | 2025-12 | 93.8 | 97.2 | 98.5 | 92.4 | 78.9 | 93.2 | 96.8 | 92.7 |
| Grok 4.1 | xAI | ? | 2025-11 | 93.2 | 96.8 | 98.2 | 92.1 | 79.5 | 93.0 | 96.5 | 92.5 |
| Claude Opus 4.1 | Anthropic | ? | 2025-08 | 93.6 | 97.5 | 98.7 | 91.8 | 76.9 | 92.4 | 96.8 | 92.5 |
| GPT-5.2-Codex | OpenAI | ? | 2025-12 | 92.1 | 99.2 | 98.8 | 91.5 | 76.2 | 92.8 | 96.4 | 92.4 |
| Llama 4 Maverick | Meta | 400B MoE | 2025-04 | 92.4 | 95.8 | 97.8 | 91.2 | 73.5 | 91.5 | 95.5 | 91.1 |
| Gemini 3 Flash | Google | ? | 2025-12 | 91.8 | 94.5 | 97.2 | 91.2 | 74.8 | 91.8 | 95.2 | 90.9 |
| Grok 4 | xAI | ? | 2025-07 | 91.5 | 95.2 | 97.5 | 90.8 | 74.2 | 91.2 | 95.8 | 90.9 |
| Claude Opus 4 | Anthropic | ? | 2025-05 | 91.2 | 95.8 | 97.8 | 90.1 | 72.5 | 90.3 | 95.2 | 90.4 |
| o1-preview | OpenAI | ? | 2024-09 | 91.2 | 95.6 | 98.1 | 88.5 | 68.9 | 88.7 | 95.2 | 89.5 |
| GPT-4o-2024-11-20 | OpenAI | ? | 2024-11 | 90.1 | 94.3 | 97.2 | 89.4 | 70.2 | 89.8 | 94.6 | 89.4 |
| Gemini 2.0 Ultra | Google | ? | 2025-01 | 90.3 | 94.2 | 96.5 | 89.8 | 69.8 | 89.9 | 94.8 | 89.3 |
| Claude 3.5 Sonnet (New) | Anthropic | ? | 2024-10 | 89.5 | 93.7 | 96.8 | 89.7 | 71.1 | 89.3 | 93.8 | 89.1 |
| Llama 4 Scout | Meta | 109B MoE | 2025-04 | 89.8 | 92.5 | 95.2 | 89.5 | 70.2 | 89.8 | 93.8 | 88.7 |
| DeepSeek-V3 | DeepSeek | 671B MoE | 2024-12 | 88.1 | 91.8 | 94.9 | 87.8 | 65.4 | 88.2 | 92.3 | 87.0 |
| Llama 3.1 405B | Meta | 405B | 2024-07 | 88.6 | 89.0 | 95.1 | 88.0 | 64.2 | 86.7 | 93.0 | 86.4 |
| Gemini 2.0 Flash | Google | ? | 2024-12 | 87.8 | 88.5 | 93.4 | 87.9 | 63.5 | 87.9 | 92.7 | 86.0 |
| Llama 3.3 70B | Meta | 70B | 2024-12 | 86.4 | 88.2 | 93.7 | 87.2 | 63.8 | 87.1 | 91.5 | 85.4 |
| Qwen 2.5 72B | Alibaba | 72B | 2024-11 | 86.2 | 87.9 | 91.8 | 86.7 | 60.2 | 86.1 | 90.4 | 84.2 |
| Mistral Large 2 | Mistral AI | 123B | 2024-07 | 84.0 | 83.5 | 89.2 | 85.7 | 60.3 | 84.8 | 88.5 | 82.3 |
| Phi-4 | Microsoft | 14B | 2024-12 | 84.8 | 82.6 | 89.5 | 84.2 | 62.8 | 83.5 | 88.2 | 82.2 |
## Top Performer

**GPT-5.2 Pro** by OpenAI leads with an overall score of 94.9.
## Model Statistics

- Total models: 24
- Average overall score: 89.6
- Latest update: January 2026
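These headline statistics can be reproduced directly from the Overall column of the table above; a quick sanity check in Python (scores copied from the table):

```python
# Overall scores for the 24 models listed in the benchmark table.
overall_scores = [
    94.9, 94.4, 94.0, 93.4, 92.7, 92.5, 92.5, 92.4, 91.1, 90.9, 90.9, 90.4,
    89.5, 89.4, 89.3, 89.1, 88.7, 87.0, 86.4, 86.0, 85.4, 84.2, 82.3, 82.2,
]

total_models = len(overall_scores)                             # 24
average_score = round(sum(overall_scores) / total_models, 1)   # 89.6
```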
## Key AI Insights (Jan 2026)

### Benchmark Saturation Reached

GPT-5.2 Pro achieves 100% on GSM8K and 98.9% on HumanEval; Claude Opus 4.5 follows at 99.4% and 98.6%, respectively. MMLU is now considered "saturated", with frontier models above 93%. New benchmarks such as HLE (Humanity's Last Exam) and ARC-AGI-2 are emerging as more meaningful differentiators.
### Open-Weight Models Close the Gap

DeepSeek V3.2 (685B MoE) scores 93.8% on MMLU and 97.2% on HumanEval, within about two points of the closed-source leaders at a fraction of the cost. Llama 4 Maverick is similarly competitive at 92.4% MMLU.
### Agentic Coding Benchmarks

Claude Opus 4.5 leads SWE-bench Verified at 80.9%, outperforming GPT-5.2 (74.9%) and Gemini 3 Pro (76.8%). GPT-5.2-Codex, optimized specifically for long-horizon coding, achieves 99.2% on HumanEval.
### Context Window Race

Gemini 3 Pro leads with a 1M-token context window (2.5x GPT-5.2's 400K). Llama 4 Scout offers 10M tokens for specialized use cases. Long-context performance is now a key differentiator for enterprise adoption.
### Hallucination Reduction

Grok 4.1 cut its hallucination rate from 12.09% to 4.22%, a 65% improvement. GPT-5.2 reports 65% fewer hallucinations than GPT-5. TruthfulQA scores above 80% are now standard for frontier models.
## Understanding the Benchmarks
These benchmarks represent standardized tests used to evaluate language model capabilities across different domains:
- MMLU: Tests broad knowledge across humanities, sciences, and more
- HumanEval: Measures ability to write correct Python code from descriptions
- GSM8K: Evaluates mathematical reasoning with word problems
- HellaSwag: Tests commonsense understanding of everyday situations via sentence completion
- TruthfulQA: Assesses truthfulness and resistance to common misconceptions
- WinoGrande: Evaluates common sense reasoning through pronoun resolution
- ARC: Tests scientific reasoning with grade-school science questions
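Code benchmarks such as HumanEval are typically reported as pass@k: the probability that at least one of k sampled completions passes all unit tests. A minimal sketch of the standard unbiased estimator (from the original HumanEval evaluation), where `n` samples are drawn and `c` of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total completions sampled per problem
    c: completions that pass all unit tests
    k: evaluation budget
    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset contains no passing completion.
    """
    if n - c < k:
        # Fewer failing samples than k: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 2 pass, `pass_at_k(10, 2, 1)` gives 0.2, matching the intuitive per-sample pass rate for k = 1.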
## Methodology
Overall scores are calculated as a weighted average of individual benchmark performances. Scores are updated based on publicly available results from official sources and research papers. Note that different evaluation methodologies (zero-shot vs few-shot) may affect comparability.
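As a concrete illustration of the averaging step: a uniform weighting (1/7 per benchmark) reproduces GPT-5.2 Pro's published overall of 94.9 from its seven per-benchmark scores, which suggests the weights are equal. A minimal sketch:

```python
from statistics import mean

# GPT-5.2 Pro's per-benchmark scores, copied from the table above.
gpt_52_pro = {
    "MMLU": 95.8, "HumanEval": 98.9, "GSM8K": 100.0, "HellaSwag": 94.2,
    "TruthfulQA": 82.1, "WinoGrande": 95.2, "ARC": 98.1,
}

# Equal-weight average, rounded to one decimal as in the table.
overall = round(mean(gpt_52_pro.values()), 1)  # 94.9
```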