Latest AI Model Benchmarks
Compare the performance of state-of-the-art language models across industry-standard benchmarks. Updated with the latest results from leading AI research labs.
Note: Benchmark scores are based on publicly available results, and some models may have updated scores that are not yet reflected here. See "Understanding the Benchmarks" below for what each benchmark measures.
Recent Update: Performance gaps between leading models have narrowed significantly, from double digits in 2023 to just 0.3-8.1 percentage points across major benchmarks by the end of 2024.
| Model | Organization | Parameters | Release | MMLU | HumanEval | GSM8K | HellaSwag | TruthfulQA | WinoGrande | ARC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | ? | 2025-07 | 94.8 | 98.2 | 99.1 | 92.7 | 78.3 | 93.5 | 97.4 | 93.4 |
| Claude 4.1 | Anthropic | ? | 2025-08 | 93.6 | 97.5 | 98.7 | 91.8 | 76.9 | 92.4 | 96.8 | 92.5 |
| Claude Opus 4 | Anthropic | ? | 2025-05 | 91.2 | 95.8 | 97.8 | 90.1 | 72.5 | 90.3 | 95.2 | 90.4 |
| o1-preview | OpenAI | ? | 2024-09 | 91.2 | 95.6 | 98.1 | 88.5 | 68.9 | 88.7 | 95.2 | 89.5 |
| GPT-4o-2024-11-20 | OpenAI | ? | 2024-11 | 90.1 | 94.3 | 97.2 | 89.4 | 70.2 | 89.8 | 94.6 | 89.4 |
| Gemini 2.0 Ultra | Google | ? | 2025-01 | 90.3 | 94.2 | 96.5 | 89.8 | 69.8 | 89.9 | 94.8 | 89.3 |
| Claude 3.5 Sonnet (New) | Anthropic | ? | 2024-10 | 89.5 | 93.7 | 96.8 | 89.7 | 71.1 | 89.3 | 93.8 | 89.1 |
| Claude 3.5 Sonnet | Anthropic | ? | 2024-06 | 88.3 | 92.0 | 96.4 | 89.0 | 68.0 | 88.6 | 92.0 | 87.8 |
| DeepSeek-V3 | DeepSeek | 671B | 2024-12 | 88.1 | 91.8 | 94.9 | 87.8 | 65.4 | 88.2 | 92.3 | 87.0 |
| Llama 3.1 405B | Meta | 405B | 2024-07 | 88.6 | 89.0 | 95.1 | 88.0 | 64.2 | 86.7 | 93.0 | 86.4 |
| Gemini 2.0 Flash | Google | ? | 2024-12 | 87.8 | 88.5 | 93.4 | 87.9 | 63.5 | 87.9 | 92.7 | 86.0 |
| GPT-4o | OpenAI | ? | 2024-05 | 88.7 | 90.2 | 92.0 | 87.1 | 61.1 | 87.5 | 93.3 | 85.7 |
| Llama 3.3 70B | Meta | 70B | 2024-12 | 86.4 | 88.2 | 93.7 | 87.2 | 63.8 | 87.1 | 91.5 | 85.4 |
| Qwen 2.5 72B | Alibaba | 72B | 2024-11 | 86.2 | 87.9 | 91.8 | 86.7 | 60.2 | 86.1 | 90.4 | 84.2 |
| Gemini 1.5 Pro | Google | ? | 2024-05 | 85.9 | 84.1 | 91.7 | 86.5 | 59.2 | 86.7 | 91.8 | 83.7 |
| Mistral Large 2 | Mistral AI | 123B | 2024-07 | 84.0 | 83.5 | 89.2 | 85.7 | 60.3 | 84.8 | 88.5 | 82.3 |
| Command R+ | Cohere | 104B | 2024-04 | 79.5 | 75.6 | 82.1 | 83.2 | 56.1 | 82.9 | 85.7 | 77.9 |
| Phi-3-mini | Microsoft | 3.8B | 2025-06 | 68.8 | 61.2 | 73.5 | 75.2 | 55.3 | 73.8 | 77.5 | 69.3 |
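The "narrowing gap" note at the top of the page can be made concrete directly from the table. The Python sketch below compares the top two rows, with values copied verbatim from the GPT-5 and Claude 4.1 rows; the 0.3-8.1 point range quoted earlier presumably reflects comparisons across a broader set of model pairs.

```python
# Per-benchmark gap between two rows of the table above.
# Scores are copied verbatim from the GPT-5 and Claude 4.1 rows;
# any other pair of rows can be compared the same way.
BENCHMARKS = ["MMLU", "HumanEval", "GSM8K", "HellaSwag",
              "TruthfulQA", "WinoGrande", "ARC"]

gpt5      = [94.8, 98.2, 99.1, 92.7, 78.3, 93.5, 97.4]
claude_41 = [93.6, 97.5, 98.7, 91.8, 76.9, 92.4, 96.8]

for name, a, b in zip(BENCHMARKS, gpt5, claude_41):
    print(f"{name:<11} gap = {a - b:.1f} points")
```

For this particular pair the gaps run from 0.4 points (GSM8K) to 1.4 points (TruthfulQA), consistent with the single-digit spread described above.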
Top Performer
GPT-5 by OpenAI leads with an overall score of 93.4
Model Statistics
Total Models: 18
Average Score: 86.1
Latest Update: August 2025
Key AI Insights (2025)
Breaking Performance Barriers
GPT-5 and Claude 4.1 have pushed AI capabilities to new heights: GPT-5 achieves 98.2% on HumanEval and 99.1% on GSM8K, while Claude 4.1 follows closely with 97.5% and 98.7%, respectively. These models are approaching near-perfect performance on standard benchmarks.
Model Efficiency Revolution
Dramatic improvements in AI efficiency: Microsoft's Phi-3-mini (3.8B parameters) now scores above 60% on MMLU, a threshold that in 2022 required PaLM's 540B parameters - a 142-fold reduction in model size for the same level of performance.
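The 142-fold figure follows directly from the two parameter counts quoted above; a quick, self-contained check:

```python
# Parameter counts quoted above for models clearing the ~60% MMLU threshold.
palm_params = 540e9   # PaLM (2022)
phi3_params = 3.8e9   # Phi-3-mini (2024)

print(f"Size reduction: {palm_params / phi3_params:.0f}x")  # -> 142x
```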
Cost Reduction
AI inference costs have plummeted: the price of querying a model at GPT-3.5 level dropped from $20.00 per million tokens (Nov 2022) to just $0.07 per million tokens (Oct 2024) - roughly a 280-fold reduction.
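The ratio can likewise be checked from the two prices quoted above. Note that $20.00 / $0.07 works out to about 286; the roughly 280-fold figure likely reflects rounding or slightly different underlying price points.

```python
# Price per million tokens for GPT-3.5-level inference, as quoted above.
price_nov_2022 = 20.00   # USD per million tokens
price_oct_2024 = 0.07    # USD per million tokens

print(f"Cost reduction: {price_nov_2022 / price_oct_2024:.0f}x")  # -> 286x
```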
Global Competition
The US led with 40 notable models to China's 15 in 2024, but Chinese models have rapidly closed the quality gap: performance differences narrowed from double digits in 2023 to near parity in 2024.
AI Agents Show Promise
On short time-horizon tasks (a 2-hour budget), AI systems score roughly 4x higher than human experts. When the budget is extended to 32 hours, however, humans still outperform AI by about 2-to-1.
Understanding the Benchmarks
These benchmarks represent standardized tests used to evaluate language model capabilities across different domains:
- MMLU: Tests broad knowledge across humanities, sciences, and more
- HumanEval: Measures ability to write correct Python code from descriptions
- GSM8K: Evaluates mathematical reasoning with word problems
- HellaSwag: Tests commonsense reasoning about everyday physical situations through sentence completion
- TruthfulQA: Assesses truthfulness and resistance to common misconceptions
- WinoGrande: Evaluates common sense reasoning through pronoun resolution
- ARC: Tests scientific reasoning with grade-school science questions
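Most of these benchmarks (MMLU, HellaSwag, WinoGrande, ARC, and the multiple-choice form of TruthfulQA) are scored as plain accuracy over multiple-choice items, while HumanEval runs generated code against unit tests and GSM8K checks the final numeric answer. The sketch below covers only the multiple-choice case; the item format and the `model_answer` callable are illustrative placeholders, not any lab's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCItem:
    question: str
    choices: List[str]   # e.g. ["A ...", "B ...", "C ...", "D ..."]
    answer: int          # index of the correct choice

def accuracy(items: List[MCItem], model_answer: Callable[[MCItem], int]) -> float:
    """Percentage of items where the model picks the correct choice index."""
    correct = sum(1 for item in items if model_answer(item) == item.answer)
    return 100.0 * correct / len(items)

# Toy usage with a single hypothetical item and a trivial "model".
items = [MCItem("2 + 2 = ?", ["3", "4", "5", "22"], answer=1)]
print(accuracy(items, model_answer=lambda item: 1))  # 100.0
```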
Methodology
Overall scores are calculated as a weighted average of individual benchmark performances. Scores are updated based on publicly available results from official sources and research papers. Note that different evaluation methodologies (zero-shot vs few-shot) may affect comparability.
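The benchmark weights are not published on this page, but an unweighted mean of the seven benchmark columns reproduces the listed Overall scores (for GPT-5: (94.8 + 98.2 + 99.1 + 92.7 + 78.3 + 93.5 + 97.4) / 7 ≈ 93.4), so the sketch below defaults to equal weights; substitute a weights dictionary if a different scheme is intended.

```python
from typing import Dict, Optional

def overall_score(scores: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-benchmark scores; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

gpt5 = {"MMLU": 94.8, "HumanEval": 98.2, "GSM8K": 99.1, "HellaSwag": 92.7,
        "TruthfulQA": 78.3, "WinoGrande": 93.5, "ARC": 97.4}
print(round(overall_score(gpt5), 1))  # -> 93.4, matching the table's Overall column
```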