Monday, December 1, 2025

Which AI Models Answer Most Accurately, and Which Hallucinate Most? New Data Shows Clear Gaps

Recent findings from the European Broadcasting Union show that AI assistants misrepresent news content in 45 percent of test cases, regardless of language or region. That result underscores why model accuracy and reliability remain central concerns. Fresh rankings from Artificial Analysis, based on real-world endpoint testing as of 1 December 2025, give a clear picture of how today’s leading systems perform when answering direct questions.

Measuring Accuracy and Hallucination Rates

Artificial Analysis evaluates both proprietary and open-weight models through live API endpoints, so its measurements reflect what users experience in actual deployments rather than theoretical performance. Accuracy shows how often a model produces a correct answer across all test questions. Hallucination rate captures how often a model gives a wrong answer in situations where it should refuse or signal uncertainty, rather than simply how often it is wrong overall. Since new models launch frequently and providers adjust endpoints, these results can change over time, but the current snapshot still reveals clear trends.
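To make those two definitions concrete, here is a minimal sketch of how such metrics could be computed from a set of graded responses. The labels, the toy data, and the exact hallucination formula (wrong answers as a share of the questions the model failed to answer correctly) are illustrative assumptions, not Artificial Analysis's published methodology.

```python
# Illustrative only: computes accuracy and hallucination rate from graded answers.
# The grading labels and the hallucination formula are assumptions based on the
# definitions described above, not Artificial Analysis's actual pipeline.

# Each response is graded as "correct", "incorrect", or "declined"
# ("declined" covers refusals and explicit statements of uncertainty).
responses = [
    "correct", "incorrect", "declined", "correct", "incorrect",
    "incorrect", "declined", "correct", "incorrect", "correct",
]

total = len(responses)
correct = responses.count("correct")
incorrect = responses.count("incorrect")
declined = responses.count("declined")

# Accuracy: share of all questions answered correctly.
accuracy = correct / total

# Hallucination rate: of the questions the model did not get right,
# how often it gave a wrong answer instead of refusing or hedging.
not_correct = incorrect + declined
hallucination_rate = incorrect / not_correct if not_correct else 0.0

print(f"Accuracy: {accuracy:.0%}")                       # 40% in this toy example
print(f"Hallucination rate: {hallucination_rate:.0%}")   # 67% in this toy example
```

Under this framing, a model can score well on accuracy yet still post a high hallucination rate if, whenever it does not know the answer, it guesses instead of declining.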

Models With the Highest Hallucination Rates

Hallucination Metrics Expose Deep Reliability Risks in Current AI Assistant Deployments
Model | Hallucination Rate
Claude 4.5 Haiku | 26%
Claude 4.5 Sonnet | 48%
GPT-5.1 (high) | 51%
Claude Opus 4.5 | 58%
Magistral Medium 1.2 | 60%
Grok 4 | 64%
Kimi K2 0905 | 69%
Grok 4.1 Fast | 72%
Kimi K2 Thinking | 74%
Llama Nemotron Super 49B v1.5 | 76%
DeepSeek V3.2 Exp | 81%
DeepSeek R1 0528 | 83%
EXAONE 4.0 32B | 86%
Llama 4 Maverick | 87.58%
Gemini 3 Pro Preview (high) | 87.99%
Gemini 2.5 Flash (Sep) | 88.31%
Gemini 2.5 Pro | 88.57%
MiniMax-M2 | 88.88%
GPT-5.1 | 89.17%
Qwen3 235B A22B 2507 | 89.64%
gpt-oss-120B (high) | 89.96%
GLM-4.6 | 93.09%
gpt-oss-20B (high) | 93.20%

When it comes to hallucination, the gap between models is striking. Claude 4.5 Haiku has the lowest hallucination rate in this group at 26 percent, yet even that figure means it still invents an answer in roughly a quarter of the cases where it should hold back. Several models climb sharply from there: Claude 4.5 Sonnet reaches 48 percent, GPT-5.1 (high) 51 percent, and Claude Opus 4.5 58 percent. Grok 4 hallucinates in 64 percent of such cases, and Kimi K2 0905 rises to 69 percent.

Beyond these, models enter the seventies and eighties. Grok 4.1 Fast shows a 72 percent rate, Kimi K2 Thinking 74 percent, and Llama Nemotron Super 49B v1.5 76 percent. The DeepSeek models run even higher, with V3.2 Exp at 81 percent and R1 0528 at 83 percent. Among the worst performers are EXAONE 4.0 32B at 86 percent, Llama 4 Maverick at 87.58 percent, and a cluster of Gemini models, with 3 Pro Preview (high) at 87.99 percent and both 2.5 Flash (Sep) and 2.5 Pro above 88 percent. GLM-4.6 and gpt-oss-20B (high) top the chart at over 93 percent. This spread demonstrates that while some models are relatively restrained, many generate incorrect answers far more often than they admit uncertainty, making hallucination a major challenge for AI systems today.

Top Performers in Accuracy

Testing Reveals Limited Accuracy Gains Despite Rapid Deployment of Advanced AI Systems
Model | Accuracy
Gemini 3 Pro Preview (high) | 54%
Claude Opus 4.5 | 43%
Grok 4 | 40%
Gemini 2.5 Pro | 37%
GPT-5.1 (high) | 35%
Claude 4.5 Sonnet | 31%
DeepSeek R1 0528 | 29.28%
Kimi K2 Thinking | 29.23%
GPT-5.1 | 28%
Gemini 2.5 Flash (Sep) | 27%
DeepSeek V3.2 Exp | 27%
GLM-4.6 | 25%
Kimi K2 0905 | 24%
Llama 4 Maverick | 24%
Grok 4.1 Fast | 23.50%
Qwen3 235B A22B 2507 | 22%
MiniMax-M2 | 21%
Magistral Medium 1.2 | 20%
gpt-oss-120B (high) | 20%
Claude 4.5 Haiku | 16%
Llama Nemotron Super 49B v1.5 | 16%
gpt-oss-20B (high) | 15%
EXAONE 4.0 32B | 13%

Accuracy presents a different picture. Gemini 3 Pro Preview (high) leads the pack at 54 percent, meaning it correctly answers just over half of all questions, followed by Claude Opus 4.5 at 43 percent and Grok 4 at 40 percent. Gemini 2.5 Pro comes next with 37 percent, while GPT-5.1 (high) reaches 35 percent and Claude 4.5 Sonnet 31 percent. A cluster of models then falls into the upper to mid-twenties: DeepSeek R1 0528 at 29.28 percent, Kimi K2 Thinking at 29.23 percent, GPT-5.1 at 28 percent, and both Gemini 2.5 Flash (Sep) and DeepSeek V3.2 Exp at 27 percent. The remaining models descend through GLM-4.6 at 25 percent and Kimi K2 0905 and Llama 4 Maverick at 24 percent, down to Claude 4.5 Haiku and Llama Nemotron Super 49B v1.5 at 16 percent, gpt-oss-20B (high) at 15 percent, and EXAONE 4.0 32B at 13 percent. The spread highlights that even the top-performing model answers fewer than six out of ten questions correctly, underscoring how hard it remains for AI systems to deliver consistently reliable responses across a broad set of prompts.

Clear Trade-offs

The contrast between the hallucination and accuracy charts shows that strong accuracy does not guarantee low hallucination. Gemini 3 Pro Preview (high), for example, leads on accuracy yet posts one of the highest hallucination rates, while Claude 4.5 Haiku sits near the bottom on accuracy but hallucinates the least. Other models deliver middling accuracy yet avoid the highest hallucination levels. These gaps illustrate how unpredictable model behavior remains, even as systems improve.
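To see that trade-off side by side, the short sketch below pairs a handful of models using the accuracy and hallucination figures quoted in the charts above. The selection of models and the dictionary layout are illustrative choices, not a complete reproduction of the data.

```python
# Illustrative pairing of accuracy and hallucination figures taken from the
# two charts above (Artificial Analysis, 1 December 2025 snapshot).
# The subset of models shown here is arbitrary.
scores = {
    #  model                        (accuracy %, hallucination %)
    "Gemini 3 Pro Preview (high)":  (54, 87.99),
    "Claude Opus 4.5":              (43, 58),
    "Grok 4":                       (40, 64),
    "GPT-5.1 (high)":               (35, 51),
    "Claude 4.5 Haiku":             (16, 26),
}

# Sort by accuracy (descending) and note that the hallucination column does not
# follow the same order: the most accurate model in this subset also hallucinates
# the most, while the least accurate hallucinates the least.
for model, (acc, hall) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(f"{model:<30} accuracy {acc:>5.1f}%   hallucination {hall:>5.1f}%")
```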

Read next: ChatGPT Doubles Usage as Google Gemini Reaches 40 Percent


by Irfan Ahmad via Digital Information World
