Recent findings from the European Broadcasting Union show that AI assistants misrepresent news content in 45% of the test cases, regardless of language or region. That result underscores why model accuracy and reliability remain central concerns. Fresh rankings from Artificial Analysis, based on real-world endpoint testing as of 1 December 2025, give a clear picture of how today’s leading systems perform when answering direct questions.
Measuring Accuracy and Hallucination Rates
Artificial Analysis evaluates both proprietary and open-weights models through live API endpoints, so its measurements reflect what users experience in actual deployments rather than theoretical performance. Accuracy shows how often a model produces a correct answer. Hallucination rate captures how often it responds incorrectly in cases where it should refuse or indicate uncertainty. Since new models launch frequently and providers adjust endpoints, these results can change over time, but the current snapshot still reveals clear trends.
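To make those two metrics concrete, here is a minimal sketch of how they can be computed from graded responses. The three-way grading scheme and the function name are illustrative assumptions, not Artificial Analysis's actual pipeline; the sketch simply follows the definitions above, reading hallucination rate as the share of wrong answers among the questions a model does not get right.

```python
from collections import Counter

def score_responses(grades: list[str]) -> dict[str, float]:
    """Compute accuracy and hallucination rate from graded answers.

    Each grade is one of (an assumed scheme, for illustration only):
      "correct"   - the model answered correctly
      "incorrect" - the model answered, but the answer was wrong
      "declined"  - the model refused or signalled uncertainty
    """
    counts = Counter(grades)
    total = len(grades)
    # Accuracy: correct answers as a share of all questions.
    accuracy = counts["correct"] / total if total else 0.0
    # Hallucination rate: among questions the model did not get right,
    # how often it gave a wrong answer instead of declining.
    not_correct = counts["incorrect"] + counts["declined"]
    hallucination = counts["incorrect"] / not_correct if not_correct else 0.0
    return {"accuracy": accuracy, "hallucination_rate": hallucination}

# Hypothetical run: 4 correct, 5 wrong, 1 refusal out of 10 questions.
print(score_responses(["correct"] * 4 + ["incorrect"] * 5 + ["declined"]))
# -> {'accuracy': 0.4, 'hallucination_rate': 0.8333...}
```

Under this reading, a model can score well on accuracy and still hallucinate heavily, because the two metrics divide by different denominators.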
Hallucination Rates From Lowest to Highest
| Model | Hallucination Rate |
|---|---|
| Claude 4.5 Haiku | 26% |
| Claude 4.5 Sonnet | 48% |
| GPT-5.1 (high) | 51% |
| Claude Opus 4.5 | 58% |
| Magistral Medium 1.2 | 60% |
| Grok 4 | 64% |
| Kimi K2 0905 | 69% |
| Grok 4.1 Fast | 72% |
| Kimi K2 Thinking | 74% |
| Llama Nemotron Super 49B v1.5 | 76% |
| DeepSeek V3.2 Exp | 81% |
| DeepSeek R1 0528 | 83% |
| EXAONE 4.0 32B | 86% |
| Llama 4 Maverick | 87.58% |
| Gemini 3 Pro Preview (high) | 87.99% |
| Gemini 2.5 Flash (Sep) | 88.31% |
| Gemini 2.5 Pro | 88.57% |
| MiniMax-M2 | 88.88% |
| GPT-5.1 | 89.17% |
| Qwen3 235B A22B 2507 | 89.64% |
| gpt-oss-120B (high) | 89.96% |
| GLM-4.6 | 93.09% |
| gpt-oss-20B (high) | 93.20% |
When it comes to hallucination, the gap between models is striking. Claude 4.5 Haiku has the lowest hallucination rate in this group at 26 percent, yet even that figure means it answers wrongly in roughly one of every four cases where it should decline. Several models climb sharply from there. Claude 4.5 Sonnet reaches 48 percent, GPT-5.1 (high) 51 percent, and Claude Opus 4.5 58 percent. Grok 4 produces incorrect responses in 64 percent of such cases, and Kimi K2 0905 rises to 69 percent.

Beyond these, models enter the seventies and eighties. Grok 4.1 Fast shows a 72 percent rate, Kimi K2 Thinking 74 percent, and Llama Nemotron Super 49B v1.5 76 percent. The DeepSeek models score even higher, with V3.2 Exp at 81 percent and R1 0528 at 83 percent. Among the highest are EXAONE 4.0 32B at 86 percent, Llama 4 Maverick at 87.58 percent, and several Gemini models, including 3 Pro Preview (high) and 2.5 Flash (Sep), exceeding 87 percent. GLM-4.6 and gpt-oss-20B (high) top the chart at over 93 percent. This spread demonstrates that while some models are relatively restrained, many produce confident wrong answers far more often than they admit uncertainty, making hallucination a major challenge for today's AI systems.
Top Performers in Accuracy
| Model | Accuracy |
|---|---|
| Gemini 3 Pro Preview (high) | 54% |
| Claude Opus 4.5 | 43% |
| Grok 4 | 40% |
| Gemini 2.5 Pro | 37% |
| GPT-5.1 (high) | 35% |
| Claude 4.5 Sonnet | 31% |
| DeepSeek R1 0528 | 29.28% |
| Kimi K2 Thinking | 29.23% |
| GPT-5.1 | 28% |
| Gemini 2.5 Flash (Sep) | 27% |
| DeepSeek V3.2 Exp | 27% |
| GLM-4.6 | 25% |
| Kimi K2 0905 | 24% |
| Llama 4 Maverick | 24% |
| Grok 4.1 Fast | 23.50% |
| Qwen3 235B A22B 2507 | 22% |
| MiniMax-M2 | 21% |
| Magistral Medium 1.2 | 20% |
| gpt-oss-120B (high) | 20% |
| Claude 4.5 Haiku | 16% |
| Llama Nemotron Super 49B v1.5 | 16% |
| gpt-oss-20B (high) | 15% |
| EXAONE 4.0 32B | 13% |
Accuracy presents a different picture. Gemini 3 Pro Preview (high) leads the pack at 54 percent, meaning it correctly answers just over half of all questions, followed by Claude Opus 4.5 at 43 percent and Grok 4 at 40 percent. Gemini 2.5 Pro comes next with 37 percent, while GPT-5.1 (high) reaches 35 percent and Claude 4.5 Sonnet 31 percent. A cluster of models then falls into the upper to mid-twenties: DeepSeek R1 0528 at 29.28 percent, Kimi K2 Thinking at 29.23 percent, GPT-5.1 at 28 percent, and both Gemini 2.5 Flash (Sep) and DeepSeek V3.2 Exp at 27 percent. The remaining models descend through GLM-4.6 at 25 percent and Kimi K2 0905 and Llama 4 Maverick at 24 percent, down to EXAONE 4.0 32B at just 13 percent. The spread highlights that even the top-performing model answers fewer than six out of ten questions correctly, underscoring how difficult it remains for AI to deliver consistently reliable responses across a broad set of prompts.
Clear Trade-offs
The contrast between the hallucination and accuracy tables shows that strong accuracy does not guarantee low hallucination. Some models near the top of the accuracy ranking still produce incorrect answers at significant rates when they should decline, while others deliver lower accuracy yet avoid the worst hallucination levels. These gaps illustrate how unpredictable model behavior remains, even as systems improve.
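For readers who want to inspect that trade-off directly, the short Python sketch below pairs each model's accuracy with its hallucination rate, using figures transcribed from the two tables above (a representative subset, kept small for brevity), and ranks models by the gap between the two. It is only an illustration of how to read the published numbers together; nothing here re-runs any benchmark.

```python
# Figures transcribed from the two tables above, in percent.
accuracy = {
    "Gemini 3 Pro Preview (high)": 54, "Claude Opus 4.5": 43, "Grok 4": 40,
    "Gemini 2.5 Pro": 37, "GPT-5.1 (high)": 35, "Claude 4.5 Sonnet": 31,
    "Claude 4.5 Haiku": 16,
}
hallucination = {
    "Gemini 3 Pro Preview (high)": 87.99, "Claude Opus 4.5": 58, "Grok 4": 64,
    "Gemini 2.5 Pro": 88.57, "GPT-5.1 (high)": 51, "Claude 4.5 Sonnet": 48,
    "Claude 4.5 Haiku": 26,
}

# Rank models by the gap between the metrics: high accuracy is good,
# high hallucination is bad, so a smaller difference means a more
# balanced profile.
for model in sorted(accuracy, key=lambda m: hallucination[m] - accuracy[m]):
    print(f"{model:30s} accuracy {accuracy[model]:5.1f}%  "
          f"hallucination {hallucination[model]:5.1f}%")
```

Run this way, Claude 4.5 Haiku shows the smallest gap and Gemini 2.5 Pro the largest, which matches the pattern described above: a leader on one metric is not automatically a leader on the other.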
Read next: ChatGPT Doubles Usage as Google Gemini Reaches 40 Percent
by Irfan Ahmad via Digital Information World

