Mr Branding: AI Models Struggle with Historical Accuracy, GPT-4 Turbo Only Scores 46%

Friday, January 24, 2025

According to a new study, many AI models answer questions about world history inaccurately, which is a cause for concern. The researchers built a benchmark of questions from the Seshat Global History Databank and found that GPT-4 Turbo scored 46% on the test, which is better than random guessing but well short of expert level. To build the test, the team converted the databank's records into multiple-choice questions about different historical features.

Seven AI models, including Llama, GPT-3.5, Gemini and GPT-4 Turbo, were tested. Each was asked to act as an expert historian so that its strengths and weaknesses could be evaluated and improvements suggested. The researchers also defined an accuracy scale on which random guessing scores 25% and perfectly accurate answers score 100%. The models were also evaluated separately on answers backed by evidence and on answers that rest on inference.

GPT-4 Turbo was the best-performing model with a score of 43.8%, but it could not answer at an expert level. In a two-choice test where the answer was either ‘present’ or ‘absent’, GPT-4 Turbo scored 63.2%, which suggests that it can handle basic factual questions but struggles with complex historical ones. The study also broke down the models’ performance by region, time period and type of historical data. The models did better on questions about earlier periods, such as before 3000 BCE, but struggled with questions about more recent periods, where societies are more complex. They also performed better on questions about the Americas and worse on questions about Oceania and Sub-Saharan Africa.
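
To put the two scores above on a comparable footing, here is a minimal sketch that rescales each one so that random guessing maps to 0 and a perfect score maps to 1. The rescaling and the helper names `chance_level` and `above_chance` are illustrative assumptions, not something taken from the study; the sketch also assumes that guessing is uniform, so the two-choice test has a 50% chance level in the same way the four-choice test has the 25% level mentioned earlier.

```python
def chance_level(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing (an assumption of this sketch)."""
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    return 1.0 / num_choices


def above_chance(score: float, num_choices: int) -> float:
    """Rescale a score so that chance maps to 0.0 and a perfect score maps to 1.0."""
    chance = chance_level(num_choices)
    return (score - chance) / (1.0 - chance)


# Scores reported in the article for the best-performing model.
four_choice = above_chance(0.438, num_choices=4)  # four-choice test, 25% chance level
two_choice = above_chance(0.632, num_choices=2)   # 'present'/'absent' test, assumed 50% chance level

print(f"four-choice: {four_choice:.3f}")
print(f"two-choice:  {two_choice:.3f}")
```

Under these assumptions both scores land roughly a quarter of the way from guessing to perfect, so the higher raw figure on the two-choice test partly reflects its higher chance level.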

The study has limitations too: the Seshat Databank is in English and is biased towards well-documented societies, and only a limited set of AI models was tested. The study shows that AI still has a long way to go in answering questions about history, and that more unbiased and inclusive training data is needed for AI to discuss global history more accurately.

Image: DIW-Aigen

Source: NeurIPS

by Arooj Ahmed via Digital Information World
