Friday, April 11, 2025

Top AI Models Fail Simple Debugging Test — Human Coders Still Reign Supreme

According to a new study from Microsoft Research, AI models still struggle to fix software bugs that skilled developers handle with ease. AI is now widely used across the industry, with companies like Google and Meta applying it to programming and coding tasks. Yet when it comes to debugging, models such as OpenAI’s o3-mini and Anthropic’s Claude 3.7 Sonnet fall short on SWE-bench Lite, a benchmark built from real-world software fixes. The results suggest that AI models are still far from replacing human programmers and developers.

The authors gave nine AI models 300 software debugging tasks from SWE-bench Lite, and even the strongest and most recent models could not complete half of them. Claude 3.7 Sonnet was the best performer with a 48.4% success rate, followed by o1 at 30.2% and o3-mini at 22.1%.
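As a rough illustration of what these percentages mean in absolute terms, the reported rates can be converted back to approximate task counts. This is a back-of-the-envelope sketch: the per-model task counts below are derived from the article's percentages and the 300-task total, not reported directly by the study.

```python
# Convert the reported SWE-bench Lite success rates into approximate
# numbers of tasks resolved, out of the 300-task benchmark set.
TOTAL_TASKS = 300

# Success rates as stated in the article (percent).
reported_rates = {
    "Claude 3.7 Sonnet": 48.4,
    "o1": 30.2,
    "o3-mini": 22.1,
}

for model, rate in reported_rates.items():
    solved = round(TOTAL_TASKS * rate / 100)
    print(f"{model}: ~{solved} of {TOTAL_TASKS} tasks resolved ({rate}%)")
```

Even the top score, roughly 145 of 300 tasks, leaves more than half the benchmark unsolved, which is the gap the authors highlight.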

These results led the authors and other experts to ask why the models perform so poorly. The researchers point to a lack of suitable training data: the models have seen few real examples of how humans actually debug software. The authors suggest that improving performance will require training on specialized, detailed data that captures this debugging process. Many earlier studies have already found logic errors and security flaws in AI-generated code.

Read next: Greenpeace Study Reveals an Increase in Global Emissions Because of Production of AI Chips
by Arooj Ahmed via Digital Information World
