Thursday, August 7, 2025

AI Language Models Show Brain-Like Understanding of Visual Scenes

A group of researchers has shown that advanced language models can mirror how the human brain interprets what it sees. The study combined high-resolution brain scans with machine-learning models to examine how people understand complex, real-world scenes, revealing patterns that link brain activity to the outputs of artificial intelligence systems.

Participants viewed thousands of images while lying in a 7-Tesla MRI scanner. The images came from a public photo database and included varied situations such as street views, people at work, animals in their habitats, and objects in familiar places. Each picture had several human-written captions describing its content.

Matching AI to Brain Signals

The captions were processed through MPNet, a transformer-based language model designed to turn sentences into compact numerical representations called embeddings. These embeddings were compared with brain activity patterns using a method known as representational similarity analysis. In higher-level visual areas, the AI-generated patterns aligned closely with the human brain’s responses.
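A rough illustration of that pipeline is sketched below, assuming the publicly available "all-mpnet-base-v2" checkpoint from the sentence-transformers library as a stand-in for the study's MPNet encoder, and random numbers in place of real fMRI responses; the study's exact preprocessing and analysis choices are not detailed in this article.

```python
# Minimal sketch: caption embeddings plus representational similarity analysis (RSA).
# Assumptions: "all-mpnet-base-v2" stands in for the MPNet variant used in the study,
# and brain_responses is a hypothetical (n_images x n_voxels) array for one region.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

captions = [
    "A man rides a bicycle down a busy city street.",
    "Two dogs play with a ball in a grassy park.",
    "A bowl of fruit sits on a wooden kitchen table.",
]

# 1. Turn each caption into a compact numerical embedding.
model = SentenceTransformer("all-mpnet-base-v2")
caption_embeddings = model.encode(captions)          # shape: (n_images, 768)

# 2. Build representational dissimilarity matrices (RDMs):
#    pairwise distances between images in embedding space and in brain space.
model_rdm = pdist(caption_embeddings, metric="cosine")

rng = np.random.default_rng(0)
brain_responses = rng.standard_normal((len(captions), 5000))  # placeholder fMRI data
brain_rdm = pdist(brain_responses, metric="correlation")

# 3. RSA: correlate the two RDMs to measure model-brain alignment.
rsa_score, _ = spearmanr(model_rdm, brain_rdm)
print(f"RSA alignment (Spearman rho): {rsa_score:.3f}")
```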

A second test used the brain data to predict the language model embeddings, then matched these predictions to a large library of captions. This allowed the researchers to reconstruct short textual descriptions of the images people had seen.
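The decoding step could be set up roughly as in the sketch below, assuming a simple ridge-regression mapping from voxels to embedding space and cosine similarity for retrieval; both are illustrative choices, since the article does not specify the decoder used in the study.

```python
# Sketch of the decoding test: map brain activity to predicted caption embeddings,
# then retrieve the nearest caption from a large library. Ridge regression and
# cosine matching are assumptions, not the study's confirmed method.
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

def fit_decoder(brain_train, embed_train, alpha=1000.0):
    """Learn a linear map from voxel responses to caption-embedding space."""
    decoder = Ridge(alpha=alpha)
    decoder.fit(brain_train, embed_train)
    return decoder

def decode_captions(decoder, brain_test, library_embeddings, library_captions):
    """Predict embeddings for held-out scans and return the best-matching captions."""
    predicted = decoder.predict(brain_test)                    # (n_test, embed_dim)
    similarity = cosine_similarity(predicted, library_embeddings)
    best = similarity.argmax(axis=1)
    return [library_captions[i] for i in best]
```

With a caption library embedded offline, the sentence retrieved for each scan serves as the reconstructed textual description of the image the participant saw.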

Importance of Full Context

Models that used only lists of objects, single words, or limited parts of speech showed weaker alignment with brain activity. Embeddings created from full sentences performed best, suggesting that combining all the information in a caption is important for reflecting how the brain processes meaning.
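The comparison can be pictured with the small sketch below; the ablated text variants are illustrative assumptions, and each set would be embedded and scored against the same brain data using the RSA approach from the earlier sketch.

```python
# Sketch of the caption-ablation comparison. The variant texts are illustrative;
# full sentences are expected to align best with brain activity, as the study reports.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

variants = {
    "full_sentence": ["A man rides a bicycle down a busy city street.",
                      "Two dogs play with a ball in a grassy park.",
                      "A bowl of fruit sits on a wooden kitchen table."],
    "nouns_only":    ["man bicycle city street",
                      "dogs ball park",
                      "bowl fruit kitchen table"],
    "single_word":   ["bicycle", "dogs", "fruit"],
}

model = SentenceTransformer("all-mpnet-base-v2")
rng = np.random.default_rng(0)
brain_rdm = pdist(rng.standard_normal((3, 5000)), metric="correlation")  # placeholder fMRI RDM

for name, texts in variants.items():
    model_rdm = pdist(model.encode(texts), metric="cosine")
    rho, _ = spearmanr(model_rdm, brain_rdm)
    print(f"{name:14s} alignment (Spearman rho): {rho:.3f}")
```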

Building Vision Models from Language

The team also trained recurrent convolutional neural networks to take images as input and predict the language model embeddings for their captions. These networks matched brain responses more closely than many top computer vision systems, despite being trained on far fewer images. When directly compared to otherwise identical networks trained only to classify objects, the language-trained versions produced richer internal representations that explained brain activity better.
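A minimal sketch of what such a language-supervised vision model could look like is shown below, assuming a small PyTorch network with a feedback loop in each convolutional block and a cosine loss against MPNet caption embeddings; the study's actual architecture, scale, and training setup are not detailed in this article.

```python
# Sketch: a recurrent convolutional network that maps an image to the 768-dimensional
# embedding of its caption. Architecture, sizes, and loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentConvBlock(nn.Module):
    """A convolutional block whose output is fed back as input for several steps."""
    def __init__(self, in_ch, out_ch, steps=4):
        super().__init__()
        self.steps = steps
        self.feedforward = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.recurrent = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        drive = self.feedforward(x)
        state = torch.zeros_like(drive)
        for _ in range(self.steps):
            state = F.relu(self.norm(drive + self.recurrent(state)))
        return state

class CaptionEmbeddingNet(nn.Module):
    """Image -> predicted caption embedding (768-d, matching the MPNet encoder)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.blocks = nn.Sequential(
            RecurrentConvBlock(3, 32), nn.MaxPool2d(2),
            RecurrentConvBlock(32, 64), nn.MaxPool2d(2),
            RecurrentConvBlock(64, 128), nn.AdaptiveAvgPool2d(1),
        )
        self.readout = nn.Linear(128, embed_dim)

    def forward(self, images):
        features = self.blocks(images).flatten(1)
        return self.readout(features)

# Training objective: push predicted embeddings toward the captions' embeddings.
model = CaptionEmbeddingNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)          # placeholder image batch
target_embeddings = torch.randn(8, 768)       # placeholder caption embeddings

predicted = model(images)
loss = 1.0 - F.cosine_similarity(predicted, target_embeddings).mean()
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```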

Interpreting the Results

The results suggest that the brain may convert visual information into a high-dimensional form similar to how a language model encodes text meaning. This process appears to keep both object details and the wider relationships between elements in a scene.

Future Use

The researchers say this approach could help design AI systems that see and interpret more like people do. It may also lead to improvements in brain–computer interfaces and visual aids for people with sight loss. The work offers a possible common framework for studying complex meaning in the brain by connecting insights from vision research, computational modelling, and language processing.



by Asim BN via Digital Information World
