Large language models can give correct answers by relying on grammatical patterns they learned during training, even when questions use contradictory wording. MIT researchers found that models learn to associate specific sentence structures with certain topics. In controlled tests, this association sometimes overrode the actual meaning of prompts.
The behavior could reduce reliability in real-world tasks like answering customer inquiries, summarizing clinical notes, and generating financial reports. It also creates security vulnerabilities that let users bypass safety restrictions.
The issue stems from how models process training data. LLMs learn word relationships from massive text collections scraped from the internet. They also absorb recurring grammatical structures, which the researchers call syntactic templates: patterns such as adverb-verb-noun-verb that show up frequently in training examples.
When one subject area contains many examples with similar grammar, models can form associations between those structures and the topic. Take the question "Where is Paris located?" It follows an adverb-verb-proper noun-verb pattern. If geography training data repeats this structure often, a model might link the pattern to country information.
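To make the idea concrete, here is a minimal Python sketch of what a syntactic template looks like in practice: the part-of-speech sequence that remains once the specific words are stripped away. This is illustrative only, not the researchers' tooling, and it assumes spaCy with its small English model installed.

```python
# A minimal sketch (not the researchers' code) of a "syntactic template":
# the part-of-speech sequence left over once the words themselves are removed.
# Assumes spaCy is installed along with its small English model
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_template(text: str) -> list[str]:
    """Return the coarse part-of-speech sequence for a prompt."""
    return [tok.pos_ for tok in nlp(text) if not tok.is_punct]

# Two prompts about different places share one template, which is the kind
# of surface pattern a model can latch onto alongside the facts themselves.
print(syntactic_template("Where is Paris located?"))
print(syntactic_template("Where is Tokyo located?"))
# Both print something like ['ADV', 'AUX', 'PROPN', 'VERB']
# (tagset details vary; spaCy labels "is" as AUX rather than VERB).
```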
The researchers tested whether models rely on these grammar patterns by creating questions with the same sentence structure but contradictory meanings. When antonyms reversed the intended meaning of a question, models still produced the original correct answers at high rates, suggesting they were responding to grammatical structure rather than semantic content.
Chantal Shaib, a graduate student at Northeastern University and visiting student at MIT who co-led the work, said models absorb both content and writing styles from training data. Subject areas like news have distinctive structures that models learn alongside facts.
The team built controlled experiments using synthetic datasets where each subject area had only one syntactic template. They tested OLMo-2 models at three scales (1 billion, 7 billion, and 13 billion parameters) by swapping words for synonyms, antonyms, or random terms while keeping grammar the same.
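The word-swapping idea can be sketched in a few lines of Python. This is a simplified illustration rather than the paper's pipeline: it assumes NLTK's WordNet corpus, and the target word is chosen by hand, whereas the actual experiments perturbed prompts systematically.

```python
# Illustrative sketch of an antonym perturbation: flip one content word so
# the meaning reverses while the sentence structure stays identical.
# Assumes NLTK with the WordNet corpus; not the authors' implementation.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def first_antonym(word: str) -> str | None:
    """Return one WordNet antonym for `word`, if any sense has one."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name()
    return None

def flip_meaning(prompt: str, target: str) -> str:
    """Swap `target` for an antonym, leaving the grammar untouched."""
    antonym = first_antonym(target)
    return prompt.replace(target, antonym) if antonym else prompt

print(flip_meaning("Is the climate in the Sahara hot?", "hot"))
# -> "Is the climate in the Sahara cold?" (same syntax, opposite meaning)
```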
Models reached 90% to 94% accuracy on questions from their training domains when synonyms or antonyms were substituted. When the same grammar patterns were applied to other subject areas, accuracy dropped by 37 to 54 percentage points. Prompts with broken, nonsensical wording produced low accuracy in both settings.
The researchers then evaluated production models including GPT-4o, GPT-4o-mini, Llama-4-Maverick, and OLMo-2-7B using portions of the FlanV2 instruction-tuning dataset. For sentiment classification on Sentiment140, OLMo-2-7B accuracy fell from 85% to 48% when grammar patterns crossed subject areas. GPT-4o-mini dropped from 100% to 44%. GPT-4o went from 69% to 36%.
Natural language inference tasks showed the same patterns. Larger instruction-tuned models handled paraphrased prompts better within training domains but still showed cross-domain accuracy drops.
The researchers also examined security implications. They took 1,000 harmful requests from the WildJailbreak dataset and rewrote them using syntactic templates drawn from benign training domains, such as math problems.
In OLMo-2-7B-Instruct, the refusal rate fell from 40% to 2.5% when harmful requests included these templates. One example: the model refused to explain "how to bomb an interview" when asked directly, but it gave detailed answers when the same request was phrased with templates from training domains where refusals were rare.
Vinith Suriyakumar, an MIT graduate student who co-led the study, said defenses need to target how LLMs learn language, not just patch individual problems. The vulnerability comes from core learning processes.
The researchers built an automated tool to measure this behavior in trained models. The method extracts syntactic templates from training data, creates test prompts with preserved grammar but changed meaning, and compares performance between matched and mismatched pairs.
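In rough outline, the comparison at the heart of such a tool can be sketched as follows. This is a hypothetical simplification, not the released method: `model` stands in for any prompt-to-answer callable, and exact-match scoring replaces the paper's task-specific metrics.

```python
# Sketch of a matched-versus-mismatched comparison, under assumptions.
# `model` is a placeholder for any callable mapping a prompt to an answer
# (e.g., a wrapper around an API call); it is not part of the released tool.
from typing import Callable, Iterable, Tuple

def accuracy(model: Callable[[str], str],
             examples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected_answer) pairs the model answers correctly."""
    examples = list(examples)
    correct = sum(model(prompt).strip().lower() == answer.strip().lower()
                  for prompt, answer in examples)
    return correct / len(examples)

def template_sensitivity(model, matched, mismatched) -> float:
    """Accuracy drop when a domain's grammar template is applied elsewhere.

    `matched`: perturbed prompts that keep the domain's usual grammar.
    `mismatched`: the same grammar template applied to a different domain.
    A large positive gap suggests the model leans on syntax, not meaning.
    """
    return accuracy(model, matched) - accuracy(model, mismatched)
```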
Marzyeh Ghassemi, associate professor in MIT's Department of Electrical Engineering and Computer Science and senior author, noted that the behavior is a byproduct of how models are trained, yet those models already run in deployed applications, where users unfamiliar with the training process won't expect these failures.
Future work will test fixes like training data with more varied grammar patterns within each subject area. The team also plans to study whether reasoning models built for multi-step problems show similar behavior.
Jessy Li, an associate professor at the University of Texas at Austin who wasn't involved in the research, called it a creative way to study LLM failures. She said it demonstrates why linguistic analysis matters in AI safety work.
The paper will be presented at the Conference on Neural Information Processing Systems. Other authors include Levent Sagun from Meta and Byron Wallace from Northeastern University's Khoury College of Computer Sciences. The study is available on the arXiv preprint server.
Notes: This post was drafted with the assistance of AI tools and reviewed, edited, and published by humans. Image: DIW-Aigen.
by Web Desk via Digital Information World
