Tuesday, August 26, 2025

Checklist Method Shows Promise for Improving Language Models

A joint team of researchers from Apple and Carnegie Mellon University has proposed a new way to improve how large language models follow instructions, showing that a simple checklist system can outperform traditional reward-based training on several benchmarks.

Moving Beyond Reward Models

Most current models are refined after training with a process known as reinforcement learning from human feedback. In that setup, annotators evaluate model responses with broad judgments such as “good” or “bad,” and these ratings become the guide for fine-tuning. While this approach helps align systems with human expectations, it has well-known limitations. Models can learn to produce text that looks correct on the surface without truly meeting the request, and the reward signals are often too vague to capture the full range of user needs.

The new study suggests that a more structured form of feedback may work better. Instead of relying on a single score, the researchers created instruction-specific checklists that break down requests into a series of concrete yes-or-no items. Each response is then judged against these criteria, and the combined score becomes the basis for reinforcement learning.

Building Checklists at Scale

To test this idea, the team introduced a method called Reinforcement Learning from Checklist Feedback, or RLCF. They built a dataset named WildChecklists, covering 130,000 instructions, by asking a large teacher model to generate both candidate responses and detailed checklists. Each checklist was weighted to reflect the importance of different requirements, and responses were scored with the help of both model-based judges and small verification programs for tasks that could be checked automatically.

This approach means that instead of asking whether an answer is broadly useful, the system evaluates whether specific elements of the instruction are satisfied — for example, whether a translation really appears in Spanish, or whether a generated sentence uses a required keyword. The researchers found that this reduced the chance of reward hacking, where models exploit loopholes in feedback systems without genuinely improving.
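The scoring idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the researchers' implementation: the `ChecklistItem` class, the weights, the 0-to-1 judge scores, and the `contains_keyword` verifier are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    question: str      # yes-or-no requirement taken from the instruction
    weight: float      # relative importance (hypothetical scale)
    verifier: Optional[Callable[[str], bool]] = None  # optional programmatic check


def contains_keyword(keyword: str) -> Callable[[str], bool]:
    """Build a small verification program for a 'must use this keyword' item."""
    def check(response: str) -> bool:
        return keyword.lower() in response.lower()
    return check


def checklist_score(response: str, items: list[ChecklistItem],
                    judge_scores: dict[str, float]) -> float:
    """Weighted average of per-item scores, each in [0, 1].

    Items with a verifier are checked by code; the rest fall back to a
    model-judge score supplied by the caller (stubbed here as a dict).
    """
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        raise ValueError("checklist needs at least one item with non-zero weight")
    weighted = 0.0
    for item in items:
        if item.verifier is not None:
            score = 1.0 if item.verifier(response) else 0.0
        else:
            score = judge_scores.get(item.question, 0.0)
        weighted += item.weight * score
    return weighted / total_weight


items = [
    ChecklistItem("Does the sentence use the word 'harbor'?", 2.0, contains_keyword("harbor")),
    ChecklistItem("Is the tone formal?", 1.0),
]
judge = {"Is the tone formal?": 0.5}  # made-up judge output for illustration
reward = checklist_score("The ship entered the harbor at dawn.", items, judge)
print(reward)
```

In a training loop, a value like `reward` would stand in for the single score a conventional reward model produces. How the study combines judge and verifier outputs in detail is not stated in this article, so the fallback logic here is only one plausible arrangement.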

Benchmark Gains and Trade-offs

The method was tested on five established benchmarks that measure instruction following and general-purpose assistance. Across FollowBench, InFoBench, IFEval, AlpacaEval, and Arena-Hard, RLCF produced consistent gains, including an 8.2% improvement in constraint satisfaction on FollowBench and notable increases in win rates for general conversational tasks. In contrast, traditional reward model approaches showed mixed results, with improvements on some tests but regressions on others.

Importantly, the checklist approach was especially effective for instructions that included multiple constraints, such as style, content, or formatting requirements. By breaking tasks into smaller checks, the system was better at attending to the full prompt rather than focusing on only part of it.

Limitations and Future Directions

Despite these improvements, the researchers highlighted several constraints. The approach relies on a much larger model to act as a teacher for smaller models, which raises questions about efficiency and accessibility. Generating checklist-based judgments is also computationally expensive, though the team showed that sampling fewer scores could cut costs without a large drop in accuracy.


Another limitation is scope: RLCF was designed to improve complex instruction following, not to handle issues of safety or misuse. Reward models and other techniques will still be required for those areas.

Broader Implications

As language models take on a bigger role in everyday digital tasks, their ability to follow multi-step and nuanced instructions becomes increasingly important. The checklist-based method provides a more interpretable and targeted way to measure progress, suggesting that alignment techniques need not be limited to coarse feedback signals.

By showing that a straightforward checklist can guide models more effectively than some of today’s sophisticated reward systems, the study opens a path for future work that combines structured evaluation with scalable reinforcement learning.



by Web Desk via Digital Information World

Musk’s xAI Drags Apple and OpenAI Into Court Over AI Bias Claims

Elon Musk has turned another corner in his fight with OpenAI, this time pulling Apple into the dispute. His company xAI, which also owns the social platform X, filed a lawsuit in Texas accusing the two tech giants of running a setup that sidelines competitors in the chatbot market. The complaint points to Apple’s close partnership with OpenAI and the way its App Store ranks and reviews software.

Grok Left in the Shadows

The complaint centers on Grok, the chatbot built by xAI. Musk’s lawyers argue it doesn’t get a fair chance to reach iPhone users. They say Apple’s store review process slows down rivals, that curated lists spotlight OpenAI’s ChatGPT more often, and that search rankings quietly push Grok down. For a service still trying to gain traction, visibility is everything. The suit claims Apple’s actions cut that off.

Why Prompt Volume Matters

The case isn’t just about screen space. It drills into how chatbots learn. More prompts from users mean more training data. More data means faster improvement. By directing Apple’s massive customer base toward ChatGPT, the argument goes, OpenAI keeps accelerating while Grok is left behind. The complaint ties that gap directly to revenue and innovation, saying fewer prompts don’t just stunt growth, they keep the system weaker than it should be.

Apple’s Hold on Smartphones

There’s a broader point too. Musk’s filing links the issue to Apple’s place in the smartphone market. One Apple executive had acknowledged during another court battle that AI could one day make people less reliant on iPhones. xAI claims Apple knows that risk and is trying to slow it by favoring one partner, OpenAI, and denying access to others who might chip away at its hold on mobile devices.

Requests That Went Nowhere

The lawsuit notes that xAI asked Apple to let Grok plug directly into iOS, in the same way ChatGPT was folded into “Apple Intelligence.” That request, according to the filing, was turned down. Google’s Gemini has been mentioned by Apple leaders as a possible option in the future, yet so far only OpenAI has been granted deep integration.

Pushback From Apple and OpenAI

Apple has rejected claims of bias before, pointing out that its app store hosts thousands of AI apps ranked through algorithms and human editors. OpenAI has dismissed Musk’s repeated complaints as part of a campaign of lawsuits and public attacks stretching back to his exit from the company in 2018.

A Long Rivalry Gets Sharper

For Musk, this isn’t a new fight. He co-founded OpenAI nearly ten years ago, split with the team, and has been clashing with them ever since. He has already sued over OpenAI’s shift from nonprofit ideals to commercial partnerships. Now, with Grok in the market as a direct rival to ChatGPT, the focus has shifted to Apple’s role as gatekeeper. Whether courts agree with Musk that Apple and OpenAI are acting like monopolists is still an open question.


Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.

by Irfan Ahmad via Digital Information World

Monday, August 25, 2025

WhatsApp Adds Option to Leave Voice Message After Missed Calls

WhatsApp has been testing different ways to help people manage calls they miss. Earlier versions introduced reminders that showed up later with the caller’s name, profile picture, and a direct link back to the chat. That update made it easier to follow up, especially if the call came at a bad time.

Now the app is moving further. In the latest Android beta, some users, according to WABetaInfo (WBI), are seeing a new option that lets them record a voice message when a call goes unanswered. The prompt shows up at the bottom of the screen right after the missed call. It also appears inside the chat where the call is logged, which means the person calling doesn’t need to search for the conversation before sending a reply.

Works Like a Voicemail, But Simpler


The feature is close to voicemail in how it functions, though it stays inside WhatsApp’s own messaging system. Instead of calling back later or typing a note, the caller can leave a short recording explaining why they were calling. The recipient then gets both the missed call alert and the message in the same thread, ready to play when they have time.

A Useful Shortcut

The change may help in everyday situations. Someone trying to reach a colleague stuck in a meeting, for example, can quickly explain the reason for the call without waiting for another chance to connect. It is faster than drafting a text and serves as a reminder tied to the missed call itself. Regular voice notes in chats are still available, but this new shortcut makes the process quicker in moments where timing matters.

Gradual Rollout for Testers

At the moment, the option is showing up only for selected beta testers on Android who have installed the most recent update from the Play Store. WhatsApp is expanding access gradually, so more users should see the feature appear in the coming weeks.

by Asim BN via Digital Information World

Sunday, August 24, 2025

Benchmarking AI with MCP-Universe Shows Limits of GPT-5 and Other Models

Salesforce AI Research has introduced a new benchmark that puts large language models through tasks tied to the Model Context Protocol, the fast-growing standard designed to link AI systems with outside tools. Called MCP-Universe, the framework evaluates models against real servers instead of simulations, and its first round of results shows that even the most advanced systems are far from dependable when asked to work in real-world enterprise settings.

The benchmark covers six domains: navigation, repository management, financial analysis, 3D design, browser automation, and web searching. Within those areas sit 231 tasks, split across 11 live servers, ranging from Google Maps and GitHub to Yahoo Finance, Blender, Playwright, and Google Search. Each domain has its own set of sub-tasks, such as route planning in maps, portfolio analysis in finance, or object creation in 3D modeling, with complexity increasing as models are forced to use multiple steps and maintain information over longer contexts.

Instead of relying on a language model to judge another’s output, which has been common in past benchmarks, MCP-Universe measures success by execution. That means checking whether a model formats answers correctly, whether it produces consistent results over time, and whether it can work with data that changes. A separate set of evaluators handles each dimension: format evaluators for strict compliance, static evaluators for timeless facts like historical stock prices, and dynamic evaluators that pull real-time ground truth for shifting data such as live market movements or flight fares.
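The three evaluator types can be pictured with a small sketch. It is an illustration only, not MCP-Universe's code: the function names, the JSON answer shape, the tolerance, and the `fetch_truth` callback are assumptions made for the example.

```python
import json
from typing import Callable


def format_evaluator(answer: str, required_keys: set[str]) -> bool:
    """Pass only if the answer is a JSON object containing every required key."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def static_evaluator(value: float, expected: float, tolerance: float = 0.01) -> bool:
    """Compare against a fixed ground truth, such as a historical closing price."""
    return abs(value - expected) <= tolerance


def dynamic_evaluator(value: float, fetch_truth: Callable[[], float],
                      tolerance: float = 0.01) -> bool:
    """Fetch the ground truth at evaluation time, since it changes between runs."""
    return abs(value - fetch_truth()) <= tolerance


answer = '{"ticker": "XYZ", "close": 101.25}'  # made-up model output
ok_format = format_evaluator(answer, {"ticker", "close"})
close = json.loads(answer)["close"]
ok_static = static_evaluator(close, expected=101.25)
ok_dynamic = dynamic_evaluator(close, fetch_truth=lambda: 103.0)  # stand-in for a live lookup
print(ok_format, ok_static, ok_dynamic)
```

The point of the split is that a well-formatted answer can still be wrong, and an answer that matched yesterday's data can fail against today's.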

The test results reveal a wide gap between model hype and operational performance. GPT-5 led all systems, but its overall success rate stood at just 43.7 percent. It showed strength in financial analysis, completing two-thirds of those tasks, and performed above 50 percent in 3D design, but it failed more often than not in navigation and browser automation. Grok-4 followed at 33.3 percent, then Claude-4.0 Sonnet at 29.4 percent. The best open-source option, GLM-4.5, reached 24.7 percent, ahead of some proprietary systems but still far behind the leaders.

Looking deeper, the evaluator breakdown shows another layer of fragility. On format checks, most models scored high, with Claude-4.0 near 98 percent compliance, suggesting they can follow rules when tightly defined. But when asked to produce content against static or live-changing data, success dropped to the 40–60 percent range. GPT-5 again led in dynamic cases with 65.9 percent, but that still meant failure in more than a third of scenarios where up-to-date information was required.

Task efficiency also varied. GPT-5 needed on average just over eight steps to succeed, Grok-4 about 7.7, while smaller models like o3 could finish in under five but with less reliability. That trade-off between speed and accuracy highlights how fragile multi-step reasoning remains, especially in domains with long context chains. The context growth was most obvious in maps, browser automation, and finance, where server outputs return large blocks of data. Summarization experiments, meant to shorten context, brought mixed outcomes: slight gains in navigation but losses elsewhere, showing that compression alone does not solve the memory problem.

Another recurring failure came from unfamiliar tools. In some cases, models called functions incorrectly or set parameters in ways that broke execution. One example involved the Yahoo Finance server, where stock price queries require two distinct dates; models often set them the same, leading to errors. Salesforce tested an exploration phase, letting models experiment with tools before running tasks, and saw partial gains — GPT-4.1 improved slightly in browser automation and Claude in finance — but the fix did not carry across all domains.
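One way to picture the date problem is a guard that checks arguments before a tool call is made. The sketch below is hypothetical and does not use any real Yahoo Finance or MCP API: `validate_price_query` and `safe_call` are invented names, and the rule that the start must precede the end is an added assumption beyond the "two distinct dates" requirement mentioned above.

```python
from datetime import date


def validate_price_query(start: date, end: date) -> tuple[date, date]:
    """Reject identical dates, and (as an assumption) reversed ones."""
    if start == end:
        raise ValueError("start and end must be two distinct dates")
    if start > end:
        raise ValueError("start must come before end")
    return start, end


def safe_call(start: date, end: date) -> str:
    """Return a readable message instead of letting the tool call fail."""
    try:
        validate_price_query(start, end)
    except ValueError as err:
        return f"invalid arguments: {err}"
    return f"ok: query from {start.isoformat()} to {end.isoformat()}"


same_day = safe_call(date(2025, 8, 1), date(2025, 8, 1))
valid = safe_call(date(2025, 8, 1), date(2025, 8, 2))
print(same_day)
print(valid)
```

A descriptive error like this gives an agent something to correct on the next step, whereas an unexplained failure leaves it guessing.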

The benchmark also looked at how frameworks influence outcomes. Comparing agent backbones, the ReAct setup generally outperformed Cursor, despite Cursor being designed as an enterprise agent. ReAct achieved higher overall success with Claude-4.0, while Cursor only excelled in isolated areas like browser automation. With OpenAI’s o3 model, the company’s own Agent SDK produced stronger results than ReAct, particularly in finance and design, suggesting that framework-model pairings can alter performance as much as raw model size.

Adding unrelated MCP servers made tasks even harder. When models had to deal with more tools than necessary, performance dropped sharply. In location navigation, for example, Claude-4.0 fell from 22 percent success to 11 percent once extra servers were included. The decline highlights how easily noise can destabilize tool orchestration, a problem that enterprises will need to address as they scale up.

For all the variety of tests, the conclusion is consistent. Current models, even GPT-5, can handle isolated reasoning or simple calls, but when placed into real environments with shifting data, long contexts, and unfamiliar tool sets, they still fail most of the time. MCP-Universe exposes those gaps more clearly than past benchmarks, offering a way to measure progress as researchers try to close them. For companies deploying AI at scale, the results point to a hard truth: building reliable agents will depend not just on bigger models but also on smarter frameworks, better context handling, and stronger safeguards around tool use.


Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.

by Irfan Ahmad via Digital Information World

Saturday, August 23, 2025

LLMs Struggle with Reasoning Beyond Training, Study Finds

A new study from Arizona State University has questioned whether the step-by-step reasoning displayed by large language models (LLMs) is as reliable as it seems. The work argues that what appears to be careful logical thinking, often encouraged through Chain-of-Thought (CoT) prompting, may instead be a fragile form of pattern matching that collapses when tested outside familiar territory.

Why Chain-of-Thought Looks Convincing

CoT prompting has been widely adopted to improve performance on complex reasoning tasks. By asking models to explain their answers in stages, developers have found that outputs look structured and often reach correct solutions. This has led many to assume that models are carrying out a type of human-like reasoning. Yet the ASU team points out that the appearance of logic can be misleading. Their experiments show that models often weave together plausible explanations while still arriving at inconsistent or even contradictory conclusions.

One example in the paper shows a model correctly identifying that the year 1776 is divisible by four and therefore a leap year, yet it concludes in the very next step that it is not. Such slips reveal that the chain itself is not anchored in true inference but is instead shaped by statistical patterns learned during training.
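For reference, the full Gregorian leap-year rule is slightly stricter than "divisible by four": century years count only if they are also divisible by 400. A short sketch, with `is_leap_year` as an illustrative helper name, shows that 1776 passes either way because it is not a century year:

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for year in (1776, 1900, 2000):
    # Cross-check the hand-written rule against the standard library.
    assert is_leap_year(year) == calendar.isleap(year)
    print(year, is_leap_year(year))
```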

A Data Distribution Lens

To test the limits of CoT, the researchers introduced what they call a data distribution lens. The central idea is that LLMs learn inductive biases from their training sets and generate reasoning chains that mirror those patterns. As long as new problems share structural similarities with what the model has seen before, performance is strong. But when the test data deviates, even slightly, the reasoning falls apart.

The group examined three kinds of distribution shift. The first was task generalization, where new problems required reasoning structures not present in the training data. The second was length generalization, which tested whether models could handle reasoning sequences that were longer or shorter than expected. The third was format generalization, where small changes in the way prompts were worded or structured were introduced.

DataAlchemy and Controlled Testing

To isolate these effects, the researchers built a controlled experimental framework called DataAlchemy. Rather than working with massive pre-trained models, they trained smaller models from scratch on synthetic datasets. This gave them precise control over how training and test data differed.

The findings were consistent. When tasks, sequence lengths, or prompt formats shifted beyond the training distribution, CoT reasoning deteriorated sharply. The models still produced chains that looked fluent and structured, but their accuracy collapsed. In some cases, they attempted to force the reasoning into the same length or shape as their training examples, even if this meant introducing unnecessary or incorrect steps.

The Mirage of Reasoning

Across all three tests, the study shows that CoT is less a method of reasoning than a sophisticated form of structured imitation. The researchers describe it as a mirage: convincing in appearance, but ultimately shallow. What seems like careful reasoning is better understood as interpolation from memorized examples.

The fragility was especially visible in the format tests. Even small, irrelevant changes to the structure of a prompt could derail performance. Similarly, when new task transformations were introduced, the models defaulted to the closest patterns seen during training, often producing reasoning steps that appeared logical but led to wrong answers.

Fine-Tuning as a Short-Term Fix

The team also explored whether supervised fine-tuning (SFT) could help. By adding just a small amount of data from the new, unseen distribution, performance improved quickly. However, the improvement only applied to that specific case. This suggested that fine-tuning simply extends the model’s training bubble slightly rather than teaching it more general reasoning skills.

Implications for Enterprise AI

The research warns developers not to treat CoT as a plug-and-play reasoning tool, especially in high-stakes applications such as finance, law, or healthcare. Because the outputs often look convincing, they risk projecting a false sense of reliability while hiding serious logical flaws. The study stresses three lessons for practitioners.

First, developers should guard against overconfidence and apply domain-specific checks before deploying CoT outputs in critical settings. Second, evaluation should include systematic out-of-distribution testing, since standard validation only shows how a model performs on tasks that resemble its training data. Third, while fine-tuning can temporarily patch weaknesses, it does not provide true generalization and should not be treated as a permanent solution.
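The second lesson, out-of-distribution testing, can be sketched as a report that breaks accuracy down by split instead of giving one overall number. Everything here is a toy assumption made for illustration: the split names, the string-reversal task, and the lookup-table `toy_model` are not taken from the study.

```python
from collections import defaultdict
from typing import Callable, Iterable


def accuracy_by_split(examples: Iterable[tuple[str, str, str]],
                      predict: Callable[[str], str]) -> dict[str, float]:
    """Accuracy per split, where each example is (split, prompt, expected)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for split, prompt, expected in examples:
        total[split] += 1
        correct[split] += int(predict(prompt) == expected)
    return {split: correct[split] / total[split] for split in total}


# Toy stand-in for a model: it only "knows" prompts it memorized in training.
memorized = {"ab": "ba", "cd": "dc"}


def toy_model(prompt: str) -> str:
    return memorized.get(prompt, prompt)  # unseen prompts come back unchanged


examples = [
    ("in_distribution", "ab", "ba"),
    ("in_distribution", "cd", "dc"),
    ("longer_length", "abcd", "dcba"),
    ("new_format", "a b", "b a"),
]
report = accuracy_by_split(examples, toy_model)
print(report)
```

A single averaged score over these four examples would read 50 percent and hide the fact that every shifted case failed, which is the kind of gap the study warns about.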

A Path Forward

Despite its limitations, CoT can still be useful within well-defined boundaries. Many enterprise applications involve repetitive and predictable tasks, where pattern-matching approaches remain effective. The study suggests that developers can build targeted evaluation suites to map the safe operating zone of a model and use fine-tuning in a focused way to address specific gaps.

The findings underline the importance of distinguishing between the illusion of reasoning and actual inference. For now, CoT should be seen as a valuable but narrow tool, one that helps models adapt to familiar structures rather than a breakthrough in machine reasoning.


Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.



by Irfan Ahmad via Digital Information World

Friday, August 22, 2025

Y Combinator pushes back against Apple’s App Store fees in Epic Games case

Y Combinator has stepped into the long-running legal dispute between Apple and Epic Games, urging the court to reject Apple’s latest appeal. The startup accelerator filed a supporting brief that argues Apple’s control of the App Store has held back innovation and made it harder for young companies to compete.

The legal fight over payment rules

Epic first sued Apple in 2020, challenging the iPhone maker’s practice of charging developers up to 30 percent on all purchases made through the App Store, including in-app transactions. The gaming firm also objected to rules that prevented developers from informing users about cheaper payment options outside the store.

Although a judge later ordered Apple to stop enforcing those restrictions, the company introduced a separate system that still allowed links to outside payment methods but kept a 27 percent service charge in place. Epic returned to court, arguing that Apple was sidestepping the injunction. Earlier this year, the judge agreed and directed Apple to end the practice of collecting fees on payments processed elsewhere. Apple is now appealing that decision.
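To make the difference between the two rates concrete, here is a small worked example. The $10.00 price is hypothetical, and the sketch leaves out whatever an outside payment processor would charge, since the article gives no figure for that.

```python
def developer_net_cents(price_cents: int, fee_percent: int) -> int:
    """Amount left after a percentage fee, in whole cents (integer math only)."""
    fee_cents = price_cents * fee_percent // 100
    return price_cents - fee_cents


price = 1000  # a hypothetical $10.00 purchase
net_in_app = developer_net_cents(price, 30)  # top in-app commission rate
net_linked = developer_net_cents(price, 27)  # linked-out purchase with the 27 percent charge
print(net_in_app, net_linked, net_linked - net_in_app)
```

On these assumptions the linked-out route leaves the developer only 30 cents more per $10 sale before any processing cost, which is why Epic argued the arrangement sidestepped the injunction.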

Y Combinator’s stance

By filing its brief, Y Combinator has formally sided with Epic. The accelerator said that high platform fees discouraged investors from supporting app-based startups, since the costs could erase already slim margins and prevent companies from expanding or hiring. It argued that lowering these barriers would allow venture backers to fund businesses that were previously considered too risky.

Wider impact on startups

For investors like Y Combinator, the court’s current ruling could change the investment landscape. If upheld, developers would be free to point users to alternative payment methods without Apple taking a share. That shift could encourage more funding into mobile-first ventures, which have often struggled under the so-called Apple Tax.

What comes next

The appeals court will hear arguments on October 21. Until then, the order requiring Apple to allow outside payment options remains in effect. The outcome will not only affect Epic’s case but could also set a precedent for how platform operators handle transactions in digital marketplaces.


Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.

by Asim BN via Digital Information World

When “Cybernaut” Was Cool: 15 Internet Slang Terms That Didn't Last the Decade

Whether you’re a Boomer who rarely uses text or a chronically online Gen Zer, the chances that you’ve used a slang term are probably pretty high. And if you’ve ever used a term and received an eye roll in return, chances are you reached for lingo that aged you rather than engaged you.

Just like the latest clothing fashion, slang trends come and go. But some slang gets so popular it actually lands in the dictionary. Unfortunately, not every word sticks around. Plenty fade out after a few years and quietly disappear from the official lists. And if you’re not up to date with your slang, you risk using dated language that builds walls rather than bridges for your communication.

If you’re a language learner, you’ll want to read on to see which words have been collectively shunned by American English speakers so that you stay ahead of the slang trends.

Slang 101: Quick, Quirky, and Evolving

Slang doesn’t just represent trends in language. It also has a lot of practical and fun uses. People lean on slang to keep conversations short and snappy—LOL and BRB are classics. And some slang is just fun to say, like stalkerazzo or crybully, while other terms, like sponcon, mash up words to efficiently describe something new (sponsored content).

Other slang words just, well, happen. Take cap, for instance, which has its roots in Atlanta and Memphis. Because cap refers to the upper limit of authenticity, saying no cap is basically the slang way of doubling down on honesty. Mid is slang for “mediocre,” and that’s exactly how people use it: to knock something that’s just plain average.

Regardless of how they originated, some slang words are just plain odd to use, especially if you’re not a native English speaker. But these words often create shorthand ways of getting your thoughts across, which make them incredibly useful. And because their origin is connected to current events and trends, slang reflects the evolution of speech and the English language.

Worn-Out Welcome: Slang Words That Didn’t Stay Cool

Over the years, dictionaries have added slang terms to their list of definitions. A study by Preply measured how relevant those words remain to English speakers today, and we’ll go through the top 15 that didn’t stay cool for English slang-users.

First place goes to stalkerazzo—a mashup of stalker and paparazzi. It once described celebrity-obsessed photographers, but most people just stuck with the originals and the word faded fast.

Declinist, crybully, and McJob take the second, third, and fourth spots, respectively. Declinist was for folks convinced their nation was headed downhill. These days, nobody says it outside of maybe a poli-sci lecture. The word’s relevancy score? Barely a blip at 17.98.

Crybully mixed the idea of crying victim with being a bully. It popped up online for people who weaponized victimhood, but it never really caught on outside of internet debates. McJob became shorthand for low-paying, dead-end work, an obvious jab at fast-food gigs. The term stuck around for a while but feels dated in today’s job market talk.

Words like cyberspeak, cybercitizen, and cybersurfer (in spots 5, 6, and 7) probably sound like a Geocities home page, and for good reason. In the ’90s, everything online needed a cyber in front of it. But that fad crashed along with dial-up. Cybernaut—another cyber-merge—landed in 11th place, once meant to describe anyone cruising the web. These days it sounds more like a forgotten sci-fi character than an actual internet user.

Number eight: defriend. Facebook made unfriend the standard, so this wannabe synonym never stood a chance. At nine we get verklempt, a Yiddish word for being choked up, which is lovely in theory, but its spelling and pronunciation scared people off. And rounding out the trio, in tenth place, is Frankenfood. It was meant as a slam on genetically engineered meals, but the Frankenstein joke got old fast.

The next three places are words that refer to online content. The relevance scores of the following words range from 34 to 36, indicating that slang terms referring to social media outlets are changing as the technology itself advances:

  • Slacktivism grabbed 12th place and is aimed at people who share posts about causes but don’t lift a finger beyond that. The insult stuck for a bit but feels tired now.
  • Next is tweetstorm, once used for long rants broken into dozens of tweets. Since Twitter rebranded to X and threads became the norm, the word fizzled out.
  • Number 14 is sponcon, short for sponsored content. Influencer culture gave it a brief run, but newer platforms and shifting lingo have pushed it aside.

Last up: fatberg. It describes those nasty sewer clogs made of grease and junk. Great word if you’re a plumber, not so handy in casual conversation.

When to Use or Not Use Slang

You probably use slang often when texting or having informal conversations with your friends, especially since slang terms can refer back to funny jokes or current trends. Other slang terms make your life easier by abbreviating long words or blending words together to create a new word for something. However, you definitely need to know when to use or not use slang.

Dropping slang at work or in a serious setting usually doesn’t land well. Calling your manager’s great new idea “mid” in a meeting, for example, probably won’t score you points, and there’s always a chance people won’t know what you mean.

In more informal conversation, slang use can be appropriate. Some of your friends may not understand every slang term you use, but the more that you practice using slang, the better you’ll become at figuring out when to reference it. To keep it safe, keep your slang usage light in important conversations, even if the term you use has landed itself a spot in the dictionary.

Conclusion

English slang is a moving target. Yesterday’s hot term is today’s cringe. For learners, it’s a reminder that dictionaries can only tell you so much. Real practice, with real people, is where you figure out what still lands and what sounds awkward.



by Irfan Ahmad via Digital Information World