
Work-shy AI collapses under questioning – Apple debunks AGI myth

New research from Apple casts serious doubt on excitable claims about sentient machines. Its results suggest large reasoning models collapse under pressure – challenging the AGI concept, and exposing AI industry overreach.

In sum – what to know:

Total collapse – new Apple study shows frontier AI models fail completely on high-complexity, multi-step reasoning tasks.
Work shy – so-called large reasoning models mimic patterns, not logic; they generate less thought as problems get harder.
Hype busting – fundamental limits in transformer-based AI undermine claims about AGI from tech’s biggest hype merchants.

The AI hype bubble has truly burst – if you believe the expert cranks and crazies on social media (bless you all). Or are they just experts? Sometimes it’s hard to tell. But this echo-chamber narrative about AI’s limitations has gathered a fierce head of steam – and the sudden noise is like a rising chorus for a generalist old hack, paid to retain proper skepticism, if not cynicism, in the face of such grotesque and unprecedented AI clamour.

Because there are two entrenched sides now: one that says AI will quickly control everything, and one that says it is useful but problematic, and will ultimately be directed to help with certain tasks. As always, the truth is probably somewhere in between – in lots of ways, and for lots of reasons. But for one thing, Apple has just shown AGI to be a pipedream – by testing the leading AI models, and exposing the cracks in their supposed reasoning.

Indeed, it turns out that the smartest AI models out there are just glorified pattern-matching systems, which fold under pressure. In other words, top-end frontier models – notably, the latest ‘large reasoning models’ (LRMs) from Anthropic and DeepSeek (the ‘thinking’ versions of their Claude-3.7-Sonnet and R1/V3 systems); plus OpenAI’s o3-mini – don’t really ‘reason’ for themselves; instead they just fake it by mimicking patterns they’ve seen in training.

When faced with genuinely novel and complex problems, which require structured logic or multi-step planning, they break down completely. They are fine with medium-complexity tasks, even exhibiting rising ‘smartness’ to a point; but they are less good than standard large language models (LLMs) at low-complexity tasks, and they fail completely at high-complexity tasks, crashing to zero accuracy – even when handed explicit, hand-coded solution algorithms to follow.

This is the stark conclusion of a new research paper by Apple, which investigates the reasoning capabilities of advanced LLMs, called LRMs, through controlled mathematical and puzzle experiments, and asks whether their recent improvements come from better reasoning, or from more exposure to benchmark data or greater computational effort. The result is that, when faced with complex problems, they basically give up the ghost.

When the going gets tough, the tough get totally flummoxed – it turns out. Apple writes: “Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds… Standard LLMs outperform LRMs at low complexity, LRMs excel at moderate complexity, and both collapse at high complexity.” 

LRMs remain “insufficiently understood”, writes Apple. Most evaluations of them have focused on standard coding benchmarks, it argues, which emphasise “final answer accuracy” rather than the internal, step-by-step chain-of-thought (CoT) “reasoning traces” – measured in ‘thinking tokens’ – that models generate between question and answer. As such, Apple proposed four puzzle environments to allow fine-grained control over problem complexity.

By systematically increasing the difficulty of these new-style puzzles (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World; see the paper for their descriptions) and comparing the responses from both reasoning LRM and non-reasoning LLM engines, it found “frontier LRMs face a complete accuracy collapse beyond certain complexities”. More than this, their scaling mechanisms get screwy, and their logic explodes.
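To make that concrete, here is a minimal sketch, in Python, of the kind of puzzle environment the paper describes – an illustrative assumption, not Apple’s actual harness: difficulty scales smoothly with a single knob (the number of disks), and every intermediate move in a model’s answer can be checked against the rules, rather than just the final answer.

```python
# A minimal sketch (an assumption, not Apple's actual test harness) of a
# controlled-complexity puzzle environment: Tower of Hanoi with a validator
# that replays a proposed solution move by move.

def validate_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Check a proposed Tower of Hanoi solution step by step.

    Pegs are numbered 0, 1, 2; each move is (from_peg, to_peg).
    Returns True only if every move is legal and all disks end on peg 2.
    """
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                           # nothing to move from this peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                           # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # all disks stacked on the target peg

# Example: a valid 2-disk solution passes; an illegal one fails.
print(validate_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # True
print(validate_hanoi(2, [(0, 2), (0, 2)]))          # False - larger disk placed on a smaller one
```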

Apple writes: “They exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget… LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.” The idea is that if a model is truly ‘reasoning’, then harder problems should result in more detailed chains of thought – and more tokens. 
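For illustration, the sketch below (again an assumption, not the prompt or code Apple used) shows what an explicit, hand-coded algorithm for Tower of Hanoi looks like, and why reasoning effort should grow with complexity if a model were genuinely executing such a procedure: the optimal solution length is 2^n − 1 moves, roughly doubling with every extra disk.

```python
# A hedged illustration of the kind of explicit, hand-coded algorithm a
# reasoning model could be handed for Tower of Hanoi. Executing it is
# mechanical; the paper's point is that models reportedly still collapse on
# large instances even when such a procedure is spelled out for them.

def solve_hanoi(n: int, source: int = 0, target: int = 2, spare: int = 1) -> list[tuple[int, int]]:
    """Classic recursion: move n-1 disks aside, move the largest, rebuild on top."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, source, spare, target)
        + [(source, target)]
        + solve_hanoi(n - 1, spare, target, source)
    )

for n in range(1, 11):
    # Optimal solution length is 2**n - 1, so the required 'reasoning effort'
    # grows exponentially as each extra disk is added.
    assert len(solve_hanoi(n)) == 2**n - 1
    print(f"{n:2d} disks -> {2**n - 1:4d} moves")
```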

But the Apple study found the opposite: that, as the complexity of their tasks increases, these high-end models use fewer tokens – and ultimately try less, and then just give up. “As problems approach critical difficulty, models paradoxically reduce their reasoning effort despite ample compute budgets. This hints at intrinsic scaling limits in current thinking approaches.” Apple calls this finding “particularly concerning”.

The research also suggests a machine version of clever-person procrastination. The part about reasoning traces says LRMs ‘overthink’ simple problems – by finding correct solutions early and then wasting compute to explore incorrect ones – and make a total hash (“complete failure”) of complex ones. The results challenge prevailing LRM ideas, writes Apple, and suggest “fundamental barriers to generalizable reasoning” – and to the whole AGI shtick, therefore.

There is a sense, of course, that Apple is late to the AI game – at least, versus the likes of OpenAI, Google DeepMind, Anthropic, and Meta; certainly, it has been quieter in the generative AI and LLM race. As such, there is an argument, perhaps, that its new paper – part of a more recent push to establish credibility in AI research – is a subtle critique of the industry’s hype, one that mixes scientific caution with strategic positioning.

But the anti-AGI mob (or anti AI-BS mob) – mostly discussing proper AI capabilities, quickly growing louder – has embraced the paper as further proof that the AI hype machine has over-reached itself, in a guff of capitalist bombast, obfuscation, and mistrust, and faces a late reckoning. (As evidenced, they say, by the indefinite delay of OpenAI’s self-proclaimed GPT-5 system; humans replacing AI workers at Klarna and Duolingo; fake AI at Builder.ai; and lots of other stuff.)

ABOUT AUTHOR

James Blackman
James Blackman has been writing about the technology and telecoms sectors for over a decade. He has edited and contributed to a number of European news outlets and trade titles. He has also worked at telecoms company Huawei, leading media activity for its devices business in Western Europe. He is based in London.