Apple's Bomb on Reasoning Models
Unfolding Apple's research paper, The Illusion of Thinking (LLMs vs. LRMs)
Paper Unfold: breaking down complex research papers into easy-to-understand pointers.

Paper by Apple
If you’ve ever been impressed by how an LLM explains a math problem or walks through a logical puzzle, Apple’s latest research might change the way you think about “AI reasoning.”
Their new paper breaks down what Large Reasoning Models (LRMs) do when they “think,” and whether they’re better than standard LLMs.
Here is the research paper.
Let’s dive in.
Why Benchmarks Are Misleading
Currently, most AI models are evaluated on mathematical and coding tasks. But many of these benchmarks are flawed:
they’re often contaminated (models might have seen parts of the benchmark during training).
they only evaluate final answers, not the process by which the model arrived at them.

Image from the Paper
To fix this, Apple researchers created controllable puzzle environments like Tower of Hanoi, Checker Jumping, and River Crossing, where they could:
adjust complexity in a clean, controlled way
track not just the answer but the entire reasoning trace (see the sketch below)
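To make that concrete, here is a minimal sketch of what such a controllable environment could look like for Tower of Hanoi. This is my own illustration, not the paper's actual code: complexity is a single knob (the number of disks), and a validator replays a model's proposed move sequence step by step, which is what lets researchers inspect the whole reasoning trace instead of just the final answer.

```python
# Illustrative sketch (not the paper's code): a Tower of Hanoi environment
# with adjustable difficulty and a trace validator.

def hanoi_env(num_disks):
    """Initial state: all disks on peg 0, largest at the bottom."""
    return [list(range(num_disks, 0, -1)), [], []]

def apply_move(state, src, dst):
    """Apply one move (src peg -> dst peg); raise if it breaks the rules."""
    if not state[src]:
        raise ValueError(f"peg {src} is empty")
    disk = state[src][-1]
    if state[dst] and state[dst][-1] < disk:
        raise ValueError(f"cannot place disk {disk} on smaller disk {state[dst][-1]}")
    state[dst].append(state[src].pop())

def score_trace(num_disks, moves):
    """Replay a model's move trace; return (# moves executed, solved?)."""
    state = hanoi_env(num_disks)
    for i, (src, dst) in enumerate(moves):
        try:
            apply_move(state, src, dst)
        except ValueError:
            return i, False            # first invalid move ends the trace
    return len(moves), len(state[2]) == num_disks

# Complexity scales cleanly: n disks => 2**n - 1 moves in the optimal solution.
for n in (3, 5, 7, 10):
    print(n, "disks -> optimal solution length", 2**n - 1)
```

Because every intermediate state is checkable, the environment can say exactly where a reasoning chain goes off the rails, not just whether the final answer was right.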
Three Rules of Performance
When comparing LRMs to normal LLMs using the same compute, three patterns emerged:
Low complexity
standard LLMs often outperform LRMs. They’re faster and more efficient when the problem is easy.
Medium complexity
LRMs take the lead, and their extended reasoning steps (like Chain-of-Thought and self-reflection) help.
High complexity
both models collapse. Accuracy drops to near zero. The models get overwhelmed, no matter how much “thinking” they do.
This suggests that there’s a sweet spot where reasoning helps, but it breaks down past a certain threshold.
4 factors to understand the paper’s argument:
Scaling
Overthinking
Struggle with Algorithms
Inconsistency
1 - Scaling
Here’s where it gets wild.
As task complexity increases:
you’d expect LRMs to use more tokens (reason harder).
but instead, their thinking effort peaks and then drops off as the puzzles get harder.
It’s like their brains give up when the puzzle gets too hard.
2 - Overthinking
For simple tasks, LRMs often find the correct answer early, but then they keep going, exploring wrong paths and wasting tokens.
This leads to a paradox:
more reasoning doesn’t mean better performance.
it often leads to worse answers.
3 - Struggle With Algorithms
Even when you give the model the solution algorithm, it still struggles.
This means it’s not just a planning issue.
It’s a computation problem. The model can’t reliably:
follow steps
verify outcomes
manipulate symbols
like a traditional program would (see the sketch below).
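To picture what “giving the model the solution algorithm” means: the Tower of Hanoi solution is a textbook recursion that a few lines of code execute flawlessly at any size. The sketch below is my own illustration of that algorithm, not the exact prompt used in the paper.

```python
# The classic recursive Tower of Hanoi solution. A traditional program follows
# these steps deterministically at any size; the paper reports that LRMs still
# fail even when handed a procedure like this.

def solve_hanoi(n, src=0, dst=2, aux=1, moves=None):
    """Return the optimal move list for n disks (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, aux, dst, moves)   # move n-1 disks out of the way
    moves.append((src, dst))                   # move the largest disk
    solve_hanoi(n - 1, aux, dst, src, moves)   # stack n-1 disks back on top
    return moves

print(len(solve_hanoi(7)))  # 127 moves, correct every single time
```

The point of the comparison: executing a known procedure is trivial for a program but apparently not for a model that only predicts text about the procedure.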
4 - Inconsistency
One model (Claude 3.7 Sonnet) could do 100 moves in the Tower of Hanoi but only 4 in the River Crossing, even when the River puzzle required fewer moves.

Why?
Probably differences in training data, contamination, or just weak generalization.
My Take for AI Engineers and Leaders:
LRMs simulate reasoning, but they’re far from general-purpose thinkers.
Reasoning that works at the easier levels is still better than having no reasoning at all.
1 > 0
They hit a wall as complexity increases, and their reasoning often becomes inefficient or even counterproductive.
be careful assuming LRMs reason well; they mostly do so only at lower complexity
designing better evaluation methods, not just benchmarks, is key
we're still in the early days of true reasoning in AI
Until next time.
Stay curious and keep building!
Happy AI
Check out the post by the AI Engineer HQ member here.
PS: Reply to this email if you want me to write on the topic you are interested in.