Apple's Bomb on Reasoning Models

Unfolding Apple's research paper, The Illusion of Thinking (LLMs vs. LRMs)

Paper Unfold: breaking down complex research papers into easy-to-understand pointers.


If you’ve ever been impressed by how an LLM explains a math problem or walks through a logical puzzle, Apple’s latest research might change the way you think about “AI reasoning.”

Their new paper breaks down what Large Reasoning Models (LRMs) do when they “think,” and whether they’re better than standard LLMs.

Here is the research paper.

Let’s dive in.

Why Benchmarks Are Misleading

Currently, most AI models are evaluated on mathematical and coding tasks. But many of these benchmarks are flawed:

  • they’re often contaminated (models might have seen parts of the benchmark during training).

  • they only evaluate final answers, not the process by which the model arrived at them.


To fix this, Apple researchers created controllable puzzle environments like Tower of Hanoi, Checker Jumping, and River Crossing, where they could:

  • adjust complexity in a clean, controlled way

  • track not just the answer but the entire reasoning trace
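
To make this concrete, here's a minimal sketch (in Python, not the paper's actual code) of what such a controllable environment could look like for Tower of Hanoi: the number of disks is the complexity knob, and every move the model proposes can be checked, not just the final state.

```python
# Illustrative sketch of a controllable Tower of Hanoi environment
# (not the paper's implementation). Complexity = number of disks;
# every proposed move can be validated, not just the final answer.

class HanoiEnv:
    def __init__(self, n_disks: int):
        self.n_disks = n_disks                              # complexity knob
        self.pegs = [list(range(n_disks, 0, -1)), [], []]   # all disks start on peg 0
        self.trace = []                                      # full move history

    def move(self, src: int, dst: int) -> bool:
        """Apply one move; return False if it is illegal."""
        if not self.pegs[src]:
            return False
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False                                     # bigger disk onto smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        self.trace.append((src, dst))
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n_disks


def score(moves, n_disks):
    """Score a model's proposed move list: count valid moves and check completion."""
    env = HanoiEnv(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not env.move(src, dst):
            return {"valid_moves": i, "solved": False}
    return {"valid_moves": len(moves), "solved": env.solved()}
```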

Three Rules of Performance

When comparing LRMs to normal LLMs using the same compute, three patterns emerged:

  1. Low complexity: standard LLMs often outperform LRMs. They’re faster and more efficient when the problem is easy.

  2. Medium complexity: LRMs take the lead, and their extended reasoning steps (like Chain-of-Thought and self-reflection) help.

  3. High complexity: both models collapse. Accuracy drops to near zero. The models get overwhelmed, no matter how much “thinking” they do.

This suggests that there’s a sweet spot where reasoning helps, but it breaks down past a certain threshold.

Four factors to understand the paper’s argument:

  1. Scaling

  2. Overthinking

  3. Struggle with Algorithms

  4. Inconsistency

1 - Scaling

Here’s where it gets wild.

As task complexity increases:

  • you’d expect LRMs to use more tokens (reason harder).

  • but instead, their reasoning effort peaks and then declines as the puzzles get even harder.

It’s like their brains give up when the puzzle gets too hard.
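
If you wanted to reproduce this measurement, the analysis itself is simple. The sketch below assumes a hypothetical run_model(n_disks) helper standing in for whatever model API you use; it counts reasoning tokens per complexity level and finds where the effort peaks.

```python
# Rough sketch of the scaling measurement. `run_model` is a hypothetical
# placeholder for your own model call; it should return the raw reasoning
# ("thinking") text produced for a Tower of Hanoi instance with n disks.

def run_model(n_disks: int) -> str:
    raise NotImplementedError("plug in your model call here")

def thinking_tokens_by_complexity(max_disks: int) -> dict:
    counts = {}
    for n in range(1, max_disks + 1):
        trace = run_model(n)
        counts[n] = len(trace.split())   # crude token proxy: whitespace split
    return counts

def peak_complexity(counts: dict) -> int:
    """Complexity level with the most reasoning effort. The paper's finding:
    effort declines beyond this point, even though harder puzzles should
    demand more thinking and the token budget is not exhausted."""
    return max(counts, key=counts.get)
```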

2 - Overthinking

For simple tasks, LRMs often find the correct answer early, but then they keep going, exploring wrong paths and wasting tokens.

This leads to a paradox:

  • more reasoning doesn’t mean better performance.

  • it often leads to worse answers.
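
One way to see this concretely: when the correct answer for a simple puzzle is known, you can scan the reasoning trace for where it first appears and treat everything generated afterwards as wasted effort. A rough sketch (the plain-text trace format is an assumption, not the paper's setup):

```python
# Rough sketch: quantify "overthinking" as the share of a reasoning trace
# produced *after* the correct answer first appears. Assumes a plain-text
# trace in which the known answer can be matched as a substring.

def overthinking_ratio(trace: str, correct_answer: str):
    pos = trace.find(correct_answer)
    if pos == -1:
        return None                          # answer never appears in the trace
    first_hit = pos + len(correct_answer)    # characters up to the first correct answer
    wasted = len(trace) - first_hit          # everything generated after that point
    return wasted / len(trace)

# Toy example: the model finds the right move early, then keeps second-guessing.
trace = "move disk 1 to peg C ... but wait, maybe peg B? ... no, reconsider ..."
print(overthinking_ratio(trace, "move disk 1 to peg C"))  # ~0.71: most of the trace is wasted
```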

3 - Struggle With Algorithms

Even when you give the model the solution algorithm, it still struggles.

This means it’s not just a planning issue.

It’s a computation problem. The model can’t reliably:

  • follow steps

  • verify outcomes

  • manipulate symbols

like a traditional program would.
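
For perspective, the complete Tower of Hanoi solution algorithm is only a few lines, and a trivial program executes it flawlessly for any number of disks. The sketch below is the standard textbook recursion, not the exact algorithm text from the paper's prompts:

```python
# Standard textbook recursion for Tower of Hanoi. A program follows it
# perfectly for any n; the paper found reasoning models still break down
# even when handed an algorithm like this in the prompt.

def solve_hanoi(n, src="A", aux="B", dst="C", moves=None):
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))                   # move the largest disk to the target
    solve_hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top of it
    return moves

print(len(solve_hanoi(10)))  # 1023 moves, i.e. 2**10 - 1
```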

4 - Inconsistency

One model (Claude 3.7 Sonnet) could make around 100 correct moves in the Tower of Hanoi but failed after only about 4 moves in River Crossing, even though the River Crossing solution required far fewer moves.

Why?

Probably differences in training data, contamination, or just weak generalization.
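
To put those numbers in perspective, the minimum Tower of Hanoi solution grows as 2^n - 1 moves for n disks, so roughly 100 correct moves already means handling sequences much longer than a short River Crossing plan:

```python
# Minimum number of moves for Tower of Hanoi with n disks: 2**n - 1.
for n in range(3, 11):
    print(n, 2**n - 1)
# 3 -> 7, 4 -> 15, 5 -> 31, 6 -> 63, 7 -> 127, ..., 10 -> 1023
# ~100 correct Hanoi moves falls between the 6-disk (63) and 7-disk (127) solutions,
# while the same model stalled after only a handful of moves on a far shorter
# River Crossing plan.
```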

My Take for AI Engineers and Leaders:

LRMs simulate reasoning, but they’re far from general-purpose thinkers.

Reasoning that works at the easy levels is still better than having no reasoning at all.

1 > 0

They hit a wall as complexity increases, and their reasoning often becomes inefficient or even counterproductive.

  • be careful assuming LRMs reason well; they only do so at lower complexity levels

  • designing better evaluation methods, not just benchmarks, is key

  • we're still in the early days of true reasoning in AI

Until next time.

Stay curious and keep building!

Happy AI

Check the Post by the AI Engineer HQ member - Here

AI Engineering vs ML Engineering

How satisfied are you with today's Newsletter?

This will help me serve you better


PS: Reply to this email if you want me to write on a topic you are interested in.
