Is jagged intelligence just janky intelligence?
Improvements in reasoning models haven’t removed their vulnerability to confounding information or logical missteps — while claims that “AGI” is imminent keep coming.

What’s happening? The scientific equivalent of elevator music drones on, as top labs debate the prospects for imminent “artificial general intelligence.”
Companies and investors are riveted; mentions of “AGI” rose sharply in earnings calls at the start of this year, according to AlphaSense.
So what? While you wait, a new term has emerged for the sometimes unreliable and inefficient software being rolled out on the runway to AGI: “jagged” intelligence.
Why “jagged”? Today’s large language models have severe peaks and troughs of performance. In some cases, models can complete tasks much faster and more reliably than human researchers.
In others, they make mistakes a preschooler would not. There are highs and lows, hence the “jagged” moniker.
How did we get here? In September 2024, OpenAI released o1-preview, the first of its “o-series” models purportedly capable of “reasoning” and “thinking.”
Three months later, OpenAI shared the benchmark results of its o3 model, with president Greg Brockman describing it as a “breakthrough.”
Reports that labs were struggling to build more advanced systems were quashed.
Since then, Google’s Gemini 2.5 models, Anthropic’s Claude 3.7 and 4, and xAI’s Grok 3 have all been pitted against OpenAI’s o-series, and a runway of sorts towards “AGI” has supposedly been cleared.
What do AI leaders claim? The leap to “reasoning” models turned up the volume on AGI rhetoric. Anthropic’s CEO, Dario Amodei, told Axios in May that artificial intelligence could eliminate half of all entry-level white-collar jobs in the next five years. Google DeepMind CEO Demis Hassabis is anticipating the arrival of AGI shortly after 2030.
The geopolitical dovetail. Still, the prospect of large language models yielding AGI has caught the attention of world leaders. We’ve discussed the growing geopolitical perception that it is imminent and will be transformative in Inferences.
Economic and military advantages are supposedly on offer. As a result, the incentive to drive innovation, supply energy and purchase compute capabilities is at an all-time high from Beijing to DC, and from Europe to the Persian Gulf.
Is the progress real? Yes, but it’s unclear where it is headed. The debate about AGI’s geopolitical impact is being fuelled by progress on performance benchmarks, with little alignment as to what that progress demonstrates.
To justify claims of imminent AGI, models must be improving not just along some measures, but along the measures that matter for creating AGI.
As yet, there is no consensus on what those measures should be.
What does the research show? Experimental results tend to confuse rather than clarify. Shortly after o1-preview’s release, Apple researchers tested 25 open- and closed-source state-of-the-art LLMs on an altered version of a grade-school mathematics benchmark, built from variants of the original questions.
In one test, irrelevant and inconsequential statements were inserted — leaving the required reasoning steps of the problem intact — resulting in “catastrophic performance decline.”
Although the models varied in their robustness to such factors, with o1-preview proving the most robust to difficulty increases, both o1-preview and o1-mini exhibited a “significant performance drop” on problems including more confounding information.
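The perturbation idea behind that test can be sketched in a few lines of Python. This is an illustrative reconstruction, not the researchers’ actual code: the helper, the sample question and the distractor sentence are all invented for the example.

```python
# Illustrative sketch of the "inconsequential statement" perturbation:
# a clause is inserted that changes the surface text of a maths word
# problem without changing the reasoning needed to solve it.

def add_distractor(question: str, distractor: str) -> str:
    """Insert an irrelevant clause before the final question sentence."""
    head, sep, tail = question.rpartition(". ")
    if not sep:  # single-sentence question: prepend instead
        return f"{distractor} {question}"
    return f"{head}. {distractor} {tail}"

original = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis does he have?"
)

# The added clause mentions a number, which tempts a model to use it,
# but it is irrelevant to the answer (still 44 + 58 = 102).
perturbed = add_distractor(
    original,
    "Five of the kiwis are a bit smaller than average.",
)
print(perturbed)
```

A robust reasoner should give the same answer to both versions; the reported “catastrophic performance decline” came from models treating distractors like this as part of the calculation.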
Is jagged intelligence just janky intelligence? Perhaps. It is possible that these kinks will be worked out, and models will display ever-improving capabilities against benchmarks; a kind of “jagged intelligence” that gets fuller and less spiky over time and merely needs tweaking, more data, chips and power.
Part of the reasoning models’ appeal — and a way of making sense of their benchmark scores — is that they break down problems into ‘chains-of-thought’ that look much like a human’s verbalized thought process.
However, one recent study suggested that the chains-of-thought a model produces before answering do not fully account for how it reaches its final output.
Another study showed that newer models perform increasingly well on mathematics benchmarks which score only the final numerical answer, not the working.
And although Apple’s “Illusion of Thinking” paper on the limitations of reasoning models quickly polarized commentators, other research converges on a similar theme: increasing the complexity of a problem causes LLM performance to plateau or degrade.
Taken together, this research suggests that the process behind current LLMs may be more fragile, and less plastic or adaptable, than leading labs would hope.
The upshot? Large language models have clear limitations. Nevertheless, they may still drive improvements in economic productivity, standards of living and military capabilities as they diffuse into specific applications that mask these limitations.
More rigorous assessment of their current capabilities should have us reaching for a larger pinch of salt, especially when it comes to claims of imminent AGI.
Vincent J. Carchidi was a co-author of this edition of Inferences. Vincent is a defense and technology analyst specializing in critical and emerging technologies. His opinions are his own and do not reflect those of his employer. You can follow him on LinkedIn and X.
What we’re reading:
A case for “context-aware” measurement of AI model performance.
A roundtable review of “The United States, China, and the Competition for Control” by Melanie Sisson, examining the US’ “trajectory of unwillingness to cooperate within the order” it created.
A new study claiming to successfully “predict and capture human cognition” in Nature.
What we’re looking ahead to:
6 - 7 July 2025: Annual BRICS Summit, Rio de Janeiro, Brazil.
9 - 11 July 2025: AI for Good Global Summit.
9 - 23 Sep 2025: UN General Assembly (UNGA 80), New York.
22 - 23 Oct 2025: G20 Leaders’ Summit, Johannesburg.
10 - 20 Nov 2025: UN Climate Change Conference (COP30).
February 2026: India Global AI Summit (expected).