A gold at the Math Olympiad isn’t everything - AGI’s moving finish line

Published on Aug 18, 2025

SignalFire recently convened an expert panel to dissect how leading LLMs fared at the 2025 International Math Olympiad (IMO). At this year’s competition, held last month on Australia’s Sunshine Coast, a parallel contest played out. While the world’s top students from 110 countries tackled complex mathematics problems with pen and paper, several AI research teams quietly evaluated pre-release models on the same exam questions.

In a twist few of us expected, a private OpenAI model, a private Gemini Deep Think model, and a Gemini 2.5 Pro–based framework each took home an unofficial gold by solving the same five of the six problems. The results sparked a deeper debate about where LLM performance is headed, particularly against the backdrop of an underwhelming GPT-5 release.

Beyond the hype: RL, inference-time compute, and measurable improvements

Ever since DeepSeek R1 debuted, the conversation around reinforcement learning (RL) and inference-time compute has shifted. Instead of manually creating and annotating reasoning chains, we can harness LLMs’ grasp of language and logic to search the semantic space for reasoning paths that lead to verifiably correct solutions. Much like human mathematicians build intuition by solving problems, learning not only what works but also which approaches lead to dead ends, these models refine their reasoning by exploring and discarding countless incorrect paths during training.
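To make that loop concrete, here is a minimal sketch of the search-and-verify pattern in Python. The `generate_candidates` and `verify` functions are placeholders standing in for model sampling and a programmatic checker; no lab’s actual pipeline is this simple.

```python
import random

def generate_candidates(problem: str, n_samples: int = 8) -> list[str]:
    """Placeholder for sampling n reasoning paths from a model at
    nonzero temperature; each would normally be a full chain of
    thought ending in a final answer."""
    return [f"reasoning path {i} for {problem}" for i in range(n_samples)]

def verify(problem: str, candidate: str) -> float:
    """Placeholder for a programmatic checker, e.g. comparing the
    final answer to a known solution or running a proof checker.
    Returns 1.0 if the candidate checks out, else 0.0."""
    return 1.0 if random.random() < 0.2 else 0.0

def collect_verified_traces(problems: list[str]) -> list[tuple[str, str]]:
    """Search-and-verify loop: sample many paths per problem and keep
    only the verified-correct ones as training signal. The discarded
    failures are the dead ends the model implicitly learns to avoid."""
    kept = []
    for problem in problems:
        for candidate in generate_candidates(problem):
            if verify(problem, candidate) == 1.0:
                kept.append((problem, candidate))
    return kept

print(len(collect_verified_traces(["P1", "P2", "P3"])))
```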

This expands the pool of training signals, but it also increases the compute required to generate them. In some domains, verification and data generation are relatively inexpensive; in others, such as browser or computer use, the cost becomes enormous at training scale.

With current architectures, strong domain generalization demands large, diverse datasets. Without them, models overfit, finding statistical shortcuts that score well without true understanding. The recent jump in STEM performance, fueled by labs paying PhD candidates to create training data, shows this clearly. Now, reinforcement learning on verifiable math problems is driving similar gains. Expect the next generation of models to make outsized progress in low-compute, easily verifiable domains such as math and coding.

Why this is not necessarily a shortcut to AGI

Many will point to breakthroughs in narrow domains, such as IMO results, as evidence that AGI is close. A more sober view is that these wins rely on a formula that works in domains like math and coding, where answers can be programmatically verified, data can be automatically generated, and rewards don’t require explicit step-by-step annotations. Many important domains lack such clean feedback loops. Common-sense reasoning, open-domain dialog, creative problem-solving, and understanding human intentions are far harder to quantify for automated reward.

Another limitation is context length and problem decomposition. IMO problems are relatively self-contained, often fitting known algebraic or combinatorial patterns. Many real-world tasks require reasoning over long documents or multi-step processes, and some, even within math, demand different modes such as visual or spatial proofs. Current models still have finite context windows and struggle with long-horizon reasoning. To tackle larger workflows, they can either “think longer” by spending vastly more compute, or break the task into chunks and coordinate them through orchestration; both approaches have inherent limits, and orchestration in particular constrains flexibility.
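As a toy illustration of the orchestration route, here is what a naive map-reduce decomposition looks like; `summarize_chunk` stands in for a bounded LLM call, and the rigid split-then-merge structure is exactly what limits flexibility, because no single step ever sees the whole problem.

```python
def summarize_chunk(chunk: str) -> str:
    """Stand-in for one bounded LLM call over a chunk that fits the
    context window."""
    return chunk[:40] + "..."

def orchestrate(document: str, window: int = 2000) -> str:
    """Naive map-reduce orchestration: split a long input into
    window-sized chunks, process each in isolation, then merge.
    The split points and merge logic are fixed up front, so the
    system cannot adapt its plan mid-task."""
    chunks = [document[i:i + window] for i in range(0, len(document), window)]
    partials = [summarize_chunk(chunk) for chunk in chunks]
    return " ".join(partials)

print(orchestrate("A very long report... " * 500))
```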

A related challenge is long-term memory. Many real-world tasks span weeks or even months, requiring AI to retain context, recall past interactions, and integrate new information without retraining from scratch. Most LLMs are stateless: once a session ends, they forget everything unless that information is explicitly reintroduced. When all relevant data cannot fit into the prompt, external memory systems fall back on ad-hoc retrieval pipelines to select a subset of past information. These pipelines create failure points where critical context can be lost.
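A toy version of such a retrieval pipeline makes the failure mode obvious: relevance is judged against the current query, so anything that doesn’t match it is silently dropped. The word-overlap scorer below is a stand-in for embedding similarity, but the failure point is the same.

```python
from collections import Counter

def score(query: str, memory: str) -> int:
    """Toy relevance score: count of shared words between the query
    and a stored memory. Real pipelines use embedding similarity."""
    q = Counter(query.lower().split())
    m = Counter(memory.lower().split())
    return sum((q & m).values())

def retrieve(query: str, memories: list[str], k: int = 1) -> list[str]:
    """Ad-hoc retrieval: keep only the top-k scoring memories and
    silently drop everything else."""
    return sorted(memories, key=lambda m: score(query, m), reverse=True)[:k]

memories = [
    "favorite restaurant is Luigi's Pasta",
    "has a severe peanut allergy",
    "prefers window seats on flights",
]
print(retrieve("recommend a restaurant for dinner", memories))
# -> only the restaurant preference comes back; the allergy note shares
#    no words with this query, so it never reaches the model.
```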

There is still no widely adopted, scalable “forgetting algorithm” that lets models keep what is relevant and discard what is not, avoiding both memory bloat and hallucinations from outdated knowledge. Google’s Titans architecture is one of the few publicized attempts at a persistent, managed memory system, but such approaches remain experimental. Without robust and efficient memory management, models will continue to struggle with long-horizon reasoning, where understanding depends on a living history rather than a single context window.
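In toy form, a forgetting policy is just an eviction rule over an external store; the unsolved part is producing the relevance scores automatically. The sketch below hard-codes them, which is precisely what doesn’t scale.

```python
import heapq

class BoundedMemory:
    """Toy external memory with a fixed budget: when full, the item
    with the lowest relevance score is forgotten. Choosing those
    scores well, so the model keeps what matters without bloating or
    relying on stale facts, is the open problem."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self._heap: list[tuple[float, str]] = []  # min-heap on relevance

    def add(self, text: str, relevance: float) -> None:
        heapq.heappush(self._heap, (relevance, text))
        if len(self._heap) > self.capacity:
            _, forgotten = heapq.heappop(self._heap)
            print(f"forgot: {forgotten}")

    def contents(self) -> list[str]:
        return [text for _, text in self._heap]

mem = BoundedMemory(capacity=2)
mem.add("long-term project goals", relevance=0.9)
mem.add("notes from last week's meeting", relevance=0.4)
mem.add("one-off lunch order", relevance=0.1)   # lowest score, forgotten
print(mem.contents())
```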

Finally, consider that general intelligence implies flexibility: the ability to fluidly handle any domain or task. But an AI that’s a savant at contest math or coding might still fail at commonsense reasoning or adaptivity outside its training distribution. The IMO accomplishment fits this pattern: it’s a remarkable milestone for automated math reasoning, but it doesn’t indicate the system possesses human-like understanding across all domains. In fact, human IMO champions themselves are specialists, and we wouldn’t assume a top mathematician is automatically an expert novelist or politician. Likewise, our math-genius models remain largely bounded by their domains of expertise.

Lack of geometrical thinking in Problem 6

This discussion would not be complete without the one IMO problem that no model was able to solve. While the other five gold-medal problems could be attacked through familiar algebraic or combinatorial reasoning, Problem 6 required a constructive tiling proof that humans almost universally approach through visual-spatial reasoning. A human mathematician would sketch the grid, look for geometric patterns, and quickly notice, for instance, that a tile can lie both to the left and to the right of a given region. The independent Gemini 2.5 Pro system instead produced a purely algebraic, text-based derivation that entirely missed this possibility. The mistake was not just a wrong conclusion but a wrong mode of thought: it never engaged with the problem in the visual terms that make it tractable for humans.

This reflects a deeper gap. Even multimodal LLMs tend to translate all problems into linear sequences of text tokens rather than manipulating and reasoning over diagrams during “thinking” steps. Their internal search is optimized for extending textual proofs, not for testing spatial hypotheses or interacting with a visual workspace. To make progress on problems suited to geometric reasoning, training needs to reward visual reasoning itself, not just the final textual proof. That points toward multimodal RL where the agent thinks on a canvas, edits diagrams, runs geometry or tiling checkers, and receives stepwise rewards for maintaining correct visual invariants and constructions.
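A hypothetical stepwise reward for such an environment might look like the sketch below: a tiling checker that rewards the agent for placements preserving the visual invariants (on the board, no overlaps) rather than judging only the final proof. The 5x5 grid and reward values are illustrative, not drawn from any actual system.

```python
GRID = 5  # toy grid, far smaller than the actual IMO problem

def place_tile(occupied: set, tile: set) -> tuple[set, float]:
    """Stepwise reward for a toy tiling environment: the agent
    proposes the set of cells a rectangle would cover and is rewarded
    only if the placement keeps the visual invariants (stays on the
    board, overlaps nothing). Invalid moves are penalized immediately
    rather than at the end of the attempt."""
    board = {(r, c) for r in range(GRID) for c in range(GRID)}
    if tile <= board and not (tile & occupied):
        return occupied | tile, +1.0   # invariant preserved
    return occupied, -1.0              # invariant violated

occupied: set = set()
occupied, r1 = place_tile(occupied, {(0, 0), (0, 1)})   # valid domino
occupied, r2 = place_tile(occupied, {(0, 1), (0, 2)})   # overlap -> penalty
print(r1, r2, sorted(occupied))
```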

The challenge is a cold start because current pretraining does not build strong visual reasoning priors. In text, next-token prediction often exposes the model to full reasoning chains from math solutions, proofs, and worked examples, which teaches it to extend and critique arguments step by step. In vision, common objectives like masked-patch prediction or image and text contrastive learning do not require multi-step causal reasoning about spatial relationships. As a result, even multimodal models learn to treat images as static things to describe rather than dynamic environments to reason within. This remains an open problem that we expect to see improvement in as RL environments and model architectures for visual reasoning mature.

Why GPT-5 doesn’t show these gains

If reinforcement learning and inference-time search are so powerful, one might ask: why didn’t OpenAI’s new GPT-5 model use them to blow past GPT-4 on general tasks? The short answer is that we are just at the beginning of the “RL era,” and building RL pipelines at scale is hard and time-consuming. Training an LLM with traditional supervised learning (next-word prediction on a big text corpus) is already a colossal effort, but it is relatively straightforward: you gather data and crank through it. By contrast, reinforcement learning at scale introduces new complexity: you need an environment for the model to interact with, a well-shaped reward function for each step, possibly human or AI feedback in the loop, and a huge number of trial runs. For structured tasks like math proofs or code generation, researchers have managed to build such pipelines (e.g., an automated proof checker or a suite of test cases acts as the “environment” providing rewards). But for a vast range of tasks, from using web browsers to controlling computers or engaging in open-ended dialog, the infrastructure just isn’t there yet.
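For verifiable domains, that environment can be as simple as a test harness. The sketch below is a minimal, hypothetical version of the idea for code generation, with a unit test standing in for the reward function; a math pipeline would swap in a proof checker.

```python
import subprocess
import sys
import tempfile
import textwrap

class CodeTaskEnv:
    """Minimal sketch of the environment-plus-reward piece of an RL
    pipeline for code generation: the 'environment' is a unit test and
    the reward is whether a candidate solution passes it."""

    TEST = "assert add(2, 3) == 5 and add(-1, 1) == 0"

    def step(self, candidate_code: str) -> float:
        # Write the candidate plus the test to a temp file and run it;
        # a clean exit means the solution is verified.
        program = textwrap.dedent(candidate_code) + "\n" + self.TEST + "\n"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True)
        return 1.0 if result.returncode == 0 else 0.0

env = CodeTaskEnv()
print(env.step("def add(a, b):\n    return a + b"))   # reward 1.0
print(env.step("def add(a, b):\n    return a - b"))   # reward 0.0
```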

In practical terms, many existing applications (web browsing, office software, etc.) were built for humans and optimized for human-perceived latency and user experience, not for feeding millions of learning examples to an AI. One solution is to build new “RL-native” sandbox environments that mimic these tasks without the pitfalls. We’re already seeing efforts to create “highly realistic digital twin” environments that simulate the software or world for the AI to safely practice in, with the benefits of more visibility and lower computational costs. But developing such simulators (and making them faithful to the real world) is a massive undertaking, domain by domain. It’s often more feasible to start fresh (design a controlled environment from scratch) than to “retrofit the old application” for dense feedback. This build-out takes time, and we’re only beginning to see the fruits. OpenAI’s team alluded to this when they explained that GPT-5 did not incorporate the state-of-the-art math reasoning training used in their IMO demo model since those techniques weren’t production-ready yet. Given that building these pipelines takes time, we expect those RL-driven gains to surface across vendors in the next generation, particularly in specialized domains.

What the IMO tells us about the next phase of LLMs

The 2025 IMO results are a milestone worth celebrating, not because they prove we are on the verge of general intelligence, but because they reveal the kind of targeted, verifiable training loops that can unlock dramatic gains in specific domains. For math and code, the recipe is becoming clear: structured environments, cheap verification, and reinforcement learning that rewards correct reasoning steps. But most of the world’s knowledge work doesn’t yet fit that mold, and until we can build equally rich environments for messy, ambiguous tasks, progress will remain uneven. The next wave of breakthroughs will depend less on bigger base models and more on the painstaking, domain-by-domain construction of RL-native ecosystems and context window expansion. That’s a slower, more deliberate path, but also one that will ultimately shape whether LLMs grow from domain savants into truly general problem-solvers.
