AI showdown on the ARC benchmark of AI models: GPT-5 vs. Grok vs o3

Konrad Wolfenstein

10 months ago

The great disillusionment: Why increasingly larger AI models fail the crucial intelligence test

What is the ARC-AGI benchmark and why was it developed?

The ARC-AGI benchmark is a test series for measuring the general intelligence of AI systems, developed in 2019 by François Chollet. ARC stands for “Abstraction and Reasoning Corpus for Artificial General Intelligence.” The benchmark was created to evaluate the ability of AI systems to understand and solve new tasks for which they were not explicitly trained.

The benchmark's development is based on Chollet's definition of intelligence from his seminal paper, "On the Measure of Intelligence." He argues that true intelligence lies not in mastering specific tasks, but in the efficiency of acquiring new skills. The test consists of visual puzzles with colored grids, where AI systems must identify the underlying transformation rules and apply them to new examples.
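
To make the puzzle format concrete, here is a minimal sketch in Python. It assumes the layout used by the public ARC task files: a "train" list of demonstration pairs and a "test" list, with every grid stored as a list of rows whose cells are integers from 0 to 9 standing for colors. The toy task and the mirror_left_right rule are invented for illustration and are not taken from the benchmark.

```python
# A toy task in the layout of the public ARC JSON files (assumed here):
# grids are lists of rows, and each cell is an integer 0-9 naming a color.
toy_task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [
        {"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]},
    ],
}


def mirror_left_right(grid):
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def explains_demonstrations(rule, task):
    """A rule is only plausible if it reproduces every demonstration output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


def score(rule, task):
    """Fraction of test grids reproduced exactly (no partial credit in this sketch)."""
    hits = sum(rule(pair["input"]) == pair["output"] for pair in task["test"])
    return hits / len(task["test"])


print(explains_demonstrations(mirror_left_right, toy_task))
print(score(mirror_left_right, toy_task))
```

The solver never sees the test output. It has to infer the rule from the demonstration pairs alone and then produce the complete output grid.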

How does ARC-AGI differ from other AI benchmarks?

Unlike conventional AI tests, which often rely on prior knowledge or memorized patterns, ARC-AGI focuses on so-called “Core Knowledge Priors”—fundamental cognitive skills such as object permanence, counting, and spatial reasoning. These skills are typically acquired by humans around the age of four.

The crucial difference lies in the fact that ARC-AGI is specifically designed to be unsolvable through mere memorization or data interpolation. Each task in the benchmark is unique and was developed specifically for the test, so no examples of it should exist online. This makes the test resistant to the typical strategies of AI systems that rely on large training datasets.

What are the different versions of the ARC-AGI benchmark?

There are now three main versions of the benchmark:

ARC-AGI-1

The original 2019 version consists of static visual puzzles. Humans achieve an average score of 95% on these puzzles, while most AI systems long scored below 5%.

ARC-AGI-2

This enhanced version was released in 2025 and is specifically designed to pose a challenge even for modern reasoning systems. While humans continue to achieve nearly 100% success, even advanced AI models only manage 10-20% of the tasks.

ARC-AGI-3

The latest version, still under development, introduces interactive elements. Instead of static puzzles, AI agents must learn through exploration and trial and error in a grid world, much like humans explore new environments.

How do different AI models perform in the ARC-AGI tests?

The performance differences between different AI models are significant:

For ARC-AGI-1, Grok 4 achieves approximately 68%, while GPT-5 reaches 65.7%. The cost per task is approximately US$1 for Grok 4 and US$0.51 for GPT-5.

In ARC-AGI-2, the more difficult test, performance drops drastically: GPT-5 achieves only 9.9% at a cost of US$0.73 per task, while Grok 4 (Thinking) performs better at about 16%, but at a significantly higher cost of US$2-4 per task.

As expected, cheaper model variants show weaker performance: GPT-5 Mini achieves 54.3% on ARC-AGI-1 and 4.4% on ARC-AGI-2, while GPT-5 Nano reaches only 16.5% and 2.5%, respectively.

What is the secret behind the o3 preview model?

OpenAI's o3 preview model represents a special case. In December 2024, it achieved impressive performance scores of 75.7% to 87.5% on ARC-AGI-1, depending on the computing power used. This was the first time an AI system had surpassed the human performance limit of 85%.

However, there is one important limitation: The publicly available version of o3 performs significantly worse than the original preview version. According to ARC Prize, the released o3 only achieves 41% (low compute) and 53% (medium compute) on ARC-AGI-1, compared to the 76-88% of the preview version.

OpenAI confirmed that the published model has a different, smaller architecture and is optimized for chat and product applications. This discrepancy raises questions about its actual capabilities and highlights the importance of critically evaluating benchmark results from unpublished models.

How does the ARC Prize competition work?

The ARC Prize is an annual competition with a total prize purse of over one million US dollars, aiming to promote open-source progress towards AGI (artificial general intelligence). The current 2025 competition runs from March 26 to November 3 on the Kaggle platform.

The prize structure includes:

Grand Prize (USD 700,000): Unlocked when a team achieves 85% accuracy on the private evaluation dataset
Top Score Prize (USD 75,000): For the teams with the highest scores
Paper Prize (USD 50,000): For the most significant conceptual advances
Other prizes (USD 175,000): Additional categories to be announced

All winners must publish their solutions as open source. This aligns with the mission of the ARC Prize Foundation to make AGI advances accessible to the entire research community.

What are the technical challenges of the ARC-AGI benchmark?

The tasks in ARC-AGI require several cognitive abilities that come naturally to humans but are extremely difficult for AI systems:

Symbol interpretation

AI must understand abstract symbols and derive their meaning from the context.

Multi-stage compositional thinking

Problems must be broken down into sub-steps and solved sequentially.

Context-dependent rule application

The same rule may need to be applied differently depending on the context.

Generalization from a few examples

Typically, only 2-3 demonstration pairs are available from which the transformation rule must be derived.

What role does test-time training play in solving ARC-AGI?

Test-Time Training (TTT) has proven to be a promising approach for improving performance on ARC-AGI. This method dynamically adjusts the model parameters to the current input data during inference, instead of relying solely on pre-trained knowledge.

MIT researchers have shown that TTT significantly improves the performance of language models on ARC-AGI. The method allows the models to adapt during task solving and learn from specific examples. This mimics human problem-solving behavior, where we spend more time on difficult problems.
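
The idea can be illustrated with a deliberately tiny sketch in Python. This is not the method from the MIT work: a per-color lookup model stands in for a language model, and every name here (adapt_at_test_time, predict, the toy grids) is hypothetical. The model starts with a prior that copies each cell unchanged, then takes a few gradient steps on the demonstration pairs of the current task before it predicts the test output.

```python
import math

NUM_COLORS = 10


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_at_test_time(weights, demos, steps=50, lr=1.0):
    """Take a few gradient steps on this task's demonstration pairs only."""
    weights = [row[:] for row in weights]  # leave the starting weights untouched
    cells = [
        (src, dst)
        for grid_in, grid_out in demos
        for row_in, row_out in zip(grid_in, grid_out)
        for src, dst in zip(row_in, row_out)
    ]
    for _ in range(steps):
        for src, dst in cells:
            probs = softmax(weights[src])
            for color in range(NUM_COLORS):
                # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(dst).
                grad = probs[color] - (1.0 if color == dst else 0.0)
                weights[src][color] -= lr * grad
    return weights


def predict(weights, grid):
    """Map every cell to the output color with the highest logit."""
    return [
        [max(range(NUM_COLORS), key=lambda c: weights[v][c]) for v in row]
        for row in grid
    ]


# Starting model (assumed prior): copy each color unchanged.
base = [[2.0 if i == j else 0.0 for j in range(NUM_COLORS)] for i in range(NUM_COLORS)]

# Toy task: colors 1 and 2 swap, color 0 stays.
demos = [
    ([[1, 0], [0, 2]], [[2, 0], [0, 1]]),
    ([[2, 2], [1, 0]], [[1, 1], [2, 0]]),
]
test_input = [[0, 1, 2]]

print(predict(base, test_input))      # prediction before adaptation
adapted = adapt_at_test_time(base, demos)
print(predict(adapted, test_input))   # prediction after adaptation
```

The adapted weights are discarded after the task. Each new task starts again from the same base model, which is what distinguishes this from ordinary fine-tuning.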

Artificial intelligence beyond scaling: Insights from the ARC-AGI test

What do the results mean for the development of AGI?

The results reveal a significant gap between human and artificial intelligence. While humans solve ARC-AGI tasks intuitively, even the most advanced AI systems fail at basic cognitive tasks.

François Chollet argues that the current paradigm of AI development, training ever larger models with more data, has reached its limits. The poor results on ARC-AGI, despite exponential increases in model size, prove, in his view, that “fluid intelligence does not arise from scaling pre-training.”

The future could lie in new approaches such as Test-Time Adaptation, where models can change their own states at runtime to adapt to new situations.

What does the future hold for the ARC-AGI benchmark?

The ARC Prize Foundation plans continuous development of the benchmark. ARC-AGI-3, with its interactive elements, is scheduled for full release in 2026 and will include approximately 100 unique environments.

The Foundation aims to develop benchmarks that will serve as a "North Star" for AGI development. This involves not only measuring progress but also guiding research in directions that could lead to true general intelligence.

What are the economic implications of benchmark performance?

The cost of solving ARC-AGI problems varies greatly between models and has a direct impact on practical applicability.

While simple tasks can be solved with API costs in the cent range, the costs for complex reasoning tasks rise rapidly. The o3 model, for example, can cost up to $1,000 per task with high computing power.

This cost structure shows that even if technical breakthroughs are achieved, economic feasibility remains a crucial factor for the widespread application of AGI technologies.

What are the philosophical implications of the ARC-AGI results?

The results raise fundamental questions about the nature of intelligence. The benchmark shows that there is a fundamental difference between memorizing patterns and true understanding.

The fact that humans solve these tasks effortlessly, while AI systems fail, suggests that human intelligence functions qualitatively differently from current AI approaches. This supports Chollet's argument that AGI requires more than just larger models and more data.

How does ARC-AGI influence the direction of AI research?

The benchmark has already led to a rethink in AI research. Instead of focusing solely on scaling models, leading labs are now exploring alternative approaches such as test-time compute and adaptive systems.

This shift is also reflected in investments: companies are increasingly investing in research on more efficient reasoning and problem-solving instead of ever larger training runs.

What role does the open-source community play?

The ARC Prize Foundation emphasizes the importance of open-source development for AGI progress. All competition winners must make their solutions publicly available.

This philosophy is based on the conviction that AGI is too important to be developed solely in closed laboratories. The Foundation sees itself as a catalyst for a collaborative, transparent research community.

What are the limitations of the ARC-AGI benchmark?

Despite its importance, ARC-AGI also has limitations. Chollet himself emphasizes that passing the test is not synonymous with achieving AGI. The benchmark measures only one aspect of intelligence – the ability to solve abstract problems.

Other important aspects such as creativity, emotional intelligence, or long-term planning are not assessed. Furthermore, there is a risk that systems specifically optimized for ARC-AGI will be developed that pass the test without actually being generally intelligent.

How are the costs for AI models developing in the context of ARC-AGI?

The cost trend shows a clear pattern: performance increases only slowly, while the cost of each marginal improvement rises steeply.

This cost dynamic leads to an important insight: efficiency is becoming the decisive differentiator. The ARC Prize Foundation emphasizes that not only accuracy, but also the cost per solved problem is a crucial criterion.
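
A rough way to put a number on this is to divide the cost per attempted task by the success rate. The Python sketch below does that with the figures quoted earlier in this article. Note that this definition of "cost per solved task" is an assumption made here for illustration and not necessarily the metric ARC Prize reports, and that the Grok 4 inputs are the approximate values given above.

```python
# Figures quoted earlier in this article: (accuracy, cost per attempted task in USD).
# The Grok 4 values are approximate ("approximately 68%", "approximately US$1").
results = {
    "GPT-5 on ARC-AGI-1": (0.657, 0.51),
    "Grok 4 on ARC-AGI-1": (0.68, 1.00),
    "GPT-5 on ARC-AGI-2": (0.099, 0.73),
}


def cost_per_solved_task(accuracy, cost_per_task):
    """Expected spend per correct answer: cost per attempt divided by success rate."""
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return cost_per_task / accuracy


for name, (accuracy, cost) in results.items():
    print(f"{name}: ${cost_per_solved_task(accuracy, cost):.2f} per solved task")
```

With these inputs, GPT-5 comes to roughly $0.78 per solved ARC-AGI-1 task and about $7.37 per solved ARC-AGI-2 task. The price per attempt rises only modestly between the two versions, but the price per correct answer rises more than ninefold.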

What does ARC-AGI mean for the future of work?

The results have reassuring implications for many professions. The inability of AI systems to solve basic thinking tasks shows that human cognitive abilities are far from being replaced.

At the same time, progress in specialized tasks suggests that AI will continue to serve as a tool to support human work, rather than completely replacing it.

What new research approaches arise from ARC-AGI?

The benchmark has inspired several innovative research directions:

Program Synthesis

Systems that generate programs to solve problems.

Neurosymbolic approaches

Combination of neural networks with symbolic reasoning.

Multi-agent systems

Several specialized agents work together on a task.

Evolutionary algorithms

Systems that develop solutions through evolution.
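
As an illustration of the first of these directions, program synthesis, the following Python sketch searches a very small, hypothetical set of grid operations for a sequence that reproduces every demonstration pair. The operation set, the names, and the brute-force search are assumptions chosen for brevity. Real ARC solvers use far larger operation sets and guided search.

```python
from itertools import product

# A tiny, hypothetical DSL of grid operations (not part of any ARC tooling).
PRIMITIVES = {
    "identity": lambda g: [row[:] for row in g],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run_program(program, grid):
    """Apply the named operations to the grid from left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def synthesize(demos, max_length=2):
    """Return the first shortest program consistent with all demos, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run_program(program, x) == y for x, y in demos):
                return program
    return None


# Demonstrations of a 90-degree clockwise rotation.
demos = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0, 0], [0, 6, 0]], [[0, 5], [6, 0], [0, 0]]),
]
program = synthesize(demos)
print(program)
print(run_program(program, [[7, 8, 9]]))
```

No single operation in this set performs a rotation, so the search has to compose two of them. The number of candidate programs grows exponentially with program length, which is why practical systems need a learned or heuristic guide for the search.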

What is the ARC Prize Foundation's vision for the future?

The Foundation pursues a clear mission: to serve as a "North Star" for the development of open AGI. This involves not only technical benchmarks, but also the creation of an ecosystem that fosters innovation while ensuring that AGI advances benefit all of humanity.

The continuous development of new benchmark versions is intended to ensure that the bar is constantly raised and research does not stagnate. With ARC-AGI-3 and future versions, the Foundation aims to further explore the limits of what AI can do and what it still lacks.
