AI showdown on the ARC-AGI benchmark: GPT-5 vs. Grok vs. o3
Published on: August 8, 2025 / Updated on: August 8, 2025 – Author: Konrad Wolfenstein
The great disillusionment: Why increasingly large AI models fail the crucial intelligence test
What is the ARC-AGI benchmark and why was it developed?
The ARC-AGI benchmark is a series of tests for measuring the general intelligence of AI systems, developed by François Chollet in 2019. ARC stands for "Abstraction and Reasoning Corpus for Artificial General Intelligence." The benchmark was created to evaluate the ability of AI systems to understand and solve new tasks for which they have not been explicitly trained.
The development of the benchmark is based on Chollet's definition of intelligence from his seminal paper "On the Measure of Intelligence." He argues that true intelligence lies not in the mastery of specific tasks, but in the efficiency of acquiring new skills. The test consists of visual puzzles with colored grids, where AI systems must recognize the underlying transformation rules and apply them to new examples.
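To make the format concrete: each ARC task is distributed as a small JSON structure containing a few demonstration pairs ("train") and one or more test inputs ("test"), with every grid encoded as a matrix of integers from 0 to 9 that stand for colors. A minimal sketch in Python follows; the toy task below is invented for illustration, and real ARC puzzles are considerably harder:

```python
# A toy ARC task in the published JSON format: a few "train" demonstration
# pairs plus "test" inputs whose outputs the solver must predict.
# Each grid is a list of rows; every cell is an integer 0-9 encoding a color.
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # a solver should produce [[0, 3], [3, 0]]
    ],
}

def grid_size(grid):
    """Return (rows, columns) of a grid."""
    return len(grid), len(grid[0])

for i, pair in enumerate(example_task["train"]):
    print(f"demo {i}: {grid_size(pair['input'])} -> {grid_size(pair['output'])}")
```

The solver's job is to infer the transformation rule from the demonstration pairs alone (here: a horizontal flip) and apply it to the test input.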
How does ARC-AGI differ from other AI benchmarks?
Unlike conventional AI tests, which often rely on prior knowledge or memorized patterns, ARC-AGI focuses on so-called "core knowledge priors" – cognitive skills such as object permanence, counting, and spatial understanding. These skills are typically acquired by the age of four.
The key difference is that ARC-AGI is specifically designed so that it cannot be solved through pure memorization or data interpolation. Each task in the benchmark is unique and was developed specifically for the test, so no examples of it should exist online. This makes the test resistant to the usual strategies of AI systems based on large amounts of training data.
What are the different versions of the ARC-AGI benchmark?
There are now three main versions of the benchmark:
ARC-AGI-1
The original version from 2019 consists of static visual puzzles. Humans achieve an average of 95% on it, while most AI systems long scored below 5%.
ARC-AGI-2
This enhanced version, released in 2025, is specifically designed to challenge even modern reasoning systems. While humans continue to achieve nearly 100% performance, even advanced AI models can only manage 10-20% of the tasks.
ARC-AGI-3
The latest version, still in development, introduces interactive elements. Instead of static puzzles, AI agents must learn through exploration and trial and error in a grid world, similar to how humans explore new environments.
How do different AI models perform in the ARC-AGI tests?
The performance differences between the models are significant:
On ARC-AGI-1, Grok 4 achieves approximately 68%, while GPT-5 is at 65.7%. The cost per task is approximately $1 for Grok 4 and $0.51 for GPT-5.
On ARC-AGI-2, the more difficult test, performance drops dramatically: GPT-5 achieves only 9.9% at a cost of $0.73 per task, while Grok 4 (Thinking) performs better at about 16%, albeit at a significantly higher cost of $2-4.
As expected, the cheaper model variants perform weaker: GPT-5 Mini achieves 54.3% on ARC-AGI-1 and 4.4% on ARC-AGI-2, while GPT-5 Nano reaches only 16.5% and 2.5%, respectively.
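Read together, these numbers allow a rough accuracy-per-dollar comparison, which matters because the ARC Prize tracks efficiency alongside raw scores. A small illustrative calculation using only the figures quoted above; the $2-4 range for Grok 4 (Thinking) is taken at its midpoint, and the metric itself is just arithmetic, not an official ranking:

```python
# ARC-AGI-2 scores (%) and per-task costs (USD) as quoted above.
# Grok 4 (Thinking) is listed at $2-4 per task; we assume the $3 midpoint.
arc2_results = {
    "GPT-5":             {"score": 9.9,  "cost": 0.73},
    "Grok 4 (Thinking)": {"score": 16.0, "cost": 3.00},
}

for name, r in arc2_results.items():
    # Percentage points of ARC-AGI-2 accuracy per dollar spent per task.
    print(f"{name}: {r['score'] / r['cost']:.1f} points per dollar")
```

On these figures, GPT-5 delivers roughly 13.6 points per dollar against about 5.3 for Grok 4 (Thinking): the lower-scoring model is the more cost-efficient one, a tension the efficiency discussion later in this article picks up.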
What is the secret of the o3 preview model?
OpenAI's o3-preview model represents a special case. In December 2024, it achieved an impressive 75.7% to 87.5% on ARC-AGI-1, depending on the computing power used. This was the first time an AI system surpassed the human performance threshold of 85%.
However, there is one important limitation: The publicly available version of o3 performs significantly worse than the original preview version. According to the ARC Prize, the released version of o3 achieves only 41% (low compute) and 53% (medium compute) on ARC-AGI-1, compared to the 76-88% of the preview version.
OpenAI confirmed that the published model has a different, smaller architecture and is optimized for chat and product applications. This discrepancy raises questions about its actual capabilities and highlights the importance of critically examining benchmark results from unpublished models.
How does the ARC Prize competition work?
The ARC Prize is an annual competition with a total prize fund of over one million US dollars aimed at fostering open-source progress toward AGI. The current 2025 competition runs from March 26 to November 3 on the Kaggle platform.
The prize structure includes:
- Grand Prize (700,000 USD): Unlocked when a team achieves 85% accuracy on the private evaluation dataset
- Top Score Prize (75,000 USD): For the teams with the highest scores
- Paper Prize (50,000 USD): For the most significant conceptual advances
- Additional Prizes (175,000 USD): further categories to be announced
Importantly, all winners must publish their solutions as open source. This is in line with the ARC Prize Foundation's mission to make AGI advances accessible to the entire research community.
What are the technical challenges of the ARC-AGI benchmark?
The tasks in ARC-AGI require several cognitive skills that are natural for humans but extremely difficult for AI systems:
Symbol interpretation
AI must understand abstract symbols and derive their meaning from the context.
Multi-level compositional thinking
Problems must be broken down into sub-steps and solved sequentially.
Context-dependent rule application
The same rule may need to be applied differently depending on the context.
Generalization from a few examples
Typically, only 2-3 demonstration pairs are available from which the transformation rule must be derived.
What role does test-time training play in solving ARC-AGI?
Test-time training (TTT) has proven to be a promising approach for improving performance on ARC-AGI. This method dynamically adapts model parameters to the current input data during inference, rather than relying solely on pre-trained knowledge.
MIT researchers have demonstrated that TTT significantly improves the performance of language models on ARC-AGI. The method allows the models to adapt during task solving and learn from specific examples. This mimics human problem-solving behavior, in which we spend more time on difficult problems.
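A minimal sketch of the TTT idea in PyTorch follows; this is not the MIT group's actual pipeline, only the core mechanism: before predicting, a throwaway copy of the model takes a few gradient steps on the current task's own demonstration pairs. How the grids are encoded as tensors is left open here:

```python
import copy
import torch

def test_time_train(model, demo_inputs, demo_outputs, steps=10, lr=1e-4):
    """Fine-tune a temporary copy of the model on one task's demo pairs.

    Minimal sketch of test-time training: the base model stays frozen;
    only the copy is adapted, then used for this single task.
    demo_inputs / demo_outputs are assumed to be tensors encoding the
    task's 2-3 demonstration grids (the encoding is left open here).
    """
    adapted = copy.deepcopy(model)  # never mutate the base model
    adapted.train()
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = adapted(demo_inputs)
        loss = torch.nn.functional.cross_entropy(logits, demo_outputs)
        loss.backward()
        optimizer.step()

    adapted.eval()
    return adapted  # predict the test output with this adapted copy
```

The important design choice is that the base model is never mutated: each task gets its own short-lived adaptation, mirroring how a human invests extra effort in exactly one hard puzzle.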
Artificial Intelligence Beyond Scale: Insights from the ARC-AGI Test
What do the results mean for the development of AGI?
The results reveal a clear gap between human and artificial intelligence. While humans solve ARC-AGI tasks intuitively, even state-of-the-art AI systems fail at basic reasoning tasks.
François Chollet argues that the current paradigm of AI development – training ever larger models with more data – has reached its limits. The poor results on ARC-AGI, despite exponentially increasing model size, prove, in his view, that "fluid intelligence does not arise from scaling pre-training."
The future could lie in new approaches such as test-time adaptation, where models can change their own states at runtime to adapt to new situations.
What does the future of the ARC-AGI benchmark look like?
The ARC Prize Foundation plans to continuously develop the benchmark. ARC-AGI-3, with its interactive elements, is scheduled for full release in 2026 and will include approximately 100 unique environments.
The Foundation's goal is to develop benchmarks that serve as a "north star" for AGI development. This not only aims to measure progress but also to guide research in directions that could lead to true general intelligence.
What are the economic implications of benchmark performance?
The cost of solving ARC-AGI tasks varies greatly between models and has a direct impact on practical applicability.
While simple tasks can be solved with API costs in the cent range, the costs for complex reasoning tasks rise rapidly. The o3 model, for example, can cost up to $1,000 per task in its high-compute configuration.
This cost structure demonstrates that even if technical breakthroughs are achieved, economic feasibility remains a crucial factor for the widespread adoption of AGI technologies.
What are the philosophical implications of the ARC-AGI results?
The results raise fundamental questions about the nature of intelligence. The benchmark demonstrates a deep difference between memorizing patterns and genuine understanding.
The fact that humans solve these tasks effortlessly while AI systems fail suggests that human intelligence functions qualitatively differently than current AI approaches. This supports Chollet's argument that AGI requires more than just larger models and more data.
How does ARC-AGI influence AI research?
The benchmark has already led to a rethink in AI research. Instead of focusing exclusively on scaling models, leading labs are now exploring alternative approaches such as test-time compute and adaptive systems.
This shift is also reflected in investments: companies are increasingly investing in research into more efficient reasoning and problem-solving instead of in ever-larger training runs.
What role does the open source community play?
The ARC Prize Foundation emphasizes the importance of open-source development for AGI advancements. All competition winners are required to make their solutions publicly available.
This philosophy is based on the conviction that AGI is too important to be developed solely in closed laboratories. The Foundation sees itself as a catalyst for a collaborative, transparent research community.
What are the limitations of the ARC-AGI benchmark?
Despite its importance, ARC-AGI also has limitations. Chollet himself emphasizes that passing the test does not equate to achieving AGI. The benchmark measures only one aspect of intelligence – the ability to solve abstract problems.
Other important aspects such as creativity, emotional intelligence, or long-term planning are not measured. Furthermore, there is a risk that systems specifically optimized for ARC-AGI will be developed that pass the test without being truly intelligent in general.
How are the costs of AI models developing in the context of ARC-AGI?
Cost developments show an interesting pattern: while performance improves only slowly, the costs of marginal improvements are exploding.
This cost dynamic leads to an important insight: efficiency is becoming the key differentiator. The ARC Prize Foundation emphasizes that not only accuracy but also the cost per solved task is an important criterion.
What does ARC-AGI mean for the future of work?
The results have reassuring implications for many professions. The inability of AI systems to solve basic reasoning tasks demonstrates that human cognitive abilities are far from being replaced.
At the same time, progress in specialized tasks suggests that AI will continue to serve as a tool to support human work rather than replace it entirely.
What new research approaches are emerging through ARC-AGI?
The benchmark has inspired several innovative research directions:
Program Synthesis
Systems that generate programs to solve problems (a minimal sketch follows this list).
Neurosymbolic approaches
Combination of neural networks with symbolic reasoning.
Multi-agent systems
Several specialized agents work together.
Evolutionary algorithms
Systems that develop solutions in an evolutionary manner.
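To illustrate the program-synthesis direction, here is a deliberately tiny sketch: enumerate compositions of a handful of grid primitives and keep the first program that reproduces every demonstration pair. The primitive set and the toy demos are assumptions chosen for brevity; real ARC solvers search far richer program spaces:

```python
from itertools import product

# A toy library of grid primitives (grids are tuples of row tuples).
PRIMITIVES = {
    "identity":  lambda g: g,
    "flip_h":    lambda g: tuple(row[::-1] for row in g),
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: tuple(zip(*g)),
}

def synthesize(demos, max_depth=2):
    """Return the first primitive composition consistent with all demos."""
    names = list(PRIMITIVES)
    for depth in range(1, max_depth + 1):
        for program in product(names, repeat=depth):
            def run(grid, program=program):
                for name in program:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(run(inp) == out for inp, out in demos):
                return program, run
    return None, None

demos = [(((0, 1), (1, 0)), ((1, 0), (0, 1))),
         (((2, 0), (0, 2)), ((0, 2), (2, 0)))]
program, run = synthesize(demos)
print(program)                 # ('flip_h',)
print(run(((3, 0), (0, 3))))   # apply the found program to a test input
```

Brute-force enumeration like this collapses on realistic tasks, which is why current research combines such symbolic search with neural guidance, as the neurosymbolic approaches above suggest.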
What is the ARC Prize Foundation's vision for the future?
The Foundation has a clear mission: to serve as a "North Star" for the development of open AGI. This isn't just about setting technical benchmarks, but about creating an ecosystem that fosters innovation while ensuring that AGI advances benefit all of humanity.
The continuous development of new benchmark versions is intended to ensure that the bar is continually raised and research does not stagnate. With ARC-AGI-3 and future versions, the Foundation aims to further explore the limits of what AI can do and what it still lacks.