Gemini 3.5 or even 4.0? Codename “Snow Bunny”: Leaked benchmark data of a supposedly new Google model

Konrad Wolfenstein

5 months ago

The turning point in artificial intelligence? Google's technological breakthrough that redefines global competitiveness?

An engineering adventure on the edge of the cognitive revolution

Benchmark data leaked in January 2026 from a supposedly new Google model codenamed “Snow Bunny” points to a possible turning point in artificial intelligence that goes well beyond a numbers game. Rather than incremental progress in model development, the data suggests an architecture modeled on the structure of human reasoning itself. If the figures hold, the performance differences are not simply numerical but qualitative, with direct implications for European and German industrial policy and for the future of competition between the tech superpowers of the USA, China, and a fragmented Europe.

The Hieroglyph benchmark, on which Snow Bunny reportedly achieves an 80 percent success rate (well ahead of GPT-5.2 at 55 percent and Gemini 3.0 Pro at 45 percent), does not simply test knowledge or pattern recognition, but lateral thinking. Lateral thinking is the human ability to see connections between unrelated concepts, to creatively circumvent established thought patterns, and to approach problems from unusual angles. It is a mechanism that defies purely statistical prediction, and it is the reason why creativity, innovation, and genuine problem-solving do not arise from scaling alone. Academic evaluations have generally found that even strong models score around or below 50 percent on lateral thinking tasks. Snow Bunny appears to have significantly surpassed this threshold.

The underlying technical innovation lies in the system architecture. Google appears to have implemented what AI research has pursued intensively since 2025: a division of reasoning into what psychologist Daniel Kahneman called “System 1” and “System 2” thinking. System 1 is fast, intuitive thinking based on statistical patterns. System 2 is slow, deliberate thinking that counts steps, questions assumptions, and evaluates multiple solution paths in parallel. Previous models such as GPT-5.2 or Gemini 3.0 primarily optimize System 1, the raw-speed pattern-matching capability, and approximate slower thinking only superficially through “chain-of-thought” prompting. Snow Bunny's architecture appears to implement a genuinely deeper reasoning framework, one that pursues multiple thought paths in parallel, tests hypotheses, and iteratively refines them.
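The difference between the two modes can be illustrated with a toy sketch. This is not Google's architecture, which the leak does not describe; the puzzle (find an integer x with x*x + 3*x equal to a target) and the names `system1_guess`, `error`, and `system2_search` are hypothetical and chosen only for illustration. The "System 1" function answers in one shot without checking, while the "System 2" function holds several hypotheses at once, verifies each, and refines them over several rounds.

```python
def system1_guess(target: int) -> int:
    """Fast heuristic: a single unverified guess (hypothetical example)."""
    return round(target ** 0.5)


def error(x: int, target: int) -> int:
    """Verifier: distance of a candidate from satisfying x*x + 3*x == target."""
    return abs(x * x + 3 * x - target)


def system2_search(target: int, n_paths: int = 4, max_rounds: int = 50) -> int:
    """Deliberate search: several hypotheses, each verified and refined."""
    seed = system1_guess(target)
    # Open several thought paths around the fast guess.
    low = -(n_paths // 2)
    candidates = [seed + offset for offset in range(low, low + n_paths)]
    for _ in range(max_rounds):
        # Verify: stop as soon as one hypothesis passes the check.
        best = min(candidates, key=lambda x: error(x, target))
        if error(best, target) == 0:
            return best
        # Refine: move each hypothesis to its best immediate neighbour.
        candidates = [
            min((c - 1, c, c + 1), key=lambda x: error(x, target))
            for c in candidates
        ]
    # No exact solution found: return the least-wrong hypothesis.
    return min(candidates, key=lambda x: error(x, target))


if __name__ == "__main__":
    target = 130  # 10*10 + 3*10
    print("System 1:", system1_guess(target), "error", error(system1_guess(target), target))
    print("System 2:", system2_search(target))
```

For the target 130, the one-shot guess is 11 and fails the check, while the verified search returns 10. The extra cost is more candidate evaluations per answer, which is the trade-off that "thinking longer" implies.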

Safety alignment: no longer merely a cost factor

One detail of the leaks is particularly significant for experts: both versions of the model, the "raw" variant and the "less raw" variant with stricter safety filters, achieve identical 80 percent success rates. This contradicts a long-held assumption in AI research that safety alignment, i.e., training against problematic outputs, necessarily impairs pure cognitive performance. If Google has resolved this classic efficiency-safety tradeoff, it represents a non-trivial breakthrough in post-training methodology. The implications are profound: it suggests that safety and capability need not be antagonistic, but that restructured training pipelines can maximize both simultaneously.

The comparison data itself requires caution. Benchmark screenshots are easily manipulated, and while the Hieroglyph test is known in academic circles, it is not as widely established and standardized as the classic MMLU (Massive Multitask Language Understanding) test, which remains the gold standard for general knowledge. However, the leaked data aligns with Google's public announcements in that the company introduced a feature called "Gemini Deep Think" back in November 2025—a mode in which Gemini models are allowed more time to think before responding, and which achieves measurable improvements on established benchmarks such as ARC-AGI-2 (45.1 percent) and GPQA Diamond (93.8 percent). This publicly verified data and the leaked Hieroglyph results speak a similar language: the point where computing power translates into true cognitive depth has been reached.

The market as an indicator of genuine competitive change

Market dynamics underpin the technical narrative with remarkable clarity. OpenAI's market share among AI users fell from 87 percent to 68 percent in 2025. At the same time, Google's Gemini rose from 5.4 percent to 18.2 percent. This shift is driven less by marketing or media attention than by a structural change in how AI is integrated into the productivity stack. Google has embedded Gemini in Chrome, Android, and Google Workspace. It is no longer an application that users consciously open, but an ambient capability already present in the operating system and everyday work tools. Adoption is thus no longer an active choice, but a default.

At the same time, Google is pursuing an aggressive pricing strategy. While GPT-5.2 costs $1.75 per million input tokens, Gemini Flash is priced at $0.50—a 71 percent discount. This isn't a promotional offer for market penetration, but a structural repositioning. With its own TPUs (Tensor Processing Units) and custom-chip infrastructure, Google has a radical cost structure advantage over OpenAI, which relies on Nvidia's GPUs and Microsoft's Azure infrastructure. This hardware depth isn't easily replicated.
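The quoted discount can be checked directly from the two prices. The short sketch below reproduces the arithmetic and shows what the gap means for a monthly bill; the two prices come from the text, while the monthly volume of 500 million input tokens and the function names are assumptions made only for illustration.

```python
# Prices per million input tokens, as quoted in the article.
GPT_5_2_USD_PER_M = 1.75
GEMINI_FLASH_USD_PER_M = 0.50


def discount_percent(reference: float, price: float) -> float:
    """Relative discount of `price` against `reference`, in percent."""
    return (reference - price) / reference * 100


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` input tokens at a given per-million price."""
    return tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    print(f"Discount: {discount_percent(GPT_5_2_USD_PER_M, GEMINI_FLASH_USD_PER_M):.1f} %")

    # Hypothetical workload: 500 million input tokens per month (assumption).
    monthly_tokens = 500_000_000
    print(f"GPT-5.2:      ${input_cost_usd(monthly_tokens, GPT_5_2_USD_PER_M):,.2f}")
    print(f"Gemini Flash: ${input_cost_usd(monthly_tokens, GEMINI_FLASH_USD_PER_M):,.2f}")
```

The calculation covers input tokens only. Output-token prices, which the text does not give, would change the total.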

The strategy is brilliant, but also worrying for European and especially German industrial companies. Google's approach is "enterprise-out"—not "consumer-first" like OpenAI. Google integrates AI into the tools companies already use. It bundles Gemini with Google Workspace, creates over 1,500 pre-built AI agents, and integrates natively with Salesforce, SAP, and ServiceNow. The strategic message is strong: why buy separate ChatGPT subscriptions when the AI is already in the productivity suite?

Morgan Stanley estimates that if Google converts just 30 percent of its existing Workspace customer base to Gemini Enterprise, it could generate $8-10 billion in annual recurring revenue by 2027—with operating margins exceeding 40 percent. This isn't speculation, but rather arithmetic based on available customer numbers and proven SaaS upgrade patterns.

More than just scaling? Is the next generation of AI already learning to think for real? Why the new AI could be more than just a productivity tool

Lateral thinking as an economic factor: The infrastructure of innovation

Why is lateral thinking economically relevant? Because true innovation, meaning not merely scaling existing patterns but recognizing new spaces of possibility, requires precisely these cognitive abilities. An AI system that can only address problems through statistical pattern recognition will function in narrowly defined domains but will be blind to innovative leaps. However, if an AI system can construct parallel hypotheses, test them against each other, and scan for unexpected connections, then it gains real generalizability. It can handle ambiguity. It can weigh options that have more than one valid answer.

For German industry, particularly the management of mid-sized companies in the mechanical engineering, automation systems, and logistics sectors, this poses a direct innovation challenge. An AI partner capable of lateral thinking is a genuine innovation tool. An AI partner limited to GPT-5.2-style reasoning is an efficient document writer and code generator, but not a strategic advisor. This is the difference between a "productivity tool" and a "strategic capability."

Going even further: If Google's Snow Bunny checkpoint is indeed incorporated into the upcoming Gemini 3.5 (which technical insiders suspect based on the naming convention and timeline logic), then the balance of power in the AI industry will fundamentally shift in 2026. Not just a little. Fundamentally.

The architecture of the breakthrough: Not just scaling

A critical point: according to the leak, the improvement did not result from additional parameters or increased computing power. The research question from 2023 to 2025 was whether mere scaling would suffice. It now appears that it does not. A genuine architectural innovation was needed: a paradigm shift from “predict the next token statistically” to “decompose the problem, reason hierarchically, verify.” The technical literature on Hierarchical Reasoning Models (HRM) and neuro-symbolic AI has argued since 2024-2025 that such architectures are possible and that they can achieve better reasoning performance with significantly fewer parameters than pure scaling approaches.

Google appears to have put a version of this into production. OpenAI and Anthropic (Claude) remain more deeply embedded in the scale-first paradigm. This is a strategic difference, not a marginal one. It also explains why the raw parameter count is no longer the only factor that matters.

The risks are not marginal

The authenticity of the data remains unclear. Benchmark leaks are easy to manipulate, and the AI industry repeatedly experienced an erosion of benchmark integrity in 2024-2025. Score inflation, training data contamination, and selective reporting are well-documented practices. A cautious analyst would advise: don't trust the screenshots, wait for general availability (GA), and conduct independent evaluations.

However, the insider reports about “Deep Think” mode, parallel code generation (3,000 lines in one prompt), and SVG and music generation capabilities are already documented in beta tester reports and confirmed through the Vertex AI cloud integration. This reduces the risk of manipulation. Google would have too much to lose if these benchmarks were fake. The company may be a less transparent competitor, but it is not stupid.

Strategic implications for European industry

This is where things get serious. Europe doesn't have a major player in the foundation model game. Not really. Mistral, founded in France, is fighting for survival against open-source alternatives. Aleph Alpha, the German startup, gave up its independence long ago. Europe is exporting talent to OpenAI, Google, and Anthropic instead of retaining it. The continent is producing research papers but not winning markets.

The emerging dynamics are dangerous. Google will sharpen its enterprise AI offering with Snow Bunny/Gemini 3.5. If German machine manufacturers, logistics companies, and SMEs fundamentally rely on Google, Microsoft (with OpenAI integration), or Anthropic, then they are strategically dependent. They pay to grow with the technology, but they don't control it. For a country like Germany, which has built its competitiveness on technological depth, this is a medium-term risk.

Germany is a global leader in Industry 4.0 and automation. But if the cognitive layer—the AI that thinks about production processes—comes from the US, then Germany is delegating the strategic level. This is a classic trap: remaining technically strong at the lower levels, but losing control over top-level decisions and innovation.

Is there a way back, or a way around? It is difficult. Open-source models (Llama, Qwen, Mistral) are cheaper, but they lag behind frontier models in reasoning depth. A "European AI" program would take years and cost trillions. The practical path is likely this: European industry must work with frontier models but develop its own specializations and domain expertise that the generalist models cannot simply replicate. This is possible, but it requires organizational depth and investment in talent, not just API calls.

The larger narrative: The shift to cognitive depth

We are at the turning point from an era of scaling to an era of cognitive depth. The years 2017-2023 were the era of "Bigger Models, Better Results": the narrative from GPT-2 to GPT-3 to GPT-4 was pure scaling. 2024-2025 were the years in which the limits of this approach became apparent. Ten times more parameters no longer produced ten times better results. Progress required architectural thinking and innovation.

Google, with its research labs (DeepMind + Google Brain unified), its TPU investments, and its long-term horizon, was prepared for this transition. OpenAI is more reactive, better at public relations, but somewhat behind the curve in the research cycle game. That's the situation in January 2026.

The Hieroglyph benchmark and the Snow Bunny leaks are symptoms of this deeper shift. Not because a new model is good at solving puzzles, but because, if the leaks are accurate, genuine System 2 thinking has been implemented at production scale.

This has consequences not only for the AI industry, but for all industries that understand AI as a strategic input. And that should really be everyone.
