Published on: June 13, 2025 / Updated on: June 13, 2025 – Author: Konrad Wolfenstein

For robots and other AI agents: The V-JEPA 2 AI model from Meta – The AI that understands our physical world – Image: Xpert.Digital
Meta presents V-JEPA 2: AI system learns to make predictions about the physical world
Meta publishes V-JEPA 2: A revolutionary AI world model for the future of artificial intelligence
Meta has unveiled V-JEPA 2, a groundbreaking AI system that takes a fundamentally different approach from conventional large language models. This world model, with its 1.2 billion parameters, was developed to help robots and other AI agents understand the physical world and predict how it will react to their actions.
What is V-JEPA 2 and how does it differ from language models?
V-JEPA 2 stands for “Video Joint Embedding Predictive Architecture 2” and is based on a completely different architecture than traditional language models. While language models like ChatGPT or GPT-4 make probabilistic predictions about text sequences, V-JEPA 2 operates in an abstract representational space and focuses on understanding physical laws.
The crucial difference lies in what is predicted: language models are trained to predict the next token of a text and are usually refined afterwards with human-written or human-rated examples. V-JEPA 2, on the other hand, applies self-supervised learning to unlabeled videos, so pre-training needs no human annotation, which significantly reduces data preparation costs. The model learns not by reconstructing pixels, but by predicting abstract representations of the video content.
The JEPA architecture: Learning through prediction
The Joint Embedding Predictive Architecture (JEPA) was proposed by Yann LeCun, Meta's Chief AI Scientist, and represents an alternative to generative AI models. Unlike generative approaches, which attempt to reconstruct every missing pixel, V-JEPA 2 masks out regions of a video and learns to predict an abstract representation of what is hidden there.
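To make the contrast with pixel reconstruction concrete, here is a minimal sketch in plain Python. It is a toy illustration and not Meta's implementation: the `encode` and `predict` functions, the patch values, and the choice of an L1 distance are all assumptions made for brevity. The only point is that the error is measured between embeddings of the hidden patch, not between its pixels.

```python
# Toy illustration of prediction in representation space (not Meta's code).
# All function names and numbers are hypothetical.

def encode(patch):
    """Stand-in encoder: maps a patch (list of pixel values) to a 2-number embedding."""
    mean = sum(patch) / len(patch)
    spread = max(patch) - min(patch)
    return [mean, spread]

def predict(context_embeddings):
    """Stand-in predictor: guesses the hidden patch's embedding
    as the average of the visible patches' embeddings."""
    n = len(context_embeddings)
    dims = len(context_embeddings[0])
    return [sum(e[d] for e in context_embeddings) / n for d in range(dims)]

def l1_distance(a, b):
    """Mean absolute difference between two embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A tiny "video" of four patches; the patch at index 2 is masked out.
patches = [[0.0, 2.0], [1.0, 3.0], [2.0, 4.0], [3.0, 5.0]]
masked_index = 2

visible = [p for i, p in enumerate(patches) if i != masked_index]
context = [encode(p) for p in visible]
target = encode(patches[masked_index])  # embedding of the hidden patch
guess = predict(context)

# The training signal compares embeddings, never raw pixels.
latent_loss = l1_distance(guess, target)
```

In a real system the encoder and predictor would be large neural networks trained to reduce this kind of loss over many masked videos.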
The system uses a two-stage training approach:
First phase: Self-supervised learning
- Training with over one million hours of video material and one million images
- Learning physical interaction patterns without human annotation
- Development of an internal model of the physical world
Second phase: Action-conditioned training
- Fine-tuning with only 62 hours of robot control data from the DROID dataset
- Integration of agent actions into predictive capabilities
- Enabling planning and closed-loop control
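How an action-aware world model enables planning and closed-loop control can be sketched with a toy example. Everything below is hypothetical: the state is a single number standing in for an embedding, `predict_next` stands in for the learned predictor, and the exhaustive search with a cumulative-distance score is a simplification chosen to keep the example small and deterministic. It is not the planner Meta uses.

```python
from itertools import product

# Toy sketch of planning with an action-conditioned world model.
# All names and values are hypothetical.

ACTIONS = (-1.0, 0.0, 1.0)  # assumed discrete action set

def predict_next(state, action):
    """Stand-in world model: predicts the next state for a given action."""
    return state + 0.5 * action

def imagined_cost(state, actions, goal):
    """Roll the model forward in imagination and sum the distance to the goal."""
    cost = 0.0
    for a in actions:
        state = predict_next(state, a)
        cost += abs(state - goal)
    return cost

def plan(state, goal, horizon=3):
    """Score every candidate action sequence and return the first action of the best one."""
    best = min(product(ACTIONS, repeat=horizon),
               key=lambda seq: imagined_cost(state, seq, goal))
    return best[0]

def control_loop(state, goal, steps=6):
    """Closed loop: plan, execute one action, observe the result, replan."""
    trajectory = [state]
    for _ in range(steps):
        action = plan(state, goal)
        # In this toy the "real world" behaves exactly like the model.
        state = predict_next(state, action)
        trajectory.append(state)
    return trajectory

trajectory = control_loop(state=0.0, goal=2.0)
```

Because only the first action of each plan is executed before replanning, errors in the model's predictions can be corrected at the next step. That is the practical benefit of running the loop closed.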
Superior performance in practice
V-JEPA 2 demonstrates impressive performance in various areas:
Video understanding and motion detection
- 77.3% top-1 accuracy on the Something-Something v2 dataset
- 39.7% recall-at-5 on Epic-Kitchens-100 action anticipation (a 44% relative improvement over previous models)
- State-of-the-art performance in various video question-and-answer tasks
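For readers unfamiliar with these metrics, the sketch below shows the general idea behind top-1 accuracy and recall-at-k: a prediction counts as a hit if the correct label is among the k highest-scoring candidates. The labels and scores are invented, k=2 stands in for k=5 because the toy has only three labels, and the official benchmark protocols may differ in detail (for example by averaging per class).

```python
# Simplified top-k scoring with invented data (not the official evaluation code).

def top_k_hit(scores, true_label, k):
    """True if the correct label is among the k highest-scoring labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]

def hit_rate(predictions, k):
    """Fraction of (scores, true_label) pairs that score a top-k hit."""
    hits = sum(top_k_hit(scores, label, k) for scores, label in predictions)
    return hits / len(predictions)

# Hypothetical model outputs: label -> score, paired with the true label.
predictions = [
    ({"open": 0.7, "close": 0.2, "push": 0.1}, "open"),
    ({"open": 0.5, "close": 0.4, "push": 0.1}, "close"),
    ({"open": 0.6, "close": 0.3, "push": 0.1}, "push"),
    ({"open": 0.1, "close": 0.1, "push": 0.8}, "push"),
]

top1 = hit_rate(predictions, k=1)  # only the single best guess counts
top2 = hit_rate(predictions, k=2)  # the two best guesses count
```

Widening k can only keep or raise the score, which is why recall-at-5 figures are not directly comparable with top-1 figures.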
Robot control
- 65-80% success rate in pick-and-place tasks in unfamiliar environments
- Zero-shot robot control without environment-specific training
- Deployment in two different laboratories with Franka robot arms
Efficiency compared to the competition
V-JEPA 2 is 30 times faster than NVIDIA's Cosmos model and only needs 16 seconds to plan a robot action, while Cosmos takes 4 minutes.
Technical innovations and key features
The model is characterized by five key technological breakthroughs:
- Self-supervised learning: Eliminates the need for large amounts of labeled data
- Masking mechanism: Trains the model by predicting hidden video areas
- Abstract representation learning: Focus on semantic meanings instead of pixel details
- World model architecture: Building an internal understanding of physical laws
- Efficient transfer learning: Outstanding zero-shot learning abilities
New benchmarks reveal the limits of current AI
In parallel to V-JEPA 2, Meta has released three new benchmarks that test the physical understanding of AI systems:
IntPhys 2
It tests the ability to distinguish between physically plausible and impossible scenarios. Even advanced models still perform close to chance level on this benchmark.
MVPBench
It uses visually similar video pairs with opposing answers to the same question. V-JEPA 2 achieves 44.5% Paired Accuracy – the best performance of all systems tested.
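Why paired accuracy is a stricter measure than ordinary accuracy can be shown in a few lines. The sketch assumes that a pair only counts as solved when both of its videos are answered correctly; the result data is invented for illustration.

```python
# Paired vs. ordinary accuracy on invented results.
# Assumption: a pair is solved only if BOTH of its videos are answered correctly.

def paired_accuracy(pairs):
    """Fraction of pairs in which both answers are correct."""
    both_right = sum(1 for first_ok, second_ok in pairs if first_ok and second_ok)
    return both_right / len(pairs)

def plain_accuracy(pairs):
    """Fraction of individual answers that are correct, ignoring pairing."""
    answers = [ok for pair in pairs for ok in pair]
    return sum(answers) / len(answers)

# Each tuple: (first video answered correctly?, second video answered correctly?)
results = [(True, True), (True, False), (False, True), (True, True), (True, False)]

paired = paired_accuracy(results)
plain = plain_accuracy(results)
```

In this example the plain accuracy is 0.7 while the paired accuracy is only 0.4. A model that always gives the same answer to both videos of a pair gets half of the individual answers right but solves no pair at all.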
CausalVQA
This benchmark examines causal understanding and counterfactual reasoning. The results show that current AI systems can describe what they see well, but have difficulty predicting alternative outcomes.
AI without data hunger: How V-JEPA 2 makes machine learning more efficient
Yann LeCun sees world models like V-JEPA 2 as the key to the next generation of AI development. The model could revolutionize various application areas:
Robotics and household assistants
World models are intended to usher in a new era of robotics, in which AI agents will be able to handle real-world tasks without astronomical amounts of training data.
Autonomous vehicles
V-JEPA 2's real-time spatial understanding could be crucial for autonomous vehicles, warehouse robots, and drone delivery systems.
Augmented Reality (AR) and virtual assistants
Meta plans to expand the functionality of V-JEPA 2 by integrating audio analytics and enhanced video understanding capabilities for AR glasses and virtual assistants.
Open-source availability and research funding
Meta has released V-JEPA 2 as open source under the CC-BY-NC license to promote global AI research. The model code is available on GitHub and can be run on platforms such as Google Colab and Kaggle. This openness contrasts with many other large AI models and is intended to advance the development of world models in robotics and embodied AI.
A paradigm shift in AI development
V-JEPA 2 represents a fundamental paradigm shift from pure language processing to a deeper understanding of the physical world. While most AI companies rely on generative models, Meta pursues an alternative vision for the future of artificial intelligence with its world-model approach. The ability to learn from minimal data and enable zero-shot robot control could pave the way for a new generation of intelligent systems that can not only understand but also act in the real world.











