Published: June 13, 2025 / Last updated: June 13, 2025 - Author: Konrad Wolfenstein
For robots and other AI agents: V-JEPA 2, Meta's AI model that understands our physical world - Image: Xpert.digital
Meta presents V-JEPA 2: an AI system that learns to predict the physical world
Meta releases V-JEPA 2: a revolutionary AI world model for the future of artificial intelligence
With V-JEPA 2, Meta has presented a groundbreaking AI system that pursues a fundamentally different approach than conventional large language models. The world model, with 1.2 billion parameters, was developed to help robots and other AI agents understand the physical world and predict how it will react to their actions.
What is V-JEPA 2 and how does it differ from language models?
V-JEPA 2 stands for “Video Joint Embedding Predictive Architecture 2” and is based on a completely different architecture than traditional language models. While language models such as ChatGPT or GPT-4 make probabilistic predictions about text sequences, V-JEPA 2 works in an abstract representation space and focuses on understanding physical laws.
The decisive difference lies in the learning method: language models require large amounts of labeled data and learn through supervised training. V-JEPA 2, on the other hand, uses self-supervised learning and extracts its knowledge from unlabeled videos, which significantly reduces the cost of data preparation. The model does not learn through pixel reconstruction, but through abstract representations of the video content.
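To make that contrast concrete, here is a minimal sketch in PyTorch, assuming generic `decoder` and `predictor` modules. The names are illustrative placeholders, not Meta's code: a generative model pays a price for every wrong pixel, while a JEPA-style model is scored only in representation space.

```python
# Conceptual sketch, not Meta's code: contrasting the two objectives.
# `decoder` and `predictor` stand for arbitrary PyTorch modules.
import torch.nn.functional as F

def generative_loss(decoder, latent, target_pixels):
    """Generative models are scored on reconstructing every masked pixel."""
    return F.mse_loss(decoder(latent), target_pixels)

def jepa_loss(predictor, context_latent, target_latent):
    """A JEPA is scored in representation space instead: unpredictable pixel
    detail (leaf textures, sensor noise) never enters the loss."""
    return F.mse_loss(predictor(context_latent), target_latent.detach())
```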
The JEPA architecture: learning by prediction
The Joint Embedding Predictive Architecture (JEPA) was developed by Yann LeCun, Meta's Chief AI Scientist, and represents an alternative to generative AI models. In contrast to generative approaches that try to reconstruct every missing pixel, V-JEPA 2 works with masked video regions and learns to predict abstract concepts.
The system uses a two-stage training approach (sketched in code after the list):
First phase: self-supervised learning
- Training on over one million hours of video material and one million images
- Learning physical interaction patterns without human annotation
- Development of an internal model of the physical world
Second phase: action-conditioned adaptation
- Fine-tuning with only 62 hours of robot control data from the DROID dataset
- Integration of agent actions into the predictive capabilities
- Enabling planning and closed-loop control
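The sketch below condenses both phases into compact PyTorch, assuming generic `encoder`, `target_encoder`, `predictor`, and `action_predictor` modules. It illustrates the idea only and is not the released training code.

```python
# A compressed sketch of both training phases; all modules are illustrative
# placeholders, not the released training code.
import torch
import torch.nn.functional as F

def phase1_step(encoder, target_encoder, predictor, patches, mask, opt, ema=0.999):
    """Self-supervised: predict the latents of masked video patches.
    patches: (B, N, D) patchified video; mask: boolean (N,) over patches."""
    with torch.no_grad():                            # targets from a frozen EMA copy
        targets = target_encoder(patches)[:, mask]
    context = encoder(patches[:, ~mask])             # only visible patches encoded
    loss = F.mse_loss(predictor(context, mask), targets)
    opt.zero_grad(); loss.backward(); opt.step()
    for pt, p in zip(target_encoder.parameters(), encoder.parameters()):
        pt.data.mul_(ema).add_(p.data, alpha=1 - ema)  # slow-moving target encoder
    return loss.item()

def phase2_step(encoder, action_predictor, frame_t, action, frame_t1, opt):
    """Action-conditioned: from the current latent and a robot action,
    predict the latent of the next observation."""
    with torch.no_grad():                            # the encoder stays frozen
        z_t, z_t1 = encoder(frame_t), encoder(frame_t1)
    loss = F.mse_loss(action_predictor(z_t, action), z_t1)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```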
Superior performance in practice
V-JEPA 2 demonstrates impressive performance across several areas:
Video understanding and motion detection
- 77.3% top-1 accuracy on the Something-Something v2 dataset
- 39.7% recall-at-5 on EPIC-Kitchens-100 action anticipation (a 44% improvement over previous models)
- State-of-the-art performance on various video question-answering tasks
Robot control
- 65-80% success rate for pick-and-place tasks in unknown environments
- Zero-shot robot control without environment-specific training
- Deployed in two different laboratories on Franka robot arms
Efficiency compared to the competition
V-JEPA 2 is 30 times faster than Nvidia's Cosmos model and needs only 16 seconds to plan a robot action, while Cosmos needs 4 minutes. A sketch of how such latent-space planning works follows below.
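This speed advantage is plausible because planning happens entirely in latent space: candidate action sequences are rolled out through the predictor and scored against the goal's embedding, and no pixels are ever generated. The following is a hypothetical sketch of such sampling-based planning (here a simple cross-entropy method); all parameter values are made up.

```python
# Hypothetical sketch of sampling-based planning in latent space (a simple
# cross-entropy method); all parameter values below are made up.
import torch

def plan_action(encoder, action_predictor, frame, goal_frame,
                horizon=5, samples=256, elites=32, iters=3, action_dim=7):
    """Choose the action sequence whose predicted latent rollout lands closest
    to the goal latent. No pixels are generated, which keeps planning fast."""
    with torch.no_grad():
        z0, z_goal = encoder(frame), encoder(goal_frame)      # (1, D) each
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)
        for _ in range(iters):
            acts = mean + std * torch.randn(samples, horizon, action_dim)
            z = z0.expand(samples, -1)
            cost = torch.zeros(samples)
            for t in range(horizon):                          # latent rollout
                z = action_predictor(z, acts[:, t])
                cost += (z - z_goal).pow(2).sum(-1)
            best = acts[cost.topk(elites, largest=False).indices]
            mean, std = best.mean(0), best.std(0)             # refit around elites
        return mean[0]                                        # first action only
```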
Technical innovations and key characteristics
The model is characterized by five central technical breakthroughs:
- Self-supervised learning: eliminates the need for large amounts of labeled data
- Masking mechanism: trains the model by predicting hidden video areas (illustrated in the sketch after this list)
- Abstract representation learning: focus on semantic meaning instead of pixel details
- World model architecture: builds an internal understanding of physical laws
- Efficient transfer learning: outstanding zero-shot capabilities
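To illustrate the masking mechanism from the list above: instead of scattering random missing pixels, whole spatio-temporal blocks of patches are hidden, which forces the model to predict what happens in a region over time. The grid and block sizes below are made up for demonstration.

```python
# Illustrative only: a spatio-temporal block mask over video patches, the kind
# of masking the list above refers to. Grid and block sizes are made up.
import torch

def block_mask(t=8, h=14, w=14, block=(8, 6, 6)):
    """Hide one contiguous tube of patches across all frames, so the model
    must predict what happens there over time, not just copy neighbors."""
    bt, bh, bw = block
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    y = torch.randint(0, h - bh + 1, (1,)).item()
    x = torch.randint(0, w - bw + 1, (1,)).item()
    mask[:bt, y:y + bh, x:x + bw] = True
    return mask.flatten()                 # boolean (t*h*w,) patch mask

mask = block_mask()
print(f"{mask.float().mean():.0%} of patches hidden")   # ~18% for these sizes
```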
New benchmarks reveal the limits of current AI
In parallel with V-JEPA 2, Meta has released three new benchmarks that test the physical understanding of AI systems:
IntPhys 2
Tests the ability to distinguish between physically plausible and impossible scenarios. Even advanced models still perform close to chance level here.
MVPBench
Uses visually similar video pairs with opposite answers to the same question. V-JEPA 2 reaches 44.5% paired accuracy, the best performance of all tested systems; the metric is sketched below.
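Paired accuracy is a deliberately strict metric: a model only scores if it answers both videos of a pair correctly, so blind guessing collapses well below 50%. A minimal sketch of the scoring, under that assumption:

```python
# Sketch of the paired-accuracy metric as described above; a pair only counts
# if the model answers *both* of its near-identical videos correctly.
def paired_accuracy(pairs):
    """pairs: list of ((pred_a, gold_a), (pred_b, gold_b)) per video pair."""
    hits = sum(1 for (pa, ga), (pb, gb) in pairs if pa == ga and pb == gb)
    return hits / len(pairs)

# First pair fully correct, second pair only half correct -> scores 0.5,
# even though 3 of 4 individual answers were right.
example = [(("yes", "yes"), ("no", "no")), (("yes", "yes"), ("yes", "no"))]
print(paired_accuracy(example))  # 0.5
```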
CausalVQA
Examines causal understanding and counterfactual reasoning. The results show that current AI systems can describe what they see quite well, but have difficulty predicting alternative courses of events.
AI without a hunger for data: how V-JEPA 2 makes machine learning more efficient
Yann LeCun sees world models like V-JEPA 2 as the key to the next generation of AI development. The model could revolutionize several areas of application:
Robotics and household assistants
World models are expected to usher in a new era of robotics in which AI agents can handle real-world tasks without astronomical amounts of training data.
Autonomous vehicles
V-JEPA 2's real-time spatial understanding could be crucial for autonomous vehicles, warehouse robots, and drone delivery systems.
Augmented reality (AR) and virtual assistants
Meta plans to expand V-JEPA 2's capabilities with audio analysis and extended video understanding for AR glasses and virtual assistants.
Open source availability and research promotion
Meta has released V-JEPA 2 as open source under a CC BY-NC license to promote global AI research. The model code is available on GitHub and can be run on platforms such as Google Colab and Kaggle; a minimal usage sketch follows below. This openness contrasts with many other large AI models and is intended to advance the development of world models in robotics and embodied AI.
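Getting started can be as simple as pulling a checkpoint and extracting video embeddings. The snippet below is a sketch based on the Hugging Face integration announced alongside the release; the checkpoint name and processor API are assumptions that should be verified against the GitHub repository.

```python
# Sketch based on the Hugging Face integration announced with the release.
# The checkpoint id and processor API are assumptions -- verify them against
# the official GitHub repository before use.
import torch
from transformers import AutoModel, AutoVideoProcessor

repo = "facebook/vjepa2-vitl-fpc64-256"             # assumed checkpoint id
processor = AutoVideoProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

video = torch.randint(0, 256, (64, 3, 256, 256))    # dummy clip: 64 RGB frames
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
print(out.last_hidden_state.shape)                  # patch-level video latents
```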
A paradigm shift in AI development
V-JEPA 2 represents a fundamental paradigm shift from pure language processing to a deeper understanding of the physical world. While most AI companies rely on generative models, Meta pursues an alternative vision for the future of artificial intelligence with its world-model approach. The ability to learn from minimal data and enable zero-shot robot control could pave the way for a new generation of intelligent systems that can not only understand the real world but also act in it.