For robots and other AI agents: Meta's V-JEPA 2 AI model – The AI that understands our physical world

Published on: June 13, 2025 / Updated on: June 13, 2025 – Author: Konrad Wolfenstein


Meta presents V-JEPA 2: AI system learns to make predictions about the physical world

Meta publishes V-JEPA 2: A revolutionary AI world model for the future of artificial intelligence

Meta has unveiled V-JEPA 2, a groundbreaking AI system that takes a fundamentally different approach from conventional large language models. This world model, with its 1.2 billion parameters, was developed to help robots and other AI agents understand the physical world and predict how it will react to their actions.

What is V-JEPA 2 and how does it differ from language models?

V-JEPA 2 stands for “Video Joint Embedding Predictive Architecture 2” and is based on a completely different architecture than traditional language models. While language models like ChatGPT or GPT-4 make probabilistic predictions about text sequences, V-JEPA 2 operates in an abstract representational space and focuses on understanding physical laws.

The crucial difference lies in the training data and the prediction target: language models learn from text and predict the next token, whereas V-JEPA 2 uses self-supervised learning to extract knowledge from unlabeled videos, which significantly reduces data preparation costs because no human annotation is needed. The model learns not through pixel reconstruction, but by predicting abstract representations of the video content.

The JEPA architecture: Learning through prediction

The Joint Embedding Predictive Architecture (JEPA) was proposed by Yann LeCun, Meta's Chief AI Scientist, and represents an alternative to generative AI models. Unlike generative approaches, which attempt to reconstruct every missing pixel, V-JEPA 2 works with masked video regions and learns to predict their abstract representations rather than their pixels.
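The idea of predicting in representation space rather than in pixel space can be illustrated with a small toy example. The following is a minimal sketch, not Meta's implementation: the linear `encode` function, the `mean_predictor`, and the L1 distance are all assumptions chosen for illustration. The point is only that the loss compares embeddings of masked patches, never raw pixels.

```python
import random

def encode(patch, weights):
    """Toy linear encoder: maps a patch (list of floats) to an embedding."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]

def l1(a, b):
    """Mean absolute difference between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mean_predictor(context, position):
    """Hypothetical predictor: averages the context embeddings (ignores position)."""
    dim = len(context[0])
    return [sum(e[d] for e in context) / len(context) for d in range(dim)]

def jepa_loss(patches, masked_idx, ctx_weights, tgt_weights, predict):
    """Mean L1 distance between predicted and target embeddings of masked patches."""
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    context = [encode(p, ctx_weights) for p in visible]
    losses = []
    for i in sorted(masked_idx):
        target = encode(patches[i], tgt_weights)  # target is an embedding, not pixels
        predicted = predict(context, i)           # predictor outputs an embedding
        losses.append(l1(predicted, target))
    return sum(losses) / len(losses)

random.seed(0)
patches = [[random.random() for _ in range(4)] for _ in range(6)]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # 4-dim patch -> 2-dim embedding
loss = jepa_loss(patches, {1, 4}, weights, weights, mean_predictor)
print(f"embedding-space loss: {loss:.3f}")
```

In a real system the encoder and predictor would be trained neural networks, and the masked regions would be blocks of video in space and time. The sketch only shows where the loss is computed.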

The system uses a two-stage training approach:

First phase: Self-supervised learning

Training on more than one million hours of video and one million images
Learning physical interaction patterns without human annotation
Development of an internal model of the physical world

Second phase: Action-conditioned adaptation

Fine-tuning with only 62 hours of robot control data from the DROID dataset
Integration of agent actions into predictive capabilities
Enabling planning and closed-loop control
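How a world model enables planning and closed-loop control can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure Meta uses: `toy_dynamics` is a hypothetical stand-in for the action-conditioned predictor, the state is a single number instead of a learned embedding, and the planner is plain random shooting with a cumulative distance to the goal. In each control step, the agent imagines several action sequences, executes only the first action of the best one, observes the result, and plans again.

```python
import random

def toy_dynamics(state, action):
    """Hypothetical stand-in for the action-conditioned predictor."""
    return state + 0.5 * action

def rollout_cost(state, actions, goal, dynamics):
    """Imagine the action sequence and sum the distance to the goal at each step."""
    cost = 0.0
    for action in actions:
        state = dynamics(state, action)
        cost += abs(state - goal)
    return cost

def plan(state, goal, dynamics, horizon=3, candidates=200, rng=None):
    """Random-shooting planner: return the first action of the best sampled sequence."""
    rng = rng or random.Random(0)
    best_cost, best_seq = float("inf"), None
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, seq, goal, dynamics)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]

def control_loop(state, goal, steps=20):
    """Closed loop: plan, execute one action, observe the new state, re-plan."""
    rng = random.Random(0)
    for _ in range(steps):
        action = plan(state, goal, toy_dynamics, rng=rng)
        state = toy_dynamics(state, action)  # here the "real world" equals the model
    return state

final = control_loop(state=0.0, goal=2.0)
print(f"final state: {final:.2f}")
```

Re-planning after every executed action is what makes the loop closed: errors in the imagined rollout are corrected by the next observation instead of accumulating.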

Superior performance in practice

V-JEPA 2 demonstrates impressive performance in various areas:

Video understanding and motion detection

77.3% top-1 accuracy on the Something-Something v2 dataset
39.7% recall-at-5 on Epic-Kitchens-100 action anticipation (a 44% relative improvement over previous models)
State-of-the-art performance in various video question-and-answer tasks
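For readers unfamiliar with the two metrics named above, the following minimal sketch shows how top-1 accuracy and recall-at-5 can be computed from per-class scores. The scores and labels are made-up toy values, and the recall shown here is a simple per-sample variant; the official benchmark protocols may aggregate differently, for example per class.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class is the true label."""
    correct = 0
    for sample_scores, label in zip(scores, labels):
        best = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        correct += best == label
    return correct / len(labels)

def recall_at_k(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=sample_scores.__getitem__, reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Toy data: 3 samples, 6 classes (illustrative values only).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1 is ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2 is ranked second
    [0.40, 0.30, 0.10, 0.10, 0.08, 0.02],  # true class 5 is ranked last
]
labels = [1, 2, 5]

print(f"top-1 accuracy: {top1_accuracy(scores, labels):.2f}")
print(f"recall-at-5:    {recall_at_k(scores, labels, k=5):.2f}")
```

Recall-at-5 is the more forgiving measure, which suits anticipation tasks where several future actions are plausible.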

Robot control

65-80% success rate in pick-and-place tasks in unfamiliar environments
Zero-shot robot control without environment-specific training
Deployment in two different laboratories with Franka robot arms

Efficiency compared to the competition

According to Meta, V-JEPA 2 is 30 times faster than NVIDIA's Cosmos model. Planning a single robot action takes 16 seconds with V-JEPA 2, while Cosmos takes 4 minutes.
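The two planning times quoted above can be turned into a ratio directly. The short calculation below uses only those two figures. It yields a factor of 15 for this particular measurement, so the 30-fold headline figure presumably refers to a different comparison that the text does not specify.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

cosmos_seconds = 4 * 60   # 4 minutes per planned action
vjepa2_seconds = 16       # 16 seconds per planned action

ratio = speedup(cosmos_seconds, vjepa2_seconds)
print(ratio)  # 15.0
```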

Technical innovations and key features

The model is characterized by five key technological breakthroughs:

Self-supervised learning: Eliminates the need for large amounts of labeled data
Masking mechanism: Trains the model by predicting hidden video areas
Abstract representation learning: Focus on semantic meanings instead of pixel details
World model architecture: Building an internal understanding of physical laws
Efficient transfer learning: Outstanding zero-shot learning abilities

New benchmarks reveal the limits of current AI

In parallel to V-JEPA 2, Meta has released three new benchmarks that test the physical understanding of AI systems:

IntPhys 2

IntPhys 2 tests the ability to distinguish between physically plausible and impossible scenarios. Even advanced models still perform close to chance level on this benchmark.

MVPBench

MVPBench uses visually similar video pairs with opposing answers to the same question. V-JEPA 2 achieves a paired accuracy of 44.5%, the best performance of all systems tested.
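Why a paired score is harder to game than ordinary accuracy can be shown with a small sketch. It assumes that a pair counts as solved only when both videos of the pair are answered correctly, which is how the term is read here. The data is invented for illustration.

```python
def plain_accuracy(pairs):
    """Fraction of individual questions answered correctly."""
    answers = [item for pair in pairs for item in pair]
    return sum(pred == truth for pred, truth in answers) / len(answers)

def paired_accuracy(pairs):
    """Fraction of pairs in which BOTH videos are answered correctly."""
    solved = sum(
        pred_a == truth_a and pred_b == truth_b
        for (pred_a, truth_a), (pred_b, truth_b) in pairs
    )
    return solved / len(pairs)

# Each pair: ((prediction, truth) for video A, (prediction, truth) for video B).
# A model that always answers "yes" gets every A right and every B wrong.
always_yes = [(("yes", "yes"), ("yes", "no"))] * 4

print(plain_accuracy(always_yes))   # 0.5
print(paired_accuracy(always_yes))  # 0.0
```

Because the two videos in a pair look alike but require opposite answers, a model that relies on superficial cues or a fixed answer bias scores well on single questions and poorly on pairs.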

CausalVQA

CausalVQA examines causal understanding and counterfactual reasoning. The results show that current AI systems can describe what they see well, but have difficulty predicting alternative outcomes.

AI without data hunger: How V-JEPA 2 makes machine learning more efficient

Yann LeCun sees world models like V-JEPA 2 as the key to the next generation of AI development. The model could revolutionize various application areas:

Robotics and household assistants

World models are intended to usher in a new era of robotics, in which AI agents will be able to handle real-world tasks without astronomical amounts of training data.

Autonomous vehicles

V-JEPA 2's real-time spatial understanding could be crucial for autonomous vehicles, warehouse robots, and drone delivery systems.

Augmented Reality (AR) and virtual assistants

Meta plans to expand the functionality of V-JEPA 2 by integrating audio analytics and enhanced video understanding capabilities for AR glasses and virtual assistants.

Open-source availability and research funding

Meta has released V-JEPA 2 as open source under the CC-BY-NC license to promote global AI research. The model code is available on GitHub and can be run on platforms such as Google Colab and Kaggle. This openness contrasts with many other large AI models and is intended to advance the development of world models in robotics and embodied AI.

A paradigm shift in AI development

V-JEPA 2 represents a fundamental paradigm shift from pure language processing to a deeper understanding of the physical world. While most AI companies rely on generative models, Meta pursues an alternative vision for the future of artificial intelligence with its world-model approach. The ability to learn from minimal data and enable zero-shot robot control could pave the way for a new generation of intelligent systems that can not only understand but also act in the real world.
