AI Wednesday 03: VL-JEPA (Vision-Language Joint Embedding Predictive Architecture)
Overview
In this week's AI Wednesday we focused on VL-JEPA, a new kind of vision-language architecture, and on how it differs from the vision-language models (VLMs) most people are familiar with.
We started by explaining what VLM and VLA (vision-language-action) models are and how they generally work, so everyone had a shared baseline. Then we moved into VL-JEPA, its main idea, and why it works differently from traditional models. We also discussed how this approach can be useful for robots and real-world tasks. The session was engaging and interactive, with some great discussions.
Topics
Traditional Vision-Language Models (VLMs):
- We kicked off by revisiting how standard vision-language models work, especially models that process images and text together and generate text token by token, so everyone understood the baseline architecture and its limitations (see the sketch below).
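For intuition, here is a minimal, hypothetical PyTorch sketch of the token-by-token loop that such baseline VLMs follow; the tiny model, its dimensions, and the greedy decoding are illustrative assumptions, not any specific VLM's implementation.

```python
# Illustrative toy VLM, not a real model: image features are prepended to the
# text sequence and a discrete next token is predicted at every step.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.image_proj = nn.Linear(512, dim)          # map image features into token space
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)      # logits over the vocabulary

    def forward(self, image_feats, tokens):
        img = self.image_proj(image_feats).unsqueeze(1)   # (B, 1, dim) image "token"
        txt = self.token_emb(tokens)                      # (B, T, dim) text tokens
        h = self.backbone(torch.cat([img, txt], dim=1))   # joint image+text sequence
        return self.lm_head(h[:, -1])                     # next-token logits

# Greedy decoding: each generated token is fed back in before predicting the next one.
model = TinyVLM()
image_feats = torch.randn(1, 512)        # stand-in for a vision encoder's output
tokens = torch.tensor([[1]])             # start-of-sequence token
for _ in range(5):
    next_tok = model(image_feats, tokens).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)
print(tokens)                            # the caption grows one token at a time
```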
Introducing VL-JEPA & Joint Embedding Predictive Architecture:
- We then explored VL-JEPA’s core idea: instead of autoregressive token generation, it predicts continuous semantic embeddings in a shared latent space. This enables stronger performance with fewer parameters and better efficiency on tasks like classification and retrieval (a rough sketch of the idea follows).
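To make the contrast concrete, below is a rough, hypothetical sketch of a JEPA-style objective: predict the target's continuous embedding from the context and regress directly in latent space, with no token decoding. The encoder/predictor modules and dimensions are assumptions for illustration, not VL-JEPA's actual implementation.

```python
# Toy JEPA-style objective: the loss lives in embedding space, not token space.
import torch
import torch.nn as nn
import torch.nn.functional as F

context_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
target_encoder  = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
predictor       = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

image_feats = torch.randn(8, 512)   # stand-in for visual (context) features
text_feats  = torch.randn(8, 512)   # stand-in for paired text (target) features

z_ctx = context_encoder(image_feats)        # embed the context
with torch.no_grad():                       # assumption: JEPA-style setups often keep the
    z_tgt = target_encoder(text_feats)      # target encoder gradient-free (e.g. EMA-updated)
z_pred = predictor(z_ctx)                   # predict the target embedding from the context

# Regression between continuous embeddings -- no autoregressive generation involved.
loss = 1.0 - F.cosine_similarity(z_pred, z_tgt, dim=-1).mean()
loss.backward()
print(float(loss))
```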
Real-World Impacts & Robotics Relevance:
- Finally, we discussed how VL-JEPA’s non-generative, predictive design could be especially valuable in robotics and embodied AI, enabling systems to anticipate meaning and actions from sensory data rather than just generating descriptions, which could make perception and decision-making workflows much more efficient on robots (a toy sketch follows).
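As a toy illustration of that point, a robot could match a predicted embedding against a small library of known skill embeddings and act on the best match, instead of generating and then parsing a text description; the action names, dimensions, and random vectors below are purely hypothetical.

```python
# Toy decision step: pick an action by embedding similarity rather than text generation.
import torch
import torch.nn.functional as F

action_names = ["pick up cup", "open drawer", "push button"]
action_embeddings = F.normalize(torch.randn(3, 128), dim=-1)  # pretend: pre-computed skill embeddings
predicted = F.normalize(torch.randn(1, 128), dim=-1)          # pretend: embedding predicted from camera input

scores = predicted @ action_embeddings.T      # cosine similarity (both sides are unit-normalized)
best = scores.argmax(dim=-1).item()
print("chosen action:", action_names[best])
```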
Photos


Highlights
- One key takeaway from the session was understanding how VL-JEPA predicts meaning (continuous embeddings) instead of tokens, unlike traditional generative models.
Next Week
- Topic: Intro To Edge AI
- Host: Sebin Thomas