
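JEPA World Model Frontiers: Causal Object Interaction and End-to-End Collapse-Free Training

Source: YouTube (Stanford CS25 lecture) | Hazel (Heejeong) Nam & Lucas Maes | Apr 22, 2026 | Podcast: Stanford Online | Category: Other | Originally published: Apr 22, 2026 | Notes generated: 2026-06-22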


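Episode highlights


About the speakers and topic

Hazel (Heejeong) Nam is a first-year master's student at Brown University, advised by Professor Randall Balestriero; her research covers representation learning, causality, and self-supervised learning. Lucas Maes is a third-year PhD student at Mila and Université de Montréal who works with Damien Scieur and collaborates closely with Randall Balestriero at Brown. The lecture centers on JEPA (Joint-Embedding Predictive Architecture): Hazel presents how Causal-JEPA learns a world model through interventions on object-centric latent variables, and Lucas presents how LeWorldModel achieves end-to-end, collapse-free JEPA training with a minimal recipe.


Section-by-section notes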

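00:00 Opening and speaker introductions

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"I'm a bit nervous. This is my first time giving a talk in English, but at the same time, I'm really, really excited to talk about JEPA world model and my recent work, causal world model."
“I'm a little nervous, since this is my first time giving a talk in English, but I'm also very excited to talk about JEPA world models and my recent work on causal world models.”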


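01:36 World model basics and three design elements

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"The world model basically is a function that gets the previous state and the action to predict the next state. ... I perceive the world model terminology as a simulator."
“A world model is essentially a function that takes the previous state and an action and predicts the next state ... I think of the term 'world model' as meaning a simulator.”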
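
The "function from previous state and action to next state" view can be sketched in a few lines. This is only a toy illustration: the state layout (position, velocity), the hand-written dynamics, and the names `toy_world_model` and `rollout` are assumptions made here, not anything shown in the lecture, where the model is learned and operates on latent states.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]  # (position, velocity): toy state, assumed for illustration
Action = float               # acceleration applied for one time step


def toy_world_model(state: State, action: Action, dt: float = 0.1) -> State:
    """Predict the next state from the previous state and the action."""
    position, velocity = state
    new_velocity = velocity + action * dt
    new_position = position + new_velocity * dt
    return (new_position, new_velocity)


def rollout(model: Callable[[State, Action], State],
            state: State,
            actions: Sequence[Action]) -> List[State]:
    """Use the model as a simulator by feeding its own predictions back in."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


# Accelerate for two steps, then coast for one.
trajectory = rollout(toy_world_model, (0.0, 0.0), [1.0, 1.0, 0.0])
```

Calling the model repeatedly on its own output is what makes it usable as a simulator: a planner can try out action sequences without touching the real environment.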


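03:46 Generative world models vs. JEPA

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"JEPA tries to deal with only having predictive information in your latent space so that your prediction is getting more meaningful and human-like way."
“JEPA tries to keep only the information that is useful for prediction in the latent space, so that predictions become more meaningful and closer to how humans think.”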


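06:55 Collapse prevention in V-JEPA and DINO World Model

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"What they claim is, oh, do we actually have to train the JEPA encoder to get the meaningful abstraction for planning? They said, no, a pretrained DINO encoder can do that role as well."
“Their question was: do we really have to train a JEPA encoder to get an abstraction that is meaningful for planning? Their answer was no, a pretrained DINO encoder can play that role as well.”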


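10:07 Causal-JEPA: motivation and datasets

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"To understand this mechanism is you want to understand the things like this. You have each object and you want to know how one object influences each other."
“To understand this mechanism, you need to identify each object and work out how one object influences another.”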


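12:16 Object-centric learning and Slot Attention

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"There's a mechanism called slot attention that puts each feature to each slot. So there's a binding problem of feature to slot."
“There is a mechanism called slot attention that assigns each feature to a slot, which raises the problem of binding features to slots.”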
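
The feature-to-slot binding can be illustrated with a heavily simplified iteration. This sketch keeps only the core idea (a softmax over slots so that slots compete for features, followed by a weighted-mean update); it omits the learned projections and the recurrent update that a full slot attention module would normally include, and the `temperature` value, toy features, and initial slots are assumptions chosen for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention_step(slots, features, temperature=10.0):
    """One simplified slot-attention iteration (no learned weights)."""
    # Each feature spreads one unit of attention across the slots,
    # so the slots compete with each other for every feature.
    assignments = [
        softmax([temperature * sum(s * f for s, f in zip(slot, feature))
                 for slot in slots])
        for feature in features
    ]
    new_slots = []
    for j in range(len(slots)):
        weight_sum = sum(assignment[j] for assignment in assignments)
        # The slot becomes the attention-weighted mean of the features.
        new_slots.append([
            sum(assignment[j] * feature[d]
                for assignment, feature in zip(assignments, features)) / weight_sum
            for d in range(len(features[0]))
        ])
    return new_slots


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # two toy "objects"
slots = [[0.6, 0.4], [0.4, 0.6]]                              # hand-picked initial slots
for _ in range(3):
    slots = slot_attention_step(slots, features)
```

After a few iterations each slot settles near the mean of one feature cluster, which is the binding behavior the quote refers to.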


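14:20 “Monkey eats banana”: the core motivation behind Causal-JEPA

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"If model is truly understanding this eating mechanism, the model might be able to infer what is happening to banana when we cover the invisible cloth on banana."
“If the model truly understands the eating mechanism, it should be able to infer what is happening to the banana even when we cover the banana with a cloth.”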


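15:25 Causal-JEPA: architecture and masking strategy

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"But when we are trying to do positional encoding in the slot axis, there's a problem, because what object-centric models are doing is they do not define the order of the objects, but rather the object-centric models are permutationally equivalent depends with respect to the object orders."
“Positional encoding along the slot axis is a problem, because object-centric models do not define an order over objects; they are permutation-equivariant with respect to object order.”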
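
The conflict between slot-axis positional encoding and permutation equivariance can be checked numerically. The sketch below assumes a single-head self-attention with identity projections and a deliberately crude index-based encoding (`add_index_encoding` is a hypothetical helper); it does not reproduce the Causal-JEPA architecture, only the property under discussion.

```python
import math


def self_attention(slots):
    """Single-head self-attention with identity projections (a simplification)."""
    dim = len(slots[0])
    outputs = []
    for query in slots:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in slots]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, slots))
            for d in range(dim)
        ])
    return outputs


def add_index_encoding(slots):
    """Hypothetical positional encoding: add the slot index to every feature."""
    return [[x + index for x in slot] for index, slot in enumerate(slots)]


def close(a, b, tol=1e-9):
    """True if two lists of vectors agree elementwise within tol."""
    return all(abs(x - y) < tol
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


slots = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
perm = [2, 0, 1]
permuted = [slots[i] for i in perm]

# Without positional encoding, permuting the inputs just permutes the outputs.
plain = self_attention(slots)
is_equivariant = close(self_attention(permuted), [plain[i] for i in perm])

# With an index-based encoding, the same object gets a different encoding
# depending on which slot it lands in, so the property is lost.
encoded = self_attention(add_index_encoding(slots))
still_equivariant = close(self_attention(add_index_encoding(permuted)),
                          [encoded[i] for i in perm])
```

Since slot order is arbitrary, an encoding tied to the slot index injects information that has no meaning for the objects themselves.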


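20:09 Action conditioning: from concatenated features to a separate node

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"Why don't we consider action as another node of the graph? The Causal-JEPA does not recover any true causal graph, but its motivation is grounded in the causal graph."
“Why not treat the action as another node in the graph? Causal-JEPA does not recover any true causal graph, but its motivation is grounded in causal graphs.”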
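
The two conditioning styles differ mainly in how the token set handed to the transformer is built. The sketch below contrasts them on plain lists; the function names, the 3x2 `action_embedding` matrix, and all dimensions are hypothetical and chosen only to make the shapes visible.

```python
def concat_conditioning(slots, action):
    """Baseline: append the raw action vector to every slot's features."""
    return [list(slot) + list(action) for slot in slots]


def node_conditioning(slots, action, action_embedding):
    """Alternative: embed the action to slot width and add it as one extra token."""
    action_token = [sum(w * a for w, a in zip(row, action))
                    for row in action_embedding]
    return [list(slot) for slot in slots] + [action_token]


slots = [[0.2, 0.1, 0.0], [0.5, -0.3, 0.4]]              # two object slots, width 3
action = [1.0, -1.0]                                     # 2-D action
action_embedding = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]  # hypothetical 3x2 matrix

concat_tokens = concat_conditioning(slots, action)   # 2 tokens of width 5
node_tokens = node_conditioning(slots, action, action_embedding)  # 3 tokens of width 3
```

With the action as its own token, attention can decide per object how much the action matters, instead of having the action baked into every object's features.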


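22:16 Experimental results: CLEVRER counterfactual reasoning and Push-T control

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"After we changed the action-conditioning method, treating them as a separated node, and we change the transformer to the bidirectional transformer, the performance gain is significant. It gains this 15% of absolute percentages."
“After we changed the action conditioning to treat actions as a separate node and switched to a bidirectional transformer, the gain was significant: 15 percentage points in absolute terms.”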


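25:57 PHYRE physical plausibility experiments and attention probes

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"By the training method of object masking, you keep asking the question to the model, what would happen if this doesn't exist? What should you consider to predict the masked token? It can learn the true dynamics."
“With object masking as the training method, you keep asking the model: what would happen if this object were not there? What do you need to take into account to predict the masked token? That is how the model can learn the true dynamics.”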
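
One way to picture object masking is as a data-preparation step plus a loss restricted to the masked positions. The sketch below assumes that the slots of selected objects are replaced by a mask token after a short observed history; the exact masking schedule used in Causal-JEPA is not given in these notes, and `mask_future_slots`, `masked_loss`, `history_len`, and `mask_token` are illustrative names.

```python
def mask_future_slots(slot_sequence, masked_objects, history_len, mask_token):
    """Hide chosen objects after the observed history and remember their true slots."""
    inputs, targets = [], {}
    for t, slots in enumerate(slot_sequence):
        row = []
        for obj, slot in enumerate(slots):
            if t >= history_len and obj in masked_objects:
                row.append(list(mask_token))        # model sees only the mask token
                targets[(t, obj)] = list(slot)      # ground truth kept for the loss
            else:
                row.append(list(slot))
        inputs.append(row)
    return inputs, targets


def masked_loss(predictions, targets):
    """Mean squared error computed only at the masked (time, object) positions."""
    total, count = 0.0, 0
    for (t, obj), target in targets.items():
        for p, y in zip(predictions[t][obj], target):
            total += (p - y) ** 2
            count += 1
    return total / count


# Three time steps, two objects, slot width 2.
slot_sequence = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [1.0, 0.9]],
    [[0.2, 0.0], [1.0, 0.8]],
]
inputs, targets = mask_future_slots(slot_sequence, {1}, history_len=1,
                                    mask_token=[0.0, 0.0])
perfect = masked_loss(slot_sequence, targets)  # ground truth as the prediction
```

Because the masked object's own future is hidden, the only way to reduce this loss is to use the other objects and the dynamics that link them.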


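28:36 Formal assumptions and answers to key questions

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"The largest limitation is coming from the object-centric encoder. The object-centric representation does not work really well on the occlusion situation. And, in the middle of the video, some objects can appear and disappear. But this slot attention cannot handle this scenario really well."
“The biggest limitation comes from the object-centric encoder. Object-centric representations do not work well under occlusion, and objects can appear and disappear partway through a video, which slot attention does not handle well.”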


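33:27 LeWorldModel: the motivation for a minimal JEPA

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"It's just a simple JEPA that doesn't use any tricks. So there is no Exponential Moving Average, no masking, no stop gradient, no pretrained encoder, and also no unstable loss. Why? Because we have a single hyperparameter."
“It is just a simple JEPA that uses no tricks: no exponential moving average, no masking, no stop-gradient, no pretrained encoder, and no unstable loss. Why? Because we have a single hyperparameter.”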


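39:09 LeWorldModel architecture and pseudocode

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"So if you look at the pseudocode on the right, it's actually not that much pseudocode. It's literally the true code. ... at the return, I have a single hyperparameter lambda. So this is the only stuff you need to tune."
“The pseudocode on the right is hardly pseudocode at all; it is literally the real code. ... In the return statement there is a single hyperparameter, λ, and it is the only thing you need to tune.”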
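
The slide itself is not reproduced in these notes, so the following is only a guess at the overall shape of such a loss: a latent prediction error plus λ times a regularizer, with the same encoder used for both current and next frames. The toy `encode` and `predict` callables and the `variance_penalty` stand-in are assumptions; in particular, `variance_penalty` is a simple placeholder and not the regularizer used by the authors.

```python
def mse(a, b):
    """Mean squared error between two equally sized batches of vectors."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def variance_penalty(batch):
    """Stand-in regularizer: squared gap between each dimension's variance and 1."""
    n, dim = len(batch), len(batch[0])
    penalty = 0.0
    for d in range(dim):
        mean = sum(z[d] for z in batch) / n
        var = sum((z[d] - mean) ** 2 for z in batch) / n
        penalty += (var - 1.0) ** 2
    return penalty / dim


def world_model_loss(encode, predict, regularizer,
                     frames, actions, next_frames, lam):
    """Prediction loss in latent space plus lam times a latent regularizer."""
    z = [encode(f) for f in frames]
    z_next = [encode(f) for f in next_frames]  # same encoder, no separate target copy
    z_pred = [predict(zi, a) for zi, a in zip(z, actions)]
    return mse(z_pred, z_next) + lam * regularizer(z + z_next)


# Toy encoder and predictor, assumed purely for illustration.
encode = lambda frame: [2.0 * x for x in frame]
predict = lambda z, action: [zi + action for zi in z]

loss = world_model_loss(encode, predict, variance_penalty,
                        frames=[[0.0], [1.0]], actions=[2.0, 2.0],
                        next_frames=[[1.0], [2.0]], lam=0.1)
```

In this toy setup the prediction term is zero, so the returned value is entirely `lam` times the regularizer, which makes the role of the single weight easy to see.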


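40:42 SIGReg: projection-based Gaussian regularization in high dimensions

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"There is a theorem called Cramér-Wold theorem that says that if you optimize the marginals to be Gaussian, then the joint is going to be Gaussian."
“There is a theorem, the Cramér-Wold theorem, which says that if the one-dimensional marginals along every direction are Gaussian, then the joint distribution is Gaussian too.”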
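
The projection idea can be sketched as follows: draw random unit directions, project the embeddings onto each one, and measure how far each one-dimensional sample is from a standard Gaussian. The notes do not say which one-dimensional statistic the authors use, so the characteristic-function distance below is an assumption, as are the function names, the integration grid, and the number of directions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gaussian_cf_distance(samples, t_max=4.0, steps=41):
    """Weighted squared gap between the empirical characteristic function
    of 1-D samples and that of N(0, 1), integrated on a grid."""
    dt = 2 * t_max / (steps - 1)
    total = 0.0
    for k in range(steps):
        t = -t_max + k * dt
        real = sum(math.cos(t * x) for x in samples) / len(samples)
        imag = sum(math.sin(t * x) for x in samples) / len(samples)
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
        total += ((real - target) ** 2 + imag ** 2) * target * dt
    return total


def sketched_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average the 1-D Gaussianity gap over random projection directions."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    score = 0.0
    for _ in range(num_directions):
        direction = random_unit_vector(dim, rng)
        projected = [sum(d * x for d, x in zip(direction, z)) for z in embeddings]
        score += gaussian_cf_distance(projected)
    return score / num_directions


rng = random.Random(1)
gaussian_batch = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(512)]
collapsed_batch = [[0.0] * 4 for _ in range(512)]  # every embedding identical

gaussian_score = sketched_gaussian_penalty(gaussian_batch)
collapsed_score = sketched_gaussian_penalty(collapsed_batch)
```

A collapsed batch projects to a single point in every direction and is penalized heavily, while an isotropic Gaussian batch scores close to zero; that gap is what lets a penalty of this kind discourage collapse.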


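43:20 Evaluating world models: online control methods

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"Because your predictor is differentiable, you can for instance backpropagate until the action try to -- sequence of action to minimize the distance with your goal."
“Because the predictor is differentiable, you can backpropagate all the way to the action sequence and optimize it to minimize the distance to your goal.”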
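
Backpropagating through a differentiable predictor to the actions can be shown on a scalar toy model. The linear latent dynamics z' = a·z + b·u, the squared-distance cost, and the hand-written backward pass below are all assumptions made so the example runs without an autodiff library; a real setup would use a learned predictor and automatic differentiation.

```python
def rollout(z0, actions, a=0.9, b=0.5):
    """Roll the toy latent dynamics z' = a*z + b*u forward over the actions."""
    z = z0
    for u in actions:
        z = a * z + b * u
    return z


def plan_by_gradient(z0, goal, horizon=5, steps=200, lr=0.1, a=0.9, b=0.5):
    """Gradient descent on the action sequence to bring the final latent to the goal."""
    actions = [0.0] * horizon
    for _ in range(steps):
        z_final = rollout(z0, actions, a, b)
        grad_z = 2.0 * (z_final - goal)  # d cost / d z_H for cost = (z_H - goal)^2
        grads = [0.0] * horizon
        # Reverse pass through the rollout, one time step at a time.
        for t in reversed(range(horizon)):
            grads[t] = grad_z * b        # d z_{t+1} / d u_t = b
            grad_z *= a                  # d z_{t+1} / d z_t = a
        actions = [u - lr * g for u, g in zip(actions, grads)]
    return actions


plan = plan_by_gradient(z0=0.0, goal=1.0)
reached = rollout(0.0, plan)
```

The reverse loop is the same chain rule an autodiff framework would apply; only the action sequence is updated, while the model parameters stay fixed.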


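46:57 Control results and the planning-time advantage

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"We can go to a full planning time under the second, which is very nice. It's almost 50 times faster."
“We can bring the total planning time under one second, which is very nice; that is almost 50 times faster.”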


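50:45 Intuitive-physics probes and violation-of-expectation experiments

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"But if suddenly the cube teleports, then the prediction error shoot a lot, meaning that your world model didn't predict that. Some people say to me often that yeah, but it's just out-of-distribution. And I would say it's true, but I think it's not very meaningful to say that because as human, when you violate your model, it's also very out-of-distribution."
“But if the cube suddenly teleports, the prediction error shoots up, which means the world model did not expect it. People often tell me, 'That is just out-of-distribution.' That is true, but I don't think it is a very meaningful objection, because when something violates a human's model of the world, that is also out-of-distribution.”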
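
The "prediction error as surprise" probe can be mimicked with a very small stand-in model. Here a constant-velocity extrapolator plays the role of the world model and a 1-D position plays the role of the latent state; both are assumptions for illustration, as is the teleport of 2.0 units at frame 6.

```python
def predict_next(prev, curr):
    """Stand-in world model: assume the object keeps its current velocity."""
    return curr + (curr - prev)


def surprise_trace(positions):
    """Absolute prediction error at each frame, starting from the third frame."""
    return [abs(predict_next(positions[t - 2], positions[t - 1]) - positions[t])
            for t in range(2, len(positions))]


normal = [0.1 * t for t in range(10)]  # cube moving at constant speed
teleport = list(normal)
for t in range(6, 10):
    teleport[t] += 2.0                 # cube jumps 2.0 units at frame 6

normal_trace = surprise_trace(normal)
teleport_trace = surprise_trace(teleport)
```

The error stays near zero on the physically consistent sequence and spikes at the teleport frame, which is the signal a violation-of-expectation probe looks for.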


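54:24 t-SNE visualization and decoding future predictions

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"If you are very careful, you can see that at frame 15 and 20, the angle of the gripper is opposite. And so basically, you can see that the world model didn't learn the rotation of the gripper, which was pretty interesting, because it still was able to solve somewhat the environment."
“If you look closely, at frames 15 and 20 the gripper angle is reversed. So the world model did not learn the rotation of the gripper, which is interesting, because it was still able to solve the environment to some extent.”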


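56:33 LeWorldModel limitations and the stable-worldmodel library

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"For instance, when you think about oh, I need to go to the airport, you think at a different hierarchy. ... So we need that as well to be able to predict further in the future."
“For example, when you think 'I need to go to the airport', you are thinking at a different level of the hierarchy ... We need that kind of hierarchy too in order to predict further into the future.”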


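59:43 Q&A: world models and physical AI, whether masking is necessary, planning vs. policies

Key points

Detailed notes

💬 Highlight clip (verbatim, then edited for clarity)

"As human, why you are very good at what you do is because you can predict what is the consequence of your action in the real world. And that's what world model try to do. VLA don't do that. So if you want to have physical AI, basically, you need world model. You cannot bypass that."
“Humans are good at what they do because they can predict the consequences of their actions in the real world, and that is what a world model tries to do. VLAs do not do that. So if you want physical AI, you need a world model; you cannot bypass it.”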


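Glossary

Term: Explanation
JEPA (Joint-Embedding Predictive Architecture): Predicts future states in latent space rather than generating pixels directly, aiming to model world dynamics while ignoring irrelevant detail.
World Model: A function that takes the previous state and an action and predicts the next state; it can be viewed as a simulator of the environment.
Causal-JEPA: A causal JEPA variant that uses object-centric representations and object masking to train the model to capture temporally directed predictive dependencies between objects.
Object-Centric Representation: Decomposes a scene into separate per-object representations rather than patch-based features.
Slot Attention: A learned mechanism that binds features to a set of object slots, producing object-aligned representations.
EMA (Exponential Moving Average): Used to update the target encoder slowly; a common trick for preventing representation collapse.
Stop Gradient: Blocks gradients from flowing into the target encoder; another way to prevent collapse.
V-JEPA / V-JEPA 2: Video-based JEPA that uses spatiotemporal masking, EMA, and related measures; version 2 adds action-conditioned post-training for control.
DINO World Model: A simplified world model that uses a frozen DINOv2 encoder for patch representations and a causal transformer to predict future states.
Energy-Based Model (EBM): Learns an energy function that assigns low energy to compatible pairs (plausible futures) and high energy to incompatible ones; JEPA can be understood as a kind of EBM.
SIGReg (Sketched Isotropic Gaussian Regularizer): An isotropic Gaussian regularizer based on random projections; it pushes the one-dimensional marginals along many random directions toward a Gaussian so that the latent distribution as a whole becomes Gaussian, which prevents collapse.
Cramér-Wold Theorem: States that a multivariate distribution is uniquely determined by its one-dimensional projections along all directions; the theoretical basis for SIGReg.
LeWorldModel: The proposed minimal JEPA implementation; it uses no EMA, masking, stop-gradient, or pretrained encoder, and is trained end to end with only SIGReg and a single hyperparameter.
PLDM: An end-to-end self-supervised world model method that prevents collapse with multiple losses such as VICReg, at the cost of many loss terms and difficult tuning.
Object Masking: In Causal-JEPA, deliberately masking the future states of some object slots so that the model must infer them from the other objects and thereby learns inter-object dynamics.
Influence Neighborhood: The minimal sufficient set of objects that must be attended to in order to predict a masked token correctly; in other words, a predictively sufficient set.
Proprioception: Usually the robot's own state signals, such as joint positions and velocities; many methods rely on this information, but LeWorldModel does not use it.
Model Predictive Control (MPC): A classical control method that uses a model to predict the future and optimizes a control sequence; JEPA world models combine naturally with MPC for planning.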

Further thoughts
