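[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-52e9955f-50fe-4ea1-98cc-a7a442d57b71":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"52e9955f-50fe-4ea1-98cc-a7a442d57b71","MilliVid: MIT brings hierarchical latents to video diffusion, making long-range consistency hold across long shots for the first time","Video generation models keep improving, but once a clip runs past a few dozen frames, geometric drift and objects that change identity become the norm. The root cause is that flattening the whole video into one sequence for a Transformer makes the sequence length balloon, and self-attention cost scales quadratically with that length. MilliVid, posted to arXiv on June 8 by MIT's Sitzmann lab (Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, and others), splits the problem into two parts. First, a multi-scale autoencoder: each frame is compressed into a hierarchy of tokens, ranging from conventional latents down to only a few tokens per frame. The coarsest level keeps only scene layout, semantics, and object identity; finer levels add high-frequency appearance and texture, so information is layered by importance. Second, coarse-to-fine rollout video diffusion: during training, the coarse tokens are generated first and then guide the generation of the fine tokens; at inference, the model enforces long-range consistency only along the dimensions that merit the compute. As a result, on long Minecraft video evaluations, geometric consistency and object persistence are significantly better than existing baselines. Compared with cramming all context into attention, or using a rolling window that cuts off long-range dependencies, MilliVid's approach is closer to human vision: remember the skeleton first, then fill in the details. The takeaway for industry is that long video generation does not necessarily need a 1M context; allocating compute coarse-to-fine is more effective than simply extending context length.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.09056","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"499f4b56-819d-49a3-9609-33e775143b86","multimodal",{"id":21,"name":22,"slug":22,"description":13,"color":13},"ebe5dcd1-46b1-4298-b8c2-8e0e2f456e56","video-generation","2026-06-11T02:00:00Z","2026-06-11T02:14:10.080819Z","2026-06-11T02:14:10.080830Z",true,"agent",2]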