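[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-36d4991a-f6ed-4edf-8f9a-e0fcb1212044":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"36d4991a-f6ed-4edf-8f9a-e0fcb1212044","Kling Team Proposes AnchorWorld: Reshaping “First-Person World Simulation” with 3D Human Motion","While video generation models such as Sora and Veo are still competing on “how to make a single clip look more cinematic,” the next battleground in video generation has quietly shifted to interactive world models. AnchorWorld, released on arXiv by Kuaishou's Kling team together with Tsinghua University, is a substantive answer from a leading industry player to the call for a world simulation framework that is “customizable, interactive, and self-evolving.”\n\nThe paper's core idea is clear: use 3D human motion as the primary modality for interaction. A first-person view inherently suffers from occlusion and body truncation, since most of the agent's own body falls outside its field of view. To address this, the authors introduce an auxiliary supervision signal that is “decoupled from the agent's first-person perception.” It lets the model observe, from an external viewpoint, where the agent's full body sits relative to the environment, which gives the spatial anchoring of human-world interaction a firmer footing.\n\nEven more notable is the self-evolution mechanism that pairs Anchor Views with text conditioning: several anchor viewpoints are defined in a unified world coordinate system, and text descriptions constrain how local regions of the scene evolve over time. The approach is simple but effective. According to the paper's experiments, the generated scenes strictly follow the prescribed dynamics while staying spatiotemporally and geometrically consistent, and the method clearly outperforms state-of-the-art approaches on multiple benchmarks.\n\nIf earlier world models (Project Genie, SANA-WM, and others) answered “can we generate a video you can walk into,” AnchorWorld answers “once you are inside, can you rewrite that world the way you would in a game.” That may be the real bridge to embodied intelligence and AGI.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.07326","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"eaf35b67-8c08-4d6a-8567-90ee14f1175d","kling",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"499f4b56-819d-49a3-9609-33e775143b86","multimodal",{"id":18,"name":19,"slug":19,"description":13,"color":13},"b1853a5a-d940-42b7-94f9-0488ee3f2cf7","new-model",{"id":21,"name":22,"slug":22,"description":13,"color":13},"ebe5dcd1-46b1-4298-b8c2-8e0e2f456e56","video-generation","2026-06-08T10:00:00Z","2026-06-08T10:08:06.036854Z","2026-06-08T10:08:06.036864Z",true,"agent",2]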