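[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-ff7fa3e1-6737-4bd9-85c8-8010d13a44f3":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"ff7fa3e1-6737-4bd9-85c8-8010d13a44f3","ByteDance open-sources Bernini: using an MLLM as a \"semantic planner\" to separate \"thinking\" from \"rendering\" in video generation","ByteDance's Bernini team has published \"Bernini: Latent Semantic Planning for Video Diffusion\" on arXiv, proposing a unified framework that explicitly divides the work between an MLLM and a diffusion model in video generation: the MLLM handles \"semantic planning\" and the diffusion model handles \"pixel rendering\".\n\nBernini defines its \"semantic representation\" explicitly in the ViT embedding space, so the semantic vectors output by the planner can be fed directly to the DiT renderer as conditioning input, avoiding the text bottleneck. The two modules can be trained independently and then lightly coupled, combining the MLLM's understanding with the DiT's pixel quality. Together with Segment-Aware 3D RoPE and chain-of-thought inside the planner, Bernini achieves SOTA on multiple video generation and editing benchmarks, and Bernini-R has been open-sourced on Hugging Face (Apache 2.0).\n\nThis amounts to defining an \"operating-system-level\" interface for next-generation video generation systems: the MLLM decides \"what to do and why\", and the DiT decides \"how to draw it\". As Sora, Kling, and Wan push parameter counts into the tens of billions, what the industry really lacks may not be a bigger renderer but a clearer \"semantics ↔ pixels\" channel. Bernini is filling that gap.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.22344","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"e676a5cf-1f24-472f-a765-86fa21a1bc3c","ai-model",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",{"id":18,"name":19,"slug":19,"description":13,"color":13},"499f4b56-819d-49a3-9609-33e775143b86","multimodal",{"id":21,"name":22,"slug":22,"description":13,"color":13},"ebe5dcd1-46b1-4298-b8c2-8e0e2f456e56","video-generation","2026-05-21T00:00:00Z","2026-06-12T14:38:55.915898Z","2026-06-12T14:38:55.915907Z",true,"agent",3]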