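[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-42000848-8333-40ee-ad4e-b1123dfebb0c":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"42000848-8333-40ee-ad4e-b1123dfebb0c","LoomVideo open-sources a 5B unified video generation and editing model: Peking University's 'zero-overhead' editing mechanism cuts inference cost by 5.4×","Peking University's MSALab team has open-sourced LoomVideo, a 5B-parameter 'unified video foundation model' in which a single model supports four tasks: text-to-video generation, instruction-based editing, reference-image editing, and multi-image-to-video generation. The paper and weights (arXiv 2606.06042 \u002F Hugging Face) were released together, cutting model size by roughly 60% relative to earlier video editing models that routinely ran to 13B+ parameters.\n\nThe technical core is an MLLM + DiT + VAE stack: Qwen3-VL-8B replaces the conventional T5 text encoder, and the authors propose three targeted designs. Deepstack Injection extracts hidden states from every layer of the MLLM and injects them into the corresponding DiT layers, so that semantic guidance reaches the whole generation process; Scale-and-Add Conditioning scales the clean source-video latent by timestep and adds it directly to the noisy target, bypassing token concatenation and making the editing path 'zero extra overhead'; Negative Temporal RoPE assigns negative temporal indices to reference images, cleanly separating reference frames from target frames.\n\nThe headline result is a 5.41× inference speedup. Self-attention cost in video editing used to quadruple because the source video was concatenated to the target (doubling the sequence length quadruples the quadratic attention cost); LoomVideo's scheme is described as mathematically equivalent yet significantly cheaper to compute. Combined with FP8 \u002F INT4 quantization and the vLLM stack, the deployment bar for the 5B model drops to within reach of consumer GPUs. The paper also reports SOTA results in e-commerce and fashion generation scenarios.\n\nA caveat: unified generation and editing is still at an early stage. Instruction engineering, data mixing ratios, and long-range consistency across the four tasks still depend on active tuning by researchers, and industrial pipelines will still need further fine-tuning. But LoomVideo at least demonstrates one thing: video foundation models need not scale to 20B+ parameters to achieve quality, and an architectural 'zero-overhead' approach can reproduce frontier results in a small model. It is a clear snapshot of the 2026 video generation roadmap shifting from 'brute force works wonders' to 'ingenuity works wonders'.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.06042","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"7e89b5cc-57db-4f37-bc6d-28919a73931c","model-release",{"id":18,"name":19,"slug":19,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source",{"id":21,"name":22,"slug":22,"description":13,"color":13},"ebe5dcd1-46b1-4298-b8c2-8e0e2f456e56","video-generation","2026-06-15T21:30:00Z","2026-06-15T22:07:41.536124Z","2026-06-15T22:07:41.536135Z",true,"agent",2]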