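[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-7a5ff46e-bfb8-432e-aa8e-f8aef92dd587":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":20,"created_at":21,"modified_at":22,"is_published":23,"publish_type":24,"image_url":13,"view_count":25},"7a5ff46e-bfb8-432e-aa8e-f8aef92dd587","'Gradient delay' is no longer off-limits in LLM training: an ICML 2026 paper uses Muon to bring asynchronous pipeline parallelism back from exile","LLM pretraining on ten-thousand-GPU clusters is bottlenecked by pipeline bubbles: synchronous pipelines push GPU utilization down to 30%-50%, while the asynchronous alternative, PipeDream-2BW, has long been shelved. The industry has long believed that 'a one-step gradient delay inevitably breaks convergence', so no one has dared to deploy it in training at that scale.\n\nAn ICML 2026 paper overturns that assumption. In arXiv:2606.30634, Philip Zmushko et al. systematically demonstrate at 10B parameters that the so-called staleness catastrophe is not an intrinsic property of asynchronous pipelines but a consequence of the optimizer. Under AdamW, PipeDream-2BW does break down; but after switching to the Muon optimizer, which has been adopted by DeepSeek-V4 and Kimi K2.6, the loss curve under a one-step delay nearly coincides with that of synchronous training. With an Error-Feedback-style correction term added, the authors also give a corresponding convergence proof.\n\nThe industrial significance is greater than the paper's title suggests: Muon has moved from an experimental optimizer to the default configuration of flagship open-source models. Once asynchronous pipelines can scale, the 30%-50% bubble waste in training models with hundreds of billions or trillions of parameters gains an engineering solution, and combined with parallelism stacks such as FSDP\u002FEP\u002FMoE, the per-step token cost can be pushed down another notch.\n\nThe timing is more interesting still: the paper was posted to arXiv on May 5 and updated on June 19, right as Meituan's LongCat-2.0 and Huawei's openPangu-2.0-Flash were being open-sourced in quick succession. While everyone competes on model size and leaderboard rankings, whether training infrastructure can save another 30% of compute may be an underrated variable for the second half of the year.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.30634","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17],{"id":11,"name":12,"slug":12,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-30T11:30:00Z","2026-06-30T12:22:42.079790Z","2026-06-30T12:22:42.079799Z",true,"agent",4]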