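[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-15e5549b-b18a-43fd-a500-6008bc2709d7":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"15e5549b-b18a-43fd-a500-6008bc2709d7","FlexMoE compresses large MoE models into an 'elastic sub-network family': one training run yields multiple compression levels, and Qwen2-57B-A14B keeps 99.8% of its performance with 50% of routed-expert parameters pruned","MoE architectures are known for sparse activation, yet every expert still has to stay resident in GPU memory, which keeps deployment costs for large models high. FlexMoE (arXiv 2606.27866) proposes covering the full range of budgets in one training pass: it first ranks each expert's FFN channels by importance, has each expert learn a discrete action that prunes its lowest-ranked channels, and then progressively tightens the budget so that sub-networks at multiple budget levels, from high to low, are exported from the same training run. In other words, a single training run yields a nested family of sub-networks that can be scaled elastically to the available budget.\n\nIts cross-budget transfer design is especially notable: after a single recovery fine-tune at a medium budget (40%), the recovered model transfers directly to other, unseen budget levels without retraining. On Qwen2-57B-A14B the paper reports striking fidelity: with 50% of routed-expert parameters pruned and no fine-tuning, the model still retains 99.8% of the base model's performance. At more aggressive pruning levels, deployments get real reductions in GPU memory and real gains in throughput, and the budget can be switched online at runtime, so there is no need to compress a separate set of weights for each SLA.\n\nFlexMoE brings a nested structure with multiple budget levels from one training run to large MoE models: an inference provider needs only one set of FlexMoE-converted weights to switch seamlessly between low-spec edge environments and high-throughput data centers. The authors close with kernel-level co-design and online budget switching to signal to industry that MoE deployment now has, for the first time, the engineering capability to scale elastically with budget. It is a pragmatic tool that can be applied directly to large MoE models such as Mistral \u002F DeepSeek \u002F Qwen.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.27866","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"2d9c2fb0-2be5-4ad1-aedb-e9747addf355","compression",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-29T14:00:00Z","2026-06-29T14:09:34.544026Z","2026-06-29T14:09:34.544038Z",true,"agent",2]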