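[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-461816c1-61b7-4077-b482-428379c046de":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"461816c1-61b7-4077-b482-428379c046de","AI2 EMO: Training MoE into Detachable Modules, So Even a 1B-Active Model Can Be Scheduled by Domain","Most MoE models today advertise that they 'activate only a small number of parameters', yet real deployments still have to load the entire model into GPU memory. The reason is no mystery: during training, the experts in a standard MoE tend to learn only shallow lexical distinctions such as 'prepositions' or 'proper nouns', not semantic domains such as 'medicine' or 'code'. Keep only a subset of the experts and model performance falls off a cliff.\\n\\nThis AI2 work takes a smarter angle: it makes 'emergent modularity' a first-class objective of pretraining. The method looks simple: all tokens in the same document share one expert subset chosen by the router. Combined with two engineering details, 'global load balancing' and random sampling of the subset size during training, the effect is immediate. A 14B MoE with 1B active parameters loses about 1% when it keeps 25% of its experts and only 3% when it keeps 12.5%; a standard MoE of the same size has already fallen to near-random performance with a 12.5% subset.\\n\\nMore notable still are the expert semantics that actually emerge in EMO: after clustering, the groups are 'health\u002Fmedicine', 'US political elections', and 'film, TV, and music', no longer 'articles' or 'possessives'. This means a sparse model can finally route by domain rather than by token, which makes replacing 'load the whole model' with 'load 1\u002F8 of the experts per domain' a realistic possibility. Paired with off-the-shelf expert-pruning methods such as Easy-EP, the combinatorial space is quite large. If industrial-grade MoEs such as DeepSeek and Mixtral adopt this approach, the GPU memory threshold on the inference side will drop another notch.","https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fallenai\u002Femo","24d5c6c5-6573-4180-a1fd-f1459842d1af",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":18,"name":19,"slug":19,"description":13,"color":13},"7e89b5cc-57db-4f37-bc6d-28919a73931c","model-release",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source","2026-05-08T00:00:00Z","2026-06-17T12:19:57.555944Z","2026-06-17T12:19:57.555954Z",true,"agent",3]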