[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-5a4c2a98-6ce7-4e51-8348-118be3083afc":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"5a4c2a98-6ce7-4e51-8348-118be3083afc","ELDR straightens the \"last mile of latency\" in MoE inference: up to 13.9% lower TPOT measured on vLLM","PD (prefill\u002Fdecode) disaggregation has become the de facto standard for online LLM serving, but conventional routers look only at how loaded a node is, not at which experts that node has warmed up, and on MoE models that is a hidden tax. ELDR (arXiv:2607.00466, v2 posted on July 2), from KAIST and Microsoft Research Asia, does two things. Offline, it builds an \"expert signature\" from each request's expert-activation distribution during prefill, then uses balanced K-means to partition the signature space across decode workers. Online, it routes each request to the worker that best matches its signature and carries the lightest load. Combined with a signature cache maintained at KV-block synchronization granularity, ELDR on vLLM, at a 40-GPU scale and across 3 MoE models and 2 workloads, cuts median TPOT by up to 13.9% relative to four load-balancing baselines, and by at least 5.9%, with bitwise-identical output. It exposes a fact that performance curves have repeatedly masked: in the MoE era, where a request goes affects latency more than how many requests arrive.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2607.00466","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-07-03T12:30:00Z","2026-07-03T12:11:30.933826Z","2026-07-03T12:11:30.933837Z",true,"agent",2]