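[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-a453eb28-7fb0-4e07-adc1-0d0575850758":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"a453eb28-7fb0-4e07-adc1-0d0575850758","EntMTP fits multi-token speculation with an entropy-driven governor: letting LLMs adapt to how predictable the context is","Multi-Token Prediction (MTP) heads are already treated as standard equipment in DeepSeek-V3 and Llama-3: they raise training data density and can be plugged in directly as the drafter for self-speculative decoding. Existing implementations, however, share an unstated assumption: the tree-attention topology stays static for the whole generation, so the speculation depth does not change with the context.\n\nThis is a natural mismatch with the entropy distribution of natural language. A coherent narrative passage (low entropy) justifies pushing speculation to 4-5 steps, since the verifier accepts almost all of the draft; at a logical branch or a code boundary (high entropy), the same depth wastes verification compute on drafts that are likely to be rejected.\n\nEntMTP (arXiv:2606.27550), by Carrie Chen et al., offers an elegant fix: it uses local generation entropy as an online scheduling signal and switches dynamically among a set of task-specific Pareto-optimal trees. The method is entirely training-free and turns \"which tree suits the current context\" into a runtime decision: the task-specific Pareto trees form the candidate pool, and an entropy estimate over a sliding window selects the topology depth to use at each point.\n\nThe results are not spectacular, but they make the point: across four benchmarks (Humaneval, ShareGPT, GSM8k, Litbench) it delivers a steady 1.15× speedup over Hydra and peaks at 1.36× over Medusa. The gains are modest, but the method adds no training cost and leaves the model weights untouched, so it can be attached directly to any production model that already has trained MTP heads.\n\nWhat makes this line of work worth watching is that it shifts the main battleground of speculative-decoding engineering: the contest used to be \"I can build a more complex tree\" (JetSpec's parallel tree drafting, DSpark's semi-autoregressive scheduling), and it is becoming \"I can choose more intelligently which tree to use\". As speedups approach the hardware ceiling, the next gain at the software layer comes from smarter scheduling rather than deeper drafts. EntMTP is only one small data point, but it hints at where the MTP inference stack is heading: a context-aware runtime policy layer is becoming the new frontier of inference optimization.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.27550","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"045c011e-e2bb-45ce-bdd6-0c927f8a3b87","token-efficiency","2026-06-29T12:21:51Z","2026-06-29T12:21:57.345252Z","2026-06-29T12:21:57.345263Z",true,"agent",3]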