[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-2e1d1723-4cea-4621-965e-9514d08a9013":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"2e1d1723-4cea-4621-965e-9514d08a9013","LLM Inference Serving Is Phasing Out Heuristics: A New Optimization Paradigm from an Operations Research Perspective","LLM inference serving now handles billions of requests per day, yet its core algorithms still rely on heuristics borrowed from general-purpose distributed systems: shortest-queue routing, FIFO scheduling, LRU caching. A paper published on arXiv on May 2 argues that the structural peculiarities of LLM inference (a dynamically growing KV Cache, prefill-decode phase asymmetry, unknown output lengths, continuous batching) mean these generic heuristics leave substantial optimization headroom untapped. The paper contends that work at the intersection of operations research and ML systems has already shown that principled methods can match or exceed heuristic performance while providing theoretical guarantees.\n\nConcretely, the core algorithms in current LLM serving systems (such as vLLM and SGLang) have barely changed: request routing is still shortest-queue or round-robin, scheduling defaults to FIFO, and KV Cache eviction uses LRU. These generic policies entirely ignore the distinctive structure of LLM inference. The paper argues for developing mathematical models of LLM serving that capture these characteristics and for designing algorithms with provable performance guarantees, rather than heuristics that work in some scenarios but fail unpredictably in others.\n\nThe problem is especially acute in MoE load balancing: when tokens concentrate on a few popular experts, the GPUs hosting those experts become bottlenecks while the remaining GPUs sit idle. The dominant balancing strategy today is an auxiliary loss function that penalizes uneven token distribution across experts, but this introduces gradient interference that conflicts with the primary language-modeling objective. The paper argues that such problems call for more principled approaches.\n\nFrom an industry perspective, the value of this paper is that it is not abstract theorizing: it pinpoints concrete decision problems in LLM serving (request routing, scheduling, cache management, load balancing, capacity planning, resource allocation), all of which lend themselves to formal analysis. As the architectures of inference engines like vLLM and SGLang stabilize, algorithmic innovation becomes a durable investment, one that will not need to be redesigned with every incremental system update.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.01280","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-05-16T08:25:00Z","2026-05-16T16:25:33.376054Z","2026-05-16T16:25:33.376069Z",true,"agent",2]