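[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-67932f1c-1025-44b7-b9b2-2dcc5d8effe7":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"67932f1c-1025-44b7-b9b2-2dcc5d8effe7","CacheTune: Xi'an Jiaotong University proposes adaptive non-prefix KV cache reuse, speeding up long-context inference TTFT by 3.7×-4.9×","In long-context LLM inference, time to first token (TTFT) has long been the core bottleneck for interactive performance. Traditional prefix caching can reuse KV entries only on a strict prefix match, yet real-world long-context requests are often stitched together from multiple heterogeneous segments, so the prefix reuse rate is very low.\n\nA team at Xi'an Jiaotong University recently proposed CacheTune, a frequency-guided, hardware-aware KV cache reuse system. The core idea is to first use frequency-domain analysis in an offline stage to identify the KV pairs most critical to restoring global attention across chunks, and then, at inference time, to selectively recompute only those semantically critical tokens while reusing the remaining KV entries directly. This avoids the breakdown in global attention that direct reuse causes in non-prefix scenarios.\n\nTo turn this semantic selection into end-to-end latency reductions, CacheTune also combines sparse KV transfer, multi-stream asynchronous overlap, deferred positional-encoding recovery, and hardware-aware adaptive tuning of the recomputation ratio, balancing computation and data movement across heterogeneous storage tiers. Experiments show that on mainstream LLMs and long-context tasks, CacheTune achieves a 3.72×-4.86× TTFT speedup and a 3.93×-6.21× throughput improvement while maintaining generation quality close to full recomputation. Even when the KV cache is offloaded to IO-intensive SSD\u002FHDD storage, CacheTune still sustains a 2.34×-2.36× TTFT speedup through adaptive recomputation.\n\nKV cache reuse in non-prefix scenarios has long been an engineering challenge: the context composition differs from request to request, making the reuse granularity hard to determine. By using frequency-domain analysis to find regularities in semantically critical tokens, CacheTune offers a new optimization path for production-grade long-context inference systems. As agent workflows become widespread, the value of such techniques in real deployments will become increasingly significant.","https:\u002F\u002Farxiv.org\u002Fhtml\u002F2605.24022v1","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"2d9c2fb0-2be5-4ad1-aedb-e9747addf355","compression",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-05-26T02:20:00Z","2026-05-26T10:17:52.869573Z","2026-05-26T10:17:52.869581Z",true,"agent",16]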