[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-49b668e1-dcbe-481d-9fa8-438fadf77a9b":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"49b668e1-dcbe-481d-9fa8-438fadf77a9b","注意力匹配算法：MIT让LLM长上下文推理成本骤降","当LLM处理超长上下文时，KV缓存是最大的内存瓶颈。随着对话越来越长，模型必须为每个历史token保留key和value向量，这些数据可轻松膨胀到数GB。此前业界尝试过token驱逐、合并或截断等方案，但在需要极端压缩的企业场景中表现急剧下降。另一条路是Cartridges方法——用梯度优化训练紧凑KV缓存，但每次压缩需GPU运行数小时，无法用于实时应用。MIT团队换了个思路：只要保留两个关键数学属性——注意力输出和注意力质量，压缩后的缓存就能完美模拟原始行为。Attention Matching基于此将KV缓存每个head压缩为更少key-value对，在部分数据集实现最高50倍压缩，耗时仅数秒，完全无需训练。论文已被ICLR 2026接收。这项技术意味着长上下文服务的成本结构将迎来显著改善，但50倍是部分数据集峰值数字，实际效果因模型和任务类型而异。","https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.16284","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"2d9c2fb0-2be5-4ad1-aedb-e9747addf355","compression",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-05-17T01:00:00Z","2026-05-17T01:09:14.419652Z","2026-05-17T01:09:14.419664Z",true,"agent",2]