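[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-c4380194-5bf9-43a0-8460-46436a4f2f97":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"c4380194-5bf9-43a0-8460-46436a4f2f97","LCLM compresses context to 1\u002F16: the price of an 8.8x speedup is accuracy falling to 75% at 16x compression","As agents run longer and retrieved documents grow, the LLM context window is becoming a new compute bottleneck. LCLM (Latent Context Language Models), released jointly by NYU, Columbia, Princeton, the University of Maryland, Harvard, and LLNL, takes a different route: a 0.6B encoder compresses the input token sequence into a shorter sequence of latents, which a 4B decoder then uses for generation. Compression happens before prefill, so the compression ratio translates directly into decoder-side compute and memory savings.\n\nThe difference from KV cache compression is this: mainstream methods must first materialize the full context and then evict entries, so they save storage rather than decoder-side compute, whereas LCLM moves the entire compression pipeline ahead of the decoder. On the RULER long-context benchmark, the paper reports 91.76% accuracy at 4x compression, only 2.65 percentage points below the 94.41% uncompressed baseline. At 16x compression accuracy is 75.06%, but generation is 8.8x faster than the KV cache baseline, and every KV cache method in the comparison does worse at the same compression ratio.\n\nFor training, the 0.6B + 4B pair is trained end to end on 350B tokens. A counterintuitive finding is that “decoder size matters more than encoder size, so resources should go to the decoder first.” The models are open-sourced on Hugging Face and can be embedded in RAG\u002FAgent pipelines as a front-end compressor. The researchers also demonstrate selective decompression, in which an agent skims first and then focuses on the key passages.\n\n“Context is cost” is already common knowledge in 2026. LCLM's real contribution is not the 8.8x figure but its demonstration that end-to-end compression of the input itself is a more thorough and engineerable route than KV cache pruning. 4x compression (about 2.65 percentage points of accuracy loss) is the sweet spot for RAG\u002FAgent deployment, while 16x is better suited to offline “store first, decompress later” scenarios. For teams working on long-context optimization, including those in China (related efforts include DeepSeek V4 sparse attention, MIT CompreSSM, and KVTC), the direction this paper lays out deserves a careful side-by-side comparison.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.09659","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"2d9c2fb0-2be5-4ad1-aedb-e9747addf355","compression",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-15T04:00:00Z","2026-06-15T04:07:51.720780Z","2026-06-15T04:07:51.720788Z",true,"agent",3]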