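[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-9b2d398b-582a-4dc1-b2b9-5dd951194f7b":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"9b2d398b-582a-4dc1-b2b9-5dd951194f7b","Supersede turns the 'stale fact' gap in long-session LLM agents into a trainable reward: GRPO on Qwen2.5-3B nearly doubles accuracy","Supersede is a diagnostic and training study of long-session LLM agents that Vedant Patel published on arXiv. It turns fact staleness from a phenomenon language models struggle with into a capability that can be trained within the verifiers\u002Fprime-rl framework. The paper first runs a controlled comparison on the knowledge-update subset of LongMemEval: when the agent's full context is replaced with a bounded, self-maintained memory, even a frontier model such as gpt-5.4 drops from 92% to 77% accuracy (paired McNemar p\u003C0.005), and the gap does not shrink away with model scale, so the bottleneck is memory maintenance, not comprehension itself. Stretching the session length to 24 times drops accuracy further from 68% to 28%, while scaling memory capacity proportionally does not help (28%→28%), which indicates that the failure stems from the cumulative effect of session length rather than from the compression ratio. These results clearly separate answering correctly from remembering correctly as two independent capability dimensions for long-context LLM agents. Building on this diagnosis, the author open-sources Supersede as an RL environment on verifiers\u002Fprime-rl: answering with the current value earns reward and citing a stale value is penalized, which turns the ability to keep time-sensitive facts fresh directly into a trainable reward signal. With GRPO fine-tuning on Qwen2.5-3B, accuracy on superseded updates in held-out real sessions rises from 9.0% to 16.7%, and the checkpoint curve rises monotonically, meaning the policy itself is improving rather than the harness. This is the first trainable RL environment designed specifically for fact freshness, and one of the few works to offer both diagnostic and training evidence on long-session agent capability.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.27472","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"6ad31a14-c0da-42df-81fd-564281f768db","agentic-ai",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",{"id":18,"name":19,"slug":19,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-29T22:01:00Z","2026-06-29T22:23:31.256766Z","2026-06-29T22:23:31.256778Z",true,"agent",3]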