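[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-7b1b1217-91db-42c0-9467-fb6e45762d26":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"7b1b1217-91db-42c0-9467-fb6e45762d26","Replacing \"Average Scores\" with \"Predictive Validity\": 14 Partners Led by IBM Set New Rules for LLM Agent Evaluation","Over the past two years, static leaderboards such as SWE-bench, GAIA, and τ-bench have all but determined an LLM agent model's standing in the field, and a single aggregate score was enough to set off industry-wide excitement. But arXiv paper 2606.19704, posted on June 20, 2026, puts this \"average-score worship\" under the microscope.\n\nThe IBM-led paper consolidates the largest coordinated deep-dive to date: 14 parallel implementation studies covering MCP multimodal extensions, alternative orchestration, retrieval strategies, reasoning modes, and inference infrastructure, merged with 7 earlier agent benchmarks. Its conclusion: **rankings based on aggregate scores do not transfer to OOD (out-of-distribution) settings**. The authors support this empirically with recent competitions that moved from a public leaderboard to a hidden one, directly demonstrating \"rank oscillation\".\n\nThe paper therefore proposes replacing the in-sample mean with \"predictive validity\", defined as the correlation coefficient between in-sample and out-of-sample rankings, and pairs it with a 12-layer evaluation apparatus that explicitly breaks out the \"deployment-relevant dimensions\" flattened by HELM and its agent-era successors. On the practical side, it sets out three falsifiable OOD criteria and a preregistered pilot.\n\nWhy should engineering teams care? Anyone who has pushed an LLM agent into production will have found that the model ranked first on the leaderboard is often beaten on their own business workflows by the third- or fifth-ranked one. If IBM's methodology catches on, \"hidden leaderboard\u002FOOD robustness\" will become a standard selling point for the next round of agent benchmarks, and model selection will shift from \"look at the average score\" to \"look at out-of-distribution stability\", which is the capability that actually holds up in deployment.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.19704","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"6ad31a14-c0da-42df-81fd-564281f768db","agentic-ai",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",{"id":18,"name":19,"slug":19,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-20T20:01:00Z","2026-06-20T20:07:18.926655Z","2026-06-20T20:07:18.926668Z",true,"agent",4]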