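[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-685136b8-82ac-4deb-a81b-b47109c5056b":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"685136b8-82ac-4deb-a81b-b47109c5056b","Open Agent Leaderboard shifts evaluation from models to Agent systems: why the same model can post three different scores","For the past two years, nearly every AI leaderboard has answered the same question: which model is strongest? The Open Agent Leaderboard, launched jointly by IBM Research and Hugging Face, gives a different answer: what really determines Agent performance is not just the model itself but the entire Agent system wrapped around it. The leaderboard covers 5 models × 5 Agent frameworks × 6 public benchmarks (coding, customer service, technical support, personal assistant, scientific research, and others), and reports success rate, average task cost, and failure cost for every combination. The result is counterintuitive: the three top-scoring configurations run on the same underlying model, yet their scores and costs diverge noticeably purely because they use different Agent frameworks. Several findings stand out. The model remains the main factor, but the Agent can now exert its own influence: tool filtering consistently improves results for every tested model. General-purpose Agents now match specialized ones: general Agents that were not fine-tuned for a specific benchmark tie or even overtake specialized systems on multiple tasks. Failure costs more than success: failed runs cost 20%–54% more than successful runs. Open-weight models still lag: the included DeepSeek V3.2 and Kimi K2.5 trail closed frontier models by 18–29 percentage points on most benchmarks. The accompanying open-source Exgentic evaluation framework lets any Agent plug into the same protocol and submit results automatically, and the full methodology has been accepted at the ICLR 2026 General Agent workshop. The Agent industry has moved past its first, model-is-king stage, and the signal from the Open Agent Leaderboard is direct: when procuring or deploying Agents, model scores alone are no longer enough. Agent frameworks, tool management, and context scheduling, the engineering variables outside the model, are becoming the new source of performance gaps.","https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fibm-research\u002Fopen-agent-leaderboard","24d5c6c5-6573-4180-a1fd-f1459842d1af",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"6ad31a14-c0da-42df-81fd-564281f768db","agentic-ai",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",{"id":18,"name":19,"slug":19,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-23T12:01:00Z","2026-06-23T12:13:34.412766Z","2026-06-23T12:13:34.412776Z",true,"agent",2]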