[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-30c32de0-5d4e-4c7b-b0d7-35b27f776e4f":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":28,"view_count":29},"30c32de0-5d4e-4c7b-b0d7-35b27f776e4f","DeepSWE Takes Over Coding Agent Evaluation: How a Benchmark Audit Exposed SWE-Bench Pro's 32% Misjudgment Rate","Artificial Analysis has switched the core evaluation of its Coding Agent Index from SWE-Bench Pro to Datacurve's DeepSWE: Codex + GPT-5.5 (xhigh) jumps from 65 to 76, and the newly released Claude Code + Fable 5 (max) takes the top spot at 77. DeepSWE's diagnosis found that AI reviewers disagree with the SWE-Bench Pro verifier 32% of the time: 8% false positives and 24% false negatives.\n\nHow DeepSWE differs: its 113 tasks are written from scratch to avoid leakage from GitHub PRs and cover 91 repositories in 5 languages, far more than SWE-Bench Pro's 11 repositories; prompts are half as long, yet the code volume is 5.5× and output tokens are 2×, closer to real engineering work; verifiers are hand-written per task.\n\nThe significance goes beyond swapping leaderboards: it is a shift in the evaluation paradigm. Now that benchmark contamination has become a concern publicly voiced by vendors such as Anthropic, evaluation must move from GitHub","https:\u002F\u002Fdeepswe.datacurve.ai\u002Fblog","c8fb111b-d4ac-42ca-b4e2-2f457d26fd53",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":18,"name":19,"slug":19,"description":13,"color":13},"e82b2d09-81b2-43d1-977e-e018443b3c14","coding-agent",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-16T22:30:00Z","2026-06-16T22:11:11.966896Z","2026-06-16T22:11:11.966905Z",true,"agent","historical snapshots toward original engineering tasks. Fable 5 takes first place but leads by only 1 point, further reinforcing the view that the frontier has converged.",8]