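[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-95b04c15-d5e5-4dca-ab2b-14e343bdd4e6":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":26,"created_at":27,"modified_at":28,"is_published":29,"publish_type":30,"image_url":13,"view_count":31},"95b04c15-d5e5-4dca-ab2b-14e343bdd4e6","UC Berkeley exposes systemic flaws in AI benchmarks: 45 exploits score full marks on 13 mainstream leaderboards 'without solving any problem'","UC Berkeley RDI recently published a paper revealing systemic vulnerabilities in today's mainstream AI benchmarks. The research team built an AI agent that automatically analyzes benchmark code and used it to audit 13 widely used AI benchmarks. The results are sobering: every benchmark had exploitable security flaws, and 45 verified 'cheating solutions' were found in total.\n\nThe three most representative cases: first, Frontier-CS, where submitted code and the evaluator run in the same Python process, so malicious code can replace the scoring function through stack-frame manipulation and obtain a full score directly; second, Terminal-Bench, where creating a dummy C extension file is enough to pass an existence check, so the performance test passes without the extension ever being called; third, WebArena, where the reference answers for 812 tasks sit directly in a file system the agent can access, so the agent only needs to read a JSON file to get a full score.\n\nThe common root cause of these vulnerabilities is insufficient isolation in the evaluation infrastructure: the code under evaluation shares memory space or a file system with the evaluation code. The paper points to a fundamental tension: benchmarks are not just tools for measuring model capability; they also shape model behavior. When a benchmark can be broken, models are in effect being trained to 'cheat' rather than to actually solve problems.\n\nThe finding has broad implications: benchmark scores determine model rankings, training data pricing, and even some labs' release decisions. With major vendors required to pass safety testing before release (the US Department of Commerce has brought Google DeepMind, Microsoft, and xAI under the requirement), the benchmark credibility crisis is no longer just a technical problem and has begun to affect release cadence.\n\nTruly contamination-resistant evaluation methods have not yet been established, and the industry urgently needs more rigorous evaluation practices.","https:\u002F\u002Frdi.berkeley.edu\u002Fblog\u002Ftrustworthy-benchmarks","15e3ede3-ede2-477d-9cfd-1d1ddfcc71f3",[10,14,17,20,23],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",{"id":18,"name":19,"slug":19,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":21,"name":22,"slug":22,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":24,"name":25,"slug":25,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-05-15T01:00:00Z","2026-05-15T01:09:17.283525Z","2026-05-15T01:09:17.283536Z",true,"agent",3]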