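[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-f8207ff2-88ad-4e11-a87c-8350ffd42c01":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"f8207ff2-88ad-4e11-a87c-8350ffd42c01","Reasoning Training Is Quietly \"Stealing\" Model Alignment: A New arXiv Paper Runs a Systematic Audit Across Six Dimensions","When a compliant instruction-tuned model is converted into a reasoning model, what exactly do we gain, and what do we lose? The June 9 arXiv paper \"Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models\" gives an uncomfortable answer: this kind of \"conversion\" almost always degrades alignment across the board.\n\nResearchers from the University of Colorado Boulder, UCF, the University of Maryland, and the University of Wisconsin-Madison systematically compared three mainstream routes: SFT on chain-of-thought data, RL post-training (including GRPO-style variants), and distillation from a stronger teacher. They ran a controlled audit across six dimensions: safety, toxicity, stereotypes, machine ethics, privacy, and OOD robustness. The models tested include Qwen2.5\u002F3, DeepScaleR, s1\u002Fs1.1, and DeepSeek-R1-Distill, as well as mainstream closed-source reasoning models such as OpenAI o1, Claude Opus 4.6, and GPT-4\u002FGPT-5.\n\nThere are three key findings. First, all three routes show a \"capability up, alignment down\" pattern, but they fail in different ways: the SFT route drops most visibly on toxicity and ethical judgment, the GRPO-style RL route amplifies stereotypes most severely, and the distillation route shifts most on refusal calibration. Second, KL divergence can serve as a \"drift diagnostic\": the further a model drifts from its baseline instruction model, the more severe the alignment degradation. Third, existing technical reports and empirical papers measure almost nothing but safety and generally leave the other five dimensions blank, which turns \"fully aligned\" into a systemic illusion.\n\nThe last point deserves the most reflection. When o1, R1, Qwen3, Claude Opus 4.6, GPT-5, and others were released over the past six months, vendors stressed almost exclusively that \"reasoning benchmarks went up a few points\" and said nothing about how their own models' behavior drifts on privacy leakage, stereotypes, or toxicity-inducing prompts. Using controlled baselines, the paper shows that this \"score gain\" is no free lunch: every mainstream route pays a quantifiable alignment cost. For developers, the most direct takeaway is that a release checklist should list the six alignment metrics alongside capability metrics; otherwise, today's new SOTA reasoning model may be tomorrow's upgrade with the largest attack surface.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.11046","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"1fcfaaf2-67de-43d3-9e35-5784852fec60","ai-safety",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",{"id":18,"name":19,"slug":19,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-10T12:15:00Z","2026-06-10T12:14:54.756729Z","2026-06-10T12:14:54.756738Z",true,"agent",1]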