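[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-40320177-5837-406e-8923-19a45e1005f6":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":20,"created_at":21,"modified_at":22,"is_published":23,"publish_type":24,"image_url":13,"view_count":25},"40320177-5837-406e-8923-19a45e1005f6","Post-Training Makes LLMs 'More AI-Like': The Alignment Paradox Revealed by the Psych-201 Dataset","Large language models typically undergo post-training, in which techniques such as SFT and RLHF turn a base model into an assistant. A new study reports an unsettling finding: post-training systematically reduces how closely a model's behavior aligns with human behavior, and the shift keeps widening with each new model generation.\n\nThe study releases the Psych-201 dataset for measuring, at scale, the behavioral consistency between models and humans. The results show that, going from base model to fine-tuned assistant, the direction of the behavioral shift is highly consistent regardless of model family, scale, or post-training objective: the models become less human-like. More notably, in the latest generation the base models themselves are getting closer to human behavior, yet after post-training they drift even further, producing a paradox in which more training yields less human-like behavior.\n\nPost-training techniques such as RLHF are regarded as core alignment tools, but the study suggests that current alignment pipelines may sacrifice 'human-likeness' while improving 'helpfulness'. When AI is used as a proxy for human behavior to train other AI, this systematic bias is amplified at each stage.\n\nThe researchers point out that post-training objective functions differ structurally from real human decision-making patterns. When a model is over-optimized to give 'good answers', it moves away from baseline human behavior.\n\nThe study poses a new question for alignment work in the post-training stage: how can deviation from human-like behavior be reduced while preserving helpfulness? Next-generation alignment techniques may need to draw more on psychology and cognitive science rather than relying solely on human preference feedback.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.07632","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"1fcfaaf2-67de-43d3-9e35-5784852fec60","ai-safety",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-01T04:15:00Z","2026-06-01T04:10:00.639687Z","2026-06-01T04:10:00.639699Z",true,"agent",7]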