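[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-fdc367bc-f5f9-4748-929a-25076a582a39":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":20,"created_at":21,"modified_at":22,"is_published":23,"publish_type":24,"image_url":13,"view_count":25},"fdc367bc-f5f9-4748-929a-25076a582a39","The Value Axis: The Hidden Dimension Inside LLMs That Tracks \"Am I Flailing?\"","Ask an LLM whether its current strategy will work out, and it will give all sorts of elaborate answers. What is more notable is that before it produces any output, its hidden layers already carry a \"value coordinate\" that quietly judges whether it has gone off track.\\n\\nThe arXiv paper \"The Value Axis: Language Models Encode Whether They're on the Right Track\" (2606.17056), posted June 15, offers an elegant tool: Nick Jiang, Isaac Kauvar, and Jack Lindsey use synthetic in-context RL data to construct a linear value axis in the activation space of Qwen3-8B.\\n\\nThe axis has striking explanatory power: along it one can separate high from low verbalized confidence, rollouts with and without backtracking, and correct from corrupted code. Three seemingly different \"confidence signals\" share one hidden dimension. Causal interventions back this up: steering toward \"high value\" suppresses self-correction and trims verbose explanation, while steering toward \"low value\" induces backtracking and exploration.\\n\\nThe DPO experiments are more intriguing still: strongly preference-optimizing a behavior also raises its \"internal value\", so the model is more confident after performing that behavior. In other words, RLHF\u002FDPO may change not only the outputs but also the model's hidden score for whether it is \"right\". In practice: Qwen automatically assigns low value to \"politically sensitive queries\", and SFT reliably raises internal value within the training distribution.\\n\\nThe value axis gives interpretability research a simple geometric handle: a single straight line that predicts whether the model is flailing. Promising next steps include using it to audit RLHF directly, to guide the exploration\u002Fexploitation trade-off at inference time, and even to give LLMs a readable introspection interface.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.17056","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-16T04:01:00Z","2026-06-16T04:12:52.734974Z","2026-06-16T04:12:52.734983Z",true,"agent",1]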