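[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-9b512d15-67fb-40b9-a4d3-8dca60abb836":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"9b512d15-67fb-40b9-a4d3-8dca60abb836","LOCOS fills in the \"write side\" of LLM interpretability: locating non-literal retrieval heads via OV-circuit projection","In long-context LLMs, \"understanding the context and answering in different words\" is far more common than \"copying verbatim\". Traditional attention-head detectors look only at the read side: they score how well the attended token matches the generated token. That misses exactly the heads doing the real synthesis work: these heads perform \"non-literal rewriting\" through the output of the OV circuit, and an attended-token match metric cannot capture OV output. LOCOS (Logit-Contribution Scoring), proposed in arXiv 2607.01002, brings this write side explicitly into view for the first time.\n\nIn implementation, LOCOS needs only a single forward pass: it projects each head's OV-circuit output onto the unembedding direction of the answer token, then contrasts needle positions with off-needle positions to obtain a write-aware importance score. On Qwen3-8B, mean-ablating the top 50 heads selected by LOCOS drives NoLiMa ROUGE-L from 0.401 down to 0.000, while the strongest baseline still retains 0.292; the same holds across three model families: Qwen3, Gemma-3, and OLMo-3.1.\n\nMore importantly, the heads LOCOS identifies are retrieval-specific: after ablation, parametric recall and arithmetic reasoning barely move, but on the same model MuSiQue falls from 0.55 to 0.08 and BABI-Long from 0.62 to 0.20; both are tasks heavy in non-literal retrieval. In other words, LOCOS not only localizes precisely but also assigns \"functional\" semantics to the heads it finds.\n\nThe long-term significance goes beyond interpretability itself. Analysis of LLM internal circuits has long been stuck at \"post-hoc observation\"; the LOCOS paradigm of \"score first, then validate in reverse with ablation\" turns those circuits into \"engineering objects that can be intervened on\". For researchers doing red-teaming, model diffing, or training diagnostics, this is one of the few tools that runs in minutes on an 8B model, and it is a step toward productizing mechanistic interpretability.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2607.01002","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"1fcfaaf2-67de-43d3-9e35-5784852fec60","ai-safety",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"4f214978-cac1-4f39-aa4b-f92a0d0934b7","transformer","2026-07-02T14:15:00Z","2026-07-02T06:20:40.150957Z","2026-07-02T06:20:40.150965Z",true,"agent",3]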