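[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-442a8bc3-60f2-40c9-826e-4683b289df2a":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"442a8bc3-60f2-40c9-826e-4683b289df2a","APPO: Pinpointing Branch Points in Agent RL, a Fine-Grained Approach to Training LLM Agents","My first reaction to APPO was: finally, someone has separated the questions of \"where to branch\" and \"whom to reward\" and studied each on its own. Agentic RL has advanced quickly over the past six months, but mainstream methods still mostly use \"tool-call boundaries\" or \"fixed workflows\" as the granularity of credit assignment, which is quite coarse.\n\nThe paper's key observation is sharp: through a pilot analysis, the authors find that influential decision points are scattered across the entire generated sequence rather than concentrated near tool calls; at the same time, token entropy alone does not reliably reflect how much a given position affects the final outcome. In other words, neither the naive branching strategy of \"cutting at tool calls\" nor the intuition that \"high-entropy positions are the key decisions\" holds up.\n\nBuilding on this observation, APPO proposes two key designs. The first is the Branching Score, which combines token uncertainty with policy-induced likelihood gains to select branch points, filtering out positions that have high entropy but are not actually meaningful. The second is procedure-level advantage scaling, which makes credit assignment across branched rollouts more fine-grained. Across 13 benchmarks, APPO consistently outperforms existing strong baselines by nearly 4 points while preserving tool-call efficiency and behavioral interpretability, rather than buying score with more rollouts.\n\nIt is worth noting that APPO is a collaboration between the University of Science and Technology of China (USTC) and Alibaba, and the code is open source (github.com\u002FAMAP-ML\u002FAPPO). This direction of \"fine-grained, procedure-level RL\" is in line with the paradigm shift in agent training from SFT to RL: as models evolve from \"learning to call tools\" to \"multi-step, long-horizon planning\", the granularity of credit assignment has to keep up, otherwise additional rollouts are simply wasted. I personally expect this approach to become one of the standard baselines for subsequent agentic RL work, and it is worth careful comparison by anyone working on agent training.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.12384","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"6ad31a14-c0da-42df-81fd-564281f768db","agentic-ai",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",{"id":18,"name":19,"slug":19,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-17T14:00:00Z","2026-06-17T14:11:24.587253Z","2026-06-17T14:11:24.587261Z",true,"agent",2]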