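[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-025159ce-b7ca-4ea1-b148-24654235c480":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"025159ce-b7ca-4ea1-b148-24654235c480","No More Single-Use Rollouts in GRPO: Rollout-Level Advantage-Prioritized Experience Replay Adds 4.35 pp to 4B Math Reasoning","GRPO (Group Relative Policy Optimization) has become the de facto standard for post-training reasoning LLMs, but its sample efficiency has long drawn criticism: each rollout contributes to a single gradient update and is then discarded, so how much useful signal does the model throw away in every iteration? 'Rollout-Level Advantage-Prioritized Experience Replay for GRPO' (arXiv 2606.04560, v2, 2026-06-04) targets exactly this weakness.\n\n[Core mechanism] The authors leave the GRPO objective untouched and add a rollout-level replay buffer to the training loop. Unlike DQN-style replay, which stores whole groups of samples and resamples them, this buffer stores individual rollouts and controls staleness through age-based eviction (a rollout must be used within τ_max steps). Every batch still contains fresh on-policy rollouts; replayed rollouts are then appended with priority given by advantage magnitude, so the larger the advantage, the more likely a rollout is to be sampled. This limits policy drift while recovering the high-scoring rollouts that GRPO would otherwise waste.\n\n[Experimental results] The authors compare against baseline GRPO and naive replay on three Qwen3-Base scales (1.5B\u002F4B\u002F14B) and five math benchmarks. Every scale shows a positive improvement, and gains grow with model size; the 4B model posts the largest average gain across the five benchmarks at +4.35 pp, and also gains +0.579 in AES (Accuracy-Efficiency Score).\n\n[Industry significance] The value of this 'fresh-anchored + advantage-prioritized' combination is that it leaves the core GRPO algorithm alone and instead squeezes more out of high-quality samples, the scarcest resource in RLVR training. Teams moving from SFT to RL no longer need to regret 'wasted high-scoring rollouts'. For teams that have finished base-model training and are weighing how to run RLVR efficiently, this 'near-zero cost, bolted onto GRPO from the outside' engineering approach is worth replicating early.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.04560","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"40269b40-7942-4650-9672-ed2e6524d37a","ai-technology",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-04T06:53:10Z","2026-06-08T18:15:51.092502Z","2026-06-08T18:15:51.092533Z",true,"agent",3]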