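[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-43a0972c-f040-483c-88cd-89b40a22723d":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"43a0972c-f040-483c-88cd-89b40a22723d","RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning Teaches LLMs \"Analogical Reasoning\" Rather Than \"Word-Matched Lookup\"","Traditional RAG retrieves by lexical or semantic similarity, but on complex reasoning tasks this \"look up references by wording\" approach often fails: a semantically similar problem may call for an entirely different solution, while a seemingly unrelated one may share the same reasoning pattern. The RA-RFT (Retrieval-Augmented Reinforcement Fine-Tuning) framework, recently posted on arXiv by Zilin Xiao et al., is a systematic attempt to switch the retrieval criterion from \"surface similarity\" to \"reasoning benefit\".\n\nRA-RFT's two-stage design is straightforward. Stage one uses gold-relevance distillation to train a retriever that ranks context by \"can this worked example supply transferable reasoning cues\" rather than by how close the embeddings are. Stage two feeds the retrieved analogical demonstrations to the policy model and applies reinforcement fine-tuning with verifiable outcome rewards, so the model learns to pick out truly reusable reasoning trajectories from examples that \"look unfamiliar\".\n\nThe authors' comparisons on math reasoning benchmarks such as AIME 2025 are solid: with RA-RFT, Qwen3-1.7B scores 7.1 points higher on average@32 than the GRPO baseline, and Qwen3-4B gains a steady 2.8 points. More notable is the paper's analysis of retrieval diversity: reasoning-aware retrieval naturally surfaces examples with complementary solution approaches, giving the same problem different \"scaffolds\", which is exactly what traditional similarity-based retrieval cannot provide.\n\nThe takeaway is that the authors explicitly state that reasoning-aware retrieval is orthogonal to reward design and training curricula. In other words, the effort spent over the past year on GRPO variants, process rewards, and curriculum learning has indeed paid off, but one axis remains seriously underrated: which examples the model gets to see. As RL post-training moves from data-driven to retrieval-driven, we are one step closer to \"small models inheriting the problem-solving intuition of large models\".","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.13680","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"fca9258a-9430-455a-b95d-b9fae5e373a8","ai-inference",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-06-12T18:30:00Z","2026-06-12T18:15:44.169545Z","2026-06-12T18:15:44.169554Z",true,"agent",2]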