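[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-7b26392a-4bb3-448f-881a-335ad8aee605":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"7b26392a-4bb3-448f-881a-335ad8aee605","QwQ-32B: An RL-Driven Open-Source Reasoning Model That Rivals the 671-Billion-Parameter DeepSeek-R1 with 32 Billion Parameters","In January this year, DeepSeek R1 drew wide attention for its reasoning ability, built on 671 billion parameters and a breakthrough in reinforcement learning. Now Alibaba's Qwen team has released QwQ-32B, a reasoning model with only 32 billion parameters that performs on par with DeepSeek R1 across a range of benchmarks.\n\nThe key to this result is a strategy of scaling RL. Like R1, QwQ-32B starts from a cold-start checkpoint and is trained on math and coding tasks with outcome-based rewards, using an accuracy verifier and a code execution server to assess solution quality. Math and coding performance improved steadily as training progressed, after which a second RL stage for general capabilities was added to broaden the model's generalization.\n\nThis second RL stage needs only a small number of steps to strengthen general capabilities such as instruction following and alignment with human preferences, without a significant drop in math and coding performance.\n\nMore notably, QwQ-32B integrates agent capabilities into the reasoning model, so it can call tools while reasoning and adapt based on environment feedback. This is an important step toward agentic reasoning: reasoning is no longer only \"thinking\" but can also \"act\".\n\nAs an open-source model released under the Apache 2.0 license, QwQ-32B matches the performance of the 671-billion-parameter DeepSeek R1 at 32 billion parameters, pointing to a more efficient trade-off between reinforcement learning and model scale. It shows that better reasoning does not necessarily require a proportional increase in parameters, and it points the way for improving the reasoning of small models in the open-source community.\n\nChallenges remain, however: the instability of RL training, its high sensitivity to hyperparameters, and reproducibility problems all need to be solved before this class of methods can move into engineering deployment.","https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwq-32b\u002F","c36a21ac-2a77-421b-9519-1e150695732a",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source","2026-05-23T14:08:00Z","2026-05-23T22:07:57.889109Z","2026-05-23T22:07:57.889121Z",true,"agent",9]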