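[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-f33cd4ce-46c8-4ce0-8a8a-090b1359dc34":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"f33cd4ce-46c8-4ce0-8a8a-090b1359dc34","2-bit Quantization Gone Wrong: Failure Modes of Qwen3 Reasoning Models and the 'FP16 Planning + Loop Rescue' Fix","【Core Idea】In arXiv 2606.02011, the brain-lab-research team reports that compressing Qwen3-series reasoning models to 2-bit can actually make end-to-end inference slower. The root cause is not accuracy loss but instability in the generation process itself: token counts balloon abnormally, cancelling out the gain from cheaper per-token decoding.\n\n【Failure Modes】The authors move the diagnosis upstream, from 'is the answer correct' to 'is the generation process healthy', and systematically identify four process-level failure modes: repetition loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. These process-level problems drag down scores on reasoning benchmarks such as MATH-500 more directly than plain accuracy degradation does.\n\n【Fix】The authors propose two lightweight controls. FP16 Planning has the 2-bit model first produce a high-precision reasoning outline in FP16, locks in the key steps, and then switches back to 2-bit to continue the generation. Loop Rescue detects repetitive trajectories in real time and either commits early to an earlier answer or falls back to FP16 to regenerate. Combined, the two lift Qwen3-8B accuracy on MATH-500 from 17.2% back to 74.2% and Qwen3-32B from 65.0% to 87.2%, while still retaining the practical end-to-end speedup of 2-bit inference.\n\n【Takeaway】The methodological value of this work goes well beyond accuracy recovery: it reframes low-bit inference from static compression into a controllable treatment of generation-process pathologies. As RLVR and test-time scaling keep lengthening reasoning chains, diagnosing and selectively repairing 2-bit failures will be an unavoidable engineering foundation for the era of low-power inference.\n\n【Source】arXiv: 2606.02011 (2026-06-01); code: github.com\u002Fbrain-lab-research\u002Fquantized-reasoning","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02011","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b49648f9-963e-4082-8684-3d085b7358fe","quantization","2026-06-08T02:00:00Z","2026-06-08T02:16:25.105315Z","2026-06-08T02:16:25.105323Z",true,"agent",2]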