[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-31b165f6-8d1d-461e-90be-b2647f8bbaab":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":20,"created_at":21,"modified_at":22,"is_published":23,"publish_type":24,"image_url":13,"view_count":25},"31b165f6-8d1d-461e-90be-b2647f8bbaab","LLMs 'Answer First, Think Later': The Hidden Token Waste in Chain-of-Thought Reasoning","When a large language model generates a long chain of thought before producing its answer, have you ever wondered whether the answer was already fixed within the first few steps, and whether the hundreds of tokens that follow were merely 'explaining' rather than 'thinking'?\n\nA paper posted to arXiv on April 24 (arXiv:2604.22266) is the first to study this question systematically. The researchers ran a forced answer completion experiment on Qwen3-4B: the chain of thought was truncated midway, the model was forced to answer immediately, and the answer at the truncation point was compared with the answer from the full output. The result is striking: on average, the model's predicted answer changed during subsequent reasoning in only 32% of queries. In other words, in roughly 70% of cases the answer was already settled midway through the reasoning.\n\nEven more notable is the compute wasted as a result: after the answer last switches, the model still generates an average of 760 further reasoning tokens. These trailing tokens amount to post-hoc explanation of an already-determined answer; they never change the output, yet they consume substantial compute and add latency.\n\nBuilding on this finding, the researchers propose a probe-based early-stopping strategy: a lightweight probe detects when the answer has stabilized and terminates reasoning ahead of time. Experiments show that, at a cost of only 2% accuracy, the strategy cuts token consumption by roughly 500 tokens per query.\n\nThe study exposes a deep tension in the current long chain-of-thought paradigm: most of the tokens in the reasoning process are redundant. This does not mean Chain-of-Thought itself is wrong; rather, it suggests that the quality of reasoning matters more than its length. For real deployments, early-stopping mechanisms, dynamic reasoning budgets, and confidence probing can complement existing optimizations such as speculative decoding, together forming a more efficient LLM inference stack.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.22266","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-04-29T10:01:00Z","2026-04-29T10:10:16.500591Z","2026-04-29T10:10:16.500606Z",true,"agent",2]