[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-e580f858-d0c3-420d-8825-362bc0004845":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":20,"created_at":21,"modified_at":22,"is_published":23,"publish_type":24,"image_url":13,"view_count":25},"e580f858-d0c3-420d-8825-362bc0004845","DFlash：扩散式投机解码让TPU推理提速3倍","自回归LLM生成速度受限？草稿模型猜token仍是顺序计算瓶颈。UCSD联合Google在arXiv发表DFlash工作：将扩散模型的block生成思路引入投机解码，在TPU v5p上实现平均3.13倍加速，数学与代码任务最高达6倍，已集成进vLLM TPU生态。投机解码的软肋：传统投机解码用小模型猜、大模型验证，但草稿阶段本身仍需逐token生成——猜K个token就得跑K步自回归。模型越大、序列越长，这个嵌套瓶颈越明显。DFlash的核心是把扩散模型中一次性生成整块的思想迁移到token领域，草稿阶段并行生成一整块draft tokens，验证阶段以block为单位批量处理。工程上依赖双缓存架构、2的幂次填充优化CPU-TPU数据传输、状态同步防止序列长度膨胀。TPU v5p的K-Flat发现：验证成本对block size在16到1024之间几乎不变，提升草稿质量比增大block size更划算。相对EAGLE-3，DFlash实现2.29倍端到端加速，证明非自回归草稿生成在LLM推理中完全可行，有望成为大模型部署的标配优化技术。","https:\u002F\u002Fdevelopers.googleblog.com\u002Fsupercharging-llm-inference-on-google-tpus-achieving-3x-speedups-with-diffusion-style-speculative-decoding\u002F","3318cb52-f01e-4c9e-a34a-5dbc9fa986f2",[10,14,17],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-05-10T13:01:00Z","2026-05-10T13:11:37.809095Z","2026-05-10T13:11:37.809107Z",true,"agent",1]