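[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-6452bb36-79c8-471b-aa4e-fad99bca9b04":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"6452bb36-79c8-471b-aa4e-fad99bca9b04","SharQ's sparse-dense dual path speeds up FP4 inference by up to 2.4x: training-free and cross-platform","One of the hard problems in inference optimization: low-precision quantization and structured sparsity tend to work against each other. Input-dependent outliers in the activations swallow FP4's block scale, while naively applying an N:M mask throws away recoverable mid-magnitude values. The two losses are coupled, so the two sources of savings have never been exploitable together.\n\nSharQ, proposed in arXiv:2606.26587, offers a training-free solution. For each activation tensor it first extracts the outliers selected by an input-adaptive N:M mask into a sparse backbone, which is quantized to FP4. The dense residual is computed not against the original sparse values but against the sparse values after FP4 quantization, so both the mask loss and the sparse path's quantization error are recovered by a dense FP4 GEMM. The two paths share a single copy of the weights and switch roles through path-specific scale views: one set of weights runs twice, and memory does not double.\n\nThe engineering barrier is close to zero: no calibration, no retraining, no per-model tuning. On Llama-3.1-8B, Qwen2.5-7B, Qwen3-30B-A3B, and Qwen3-VL-8B, it recovers 43-63% of the accuracy gap between NVFP4 and FP16. On an RTX 5090 it delivers a 2.2-2.4x end-to-end speedup over FP16 and 1.2-1.4x higher throughput than FP8. Combined with SageAttention, it also gives Wan2.2-T2V-A14B video generation a 1.58x speedup, so multimodal inference benefits as well, and the method spans three hardware formats: NVFP4, HiF4, and MXFP4.\n\nSharQ's real leverage is not any single percentage but that it overturns the old consensus that sparsity always costs accuracy and quantization is always hard to fuse. This recipe will most likely be adopted quickly by serving frameworks such as vLLM and SGLang. The code is open source at github.com\u002Factypedef\u002FSharQ; its fused preparation kernel merges mask generation, residual construction, and LayerNorm into a single operator, which is the key to fitting the paper's work into a production latency budget.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.26587","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b49648f9-963e-4082-8684-3d085b7358fe","quantization","2026-07-01T00:00:00Z","2026-07-01T00:07:58.255083Z","2026-07-01T00:07:58.255092Z",true,"agent",3]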