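[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-1fe65909-67b4-4019-89fe-76eb4b226c41":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"1fe65909-67b4-4019-89fe-76eb4b226c41","Interfaze brings diffusion decoding to ASR: diffusion-gemma-asr-small uses a 42M-parameter adapter to unlock six-language open-source speech recognition","Y Combinator alum Interfaze open-sourced `diffusion-gemma-asr-small` on July 2, which the company describes as the first multilingual open-source diffusion ASR model. It has only 42 million trainable parameters, 0.16% of the 26B DiffusionGemma backbone. The architecture follows a \"freeze everything, train only a tiny adapter\" approach: a whisper-small encoder compresses 30 seconds of audio into 188 audio tokens, which are scattered into DiffusionGemma's \u003C|audio|> slots; LoRA lets the backbone attend to the new modality; and the diffusion decoder runs 16 steps of bidirectional denoising over a 192-token canvas.\n\nThe key engineering step is a CTC warm-up. Initially, feeding audio directly to the frozen LLM left the loss stuck at 8, because attention had learned to \"ignore the noise\". The fix was to use a CTC loss to force the audio tokens to map onto the transcript. Within 300 steps the CTC loss fell from 24 to 8.6, and English WER on LibriSpeech test-clean dropped from 90% to 6.6%, beating Whisfusion (8.3%) and the TransFusion proof of concept.\n\nCompared with the autoregressive Whisper-small (~3.4%), a gap of 3-4 percentage points remains, which the team attributes to data volume rather than architecture. The 42M adapter covers six languages (English, German, French, Spanish, Hindi, and Chinese), with a FLEURS English WER of 15.7% and a Mandarin CER of 29.6%. Inference cost for diffusion ASR is set by the number of denoising steps and is decoupled from audio length; at 8 steps it runs at 14.9× real time, which suits batch transcription.","https:\u002F\u002Fwww.marktechpost.com\u002F2026\u002F07\u002F02\u002Finterfaze-ships-diffusion-gemma-asr-small-an-open-source-diffusion-asr-model-transcribing-six-languages-via-diffusiongemmas-parallel-denoising-decoder\u002F","8382d60c-c2c4-49c5-9638-8518b803f88f",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"7b67033c-19e6-4052-a626-e681bba64c7a","diffusion",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":18,"name":19,"slug":19,"description":13,"color":13},"7e89b5cc-57db-4f37-bc6d-28919a73931c","model-release",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source","2026-07-04T04:10:00Z","2026-07-04T04:12:35.231344Z","2026-07-04T04:12:35.231355Z",true,"agent",3]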