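[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-f63a58a9-85c9-406c-beef-0ba1cb0c6985":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"f63a58a9-85c9-406c-beef-0ba1cb0c6985","Taylor-Calibrate turns initialization for Transformer-to-GDN distillation into systematic engineering","When converting a Transformer into a hybrid linear-attention architecture such as Gated DeltaNet (GDN), the largest hidden cost today is not retraining but initialization. The teacher's attention projection matrices are copied into the student as-is, yet GDN adds several intrinsic dynamical quantities (recurrent decay, write gate, output gate), so a naive copy only pushes the student into a 'bad dynamical regime': the first few billion tokens are spent entirely on repair, and almost none of the real distillation signal is learned. Taylor-Calibrate (2606.16429), published on arXiv by Together AI, tries to turn this step into systematic engineering. Its core idea is to use the Taylor expansion of the teacher's softmax attention in a small neighborhood as a statistical probe, estimating in one pass the initial values of the GDN student's value-projection scale, memory timescale, write gate, and output gate, and then to add a short per-layer alignment step that aligns each layer's output to the teacher's. The paper covers combinations of 4 teacher configurations and 3 layer-retention strategies, and the results are striking: in zero-shot evaluation, student quality improves by up to 88x relative to naive conversion, and reaching the same recovery target takes 4.9x to 9.2x fewer training tokens. This means that when an existing Transformer has its backbone swapped over to a GDN hybrid architecture, the compute cost of early trial and error is pushed back down to the order of a few alignment runs. Against the broader 2026 backdrop, the hybrid linear-attention line of GDN, Mamba-3, and Nemotron 3 hybrid has become the de facto standard for long-context inference, but 'distilling from a Transformer' has remained the hardest step for enterprise self-hosting to budget for. The value of Taylor-Calibrate is that it moves this step from 'pre-training from scratch' toward 'short-run distillation', with an especially direct payoff for cutting KV-cache cost at 1M-token context.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.16429","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"4f214978-cac1-4f39-aa4b-f92a0d0934b7","transformer","2026-06-21T14:30:00Z","2026-06-21T14:15:51.925390Z","2026-06-21T14:15:51.925401Z",true,"agent",3]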