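[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-993c1a22-999d-42de-a202-3a1af5ec7ef8":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"993c1a22-999d-42de-a202-3a1af5ec7ef8","Xiaomi MiMo × TileRT: A Trillion-Parameter Model at 1000 tokens\u002Fs Redefines the Limits of General-Purpose GPUs","On June 8, 2026, Xiaomi MiMo and TileRT jointly released MiMo-V2.5-Pro-UltraSpeed, which for the first time pushes the generation speed of a trillion-parameter (1T) model to 1000 tokens\u002Fs, peaking at about 1200 tokens\u002Fs, on a single standard 8-GPU node of general-purpose GPUs. This is not another leaderboard-chasing press release but an explicit challenge to the prevailing approach to large-model inference.\n\nThe real value of 1000 tokens\u002Fs is that it turns speed into depth. Within the same waiting window, the model can try dozens of reasoning paths in parallel (Best-of-N, Tree Search), Coding Agent loop latency drops to the sub-second level, and scenarios such as high-frequency quantitative trading, anti-fraud, and real-time dialogue can finally bring a 1T flagship model into a millisecond-level decision loop.\n\nThe path to this result is not a larger cluster but extreme model-system codesign. On the model side, FP4 (MXFP4) quantization plus QAT is applied only to the MoE experts, while all other modules keep their original precision; DFlash block-level masked parallel prediction serves as the draft, aligns naturally with the sliding-window attention built into MiMo-V2, and reaches an acceptance length of 6.30 in coding scenarios; and the Muon second-order optimizer plus self-distillation push draft training to its limit. On the system side, TileRT uses a Persistent Engine Kernel plus Warp Specialization to cut operator-boundary overhead down to the microsecond level, with custom compute kernels tailored to the quantization and speculative decoding pipeline.\n\nCompared with Cerebras's wafer-scale integration and Groq's custom chips with on-chip SRAM, this general-purpose GPU plus codesign route avoids an exorbitant hardware barrier to entry. The FP4 QAT model MiMo-V2.5-Pro-FP4-DFlash has been open-sourced on HuggingFace, so the community can reproduce the results directly. With a 1T model running at 1000 tokens\u002Fs on general-purpose GPUs, the claim that frontier models can only run in big-tech data centers deserves a question mark from today on.","https:\u002F\u002Fmimo.xiaomi.com\u002Fzh\u002Fblog\u002Fmimo-tilert-1000tps","581853c1-b1f6-420b-9124-243143660e92",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b49648f9-963e-4082-8684-3d085b7358fe","quantization","2026-06-09T06:00:00Z","2026-06-09T06:08:22.080953Z","2026-06-09T06:08:22.080980Z",true,"agent",3]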