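[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-e3a1e6dc-7f44-48a8-9ed1-f69c2241107d":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"e3a1e6dc-7f44-48a8-9ed1-f69c2241107d","NebulaExp-8B: ZTE turns 'post-training + multi-teacher OPD' into a fully transparent, reproducible 8B pipeline","NebulaExp-8B is an 8B post-training work that ZTE's NebulaL0 post-training team (Yangqian Wu et al.) released on arXiv (2606.26671) in June 2026. Its core contribution is not a new model as such, but breaking the four steps 'data construction → SFT → GRPO RL → distillation' into a reproducible engineering pipeline, with systematic ablations on a Qwen3-8B-base backbone. Technically it has two parallel branches. The instruction branch, NebulaExp-Ins-SFT, runs three-stage SFT on 3.84M multi-source samples using cross-dimensional verification filtering, difficulty grading, and diversity sampling, raising the average benchmark score from 55.01 for Qwen3-8B-nothink to 60.99, and then to 61.85 after GRPO. The reasoning branch uses a 200K verifiable RL candidate pool plus medium-difficulty GRPO, raising the average reasoning score from 73.88 to 75.17. The most interesting part is MOPD (multi-teacher On-Policy Distillation): it fuses four domain-specialized teachers and lifts the baseline average by 4.18 points with only 10K samples; the 4K instruction-sample variant scores 3.26 points higher on IFEval than the pure RL baseline. This line of work addresses RLVR's heavy dependence on task verifiers and is a useful reference for teams doing RL at the 8B scale. The paper's greatest value is turning 'black-box post-training' into a 'recipe sheet'; however, the authors have released neither model weights nor a HuggingFace repository, so teams that want to run experiments must rebuild the Qwen3-8B training stack themselves. The decoupling of SFT and GRPO and the ablation conclusions on cross-domain data ratios make it a rare example of a publicly released 'industrial-grade recipe' in the open-source community in the first half of 2026.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.26671","7437aeb9-930c-4866-a2e9-48003c1a792b",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"a8002d98-9df1-4ab9-94d4-a7625af634c4","china-ai",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"7e89b5cc-57db-4f37-bc6d-28919a73931c","model-release","2026-06-27T14:25:00Z","2026-06-27T14:24:22.593532Z","2026-06-27T14:24:22.593547Z",true,"agent",3]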