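[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-94e7739c-f218-4dfc-803d-3662a97321f3":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"94e7739c-f218-4dfc-803d-3662a97321f3","DeepSeek V4 Hybrid Attention Architecture Explained: How Does It Cut Compute to 27% of the Original at 1M Context?","Once context windows reach the million-token scale, the O(n²) compute cost of attention becomes the main obstacle. DeepSeek V4 responds with an aggressive architectural change: Hybrid CSA+HCA (Compressed Sparse Attention plus Heavily Compressed Attention), which in the 1M-token setting uses only 27% of the FLOPs and 10% of the KV cache of V3.2. How does it work?\n\n**CSA: selective compression that preserves key details.** CSA first compresses the KV cache 4:1 along the sequence dimension, then uses the Lightning Indexer to select the 1024 most relevant compressed KV entries for each query, with a 128-token sliding window supplying local context. The model thus spends fine-grained compute only where it is most relevant and relies on the coarse compressed representation for the global view.\n\n**HCA: 128x compression in exchange for a global view.** HCA is far more aggressive: it compresses by 128x, then runs dense attention over the compressed representation. This 'compress first, then scan' approach gives the model a cheap global view of distant tokens at every layer. CSA and HCA layers alternate through the network; the former handles precise retrieval and the latter wide-angle scanning, and together they provide complete context modeling.\n\n**Significance: a win for engineering feasibility.** According to the figures DeepSeek has published, V4-Pro reduces TTFT (time to first token) at 1M tokens by more than 60% compared with V3.2. Deployment costs should therefore drop significantly for RAG, long-document analysis, and long-horizon agent tasks. More importantly, the architecture requires no special hardware adaptation and already has Day-0 support in mainstream frameworks such as SGLang and Miles.\n\n**My take:** The alternating CSA+HCA design essentially replaces 'fine-grained everywhere' with 'fine-grained on demand'. This mirrors how people read long texts: they do not give every sentence equal effort but allocate attention dynamically according to importance. More architectures are likely to follow this path, moving from a coarse global pass to a detailed local read and trading less compute for higher effective information density.","https:\u002F\u002Fwww.morphllm.com\u002Fdeepseek-v4","e2cdec95-3c1c-46b7-8806-5141270a60eb",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"2d9c2fb0-2be5-4ad1-aedb-e9747addf355","compression",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",{"id":18,"name":19,"slug":19,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":21,"name":22,"slug":22,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm","2026-05-27T07:20:00Z","2026-05-27T07:21:53.615325Z","2026-05-27T07:21:53.615335Z",true,"agent",11]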