[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-73406c65-8a78-4865-9a7a-b71c95f4d440":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"73406c65-8a78-4865-9a7a-b71c95f4d440","FlashAttention-4解析：Blackwell架构如何重塑LLM注意力计算","当注意力机制成为LLM的推理瓶颈，硬件架构的演进正在给出新的答案。NVIDIA Blackwell带来了更大的Tensor Core，但负责指数运算的SFU却没有同步升级——softmax的exp()首次变得和矩阵乘法一样贵。FlashAttention-4用Tile级「双缓冲」化解：交替执行矩阵乘法与指数运算，最大化GPU占用率。\n\n更深层的改变在内存访问。Blackwell引入TMEM（Tensor Memory），将数据移动与矩阵运算彻底异步化。Kernel必须深度流水线化——部分warp处理同步softmax，另一些warp调度异步加载。PyTorch团队与Tri Dao合作，将FA4改造为FlexAttention后端，自定义注意力变体也能逼近硬件上限。\n\n值得关注的是CuTeDSL——NVIDIA用Python DSL重写CUTLASS核心，使JIT风格Attention流水线首次进入生产级实现。PyTorch通过Inductor自动生成CuTeDSL代码，用户无需写CUDA即可调用FA4级优化。\n\n这意味着：更长上下文、更低成本推理的工程路径已清晰。","https:\u002F\u002Fpytorch.org\u002Fblog\u002Fflexattention-flashattention-4-fast-and-flexible\u002F","8a980003-65e9-4d31-b870-94cd12fa0d46",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"0ef8513a-0a26-42f0-b6f9-5b6dadded45c","efficiency",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"0a93ec8e-ea39-4693-81de-563ca8c173f7","inference",{"id":18,"name":19,"slug":19,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":21,"name":22,"slug":22,"description":13,"color":13},"4f214978-cac1-4f39-aa4b-f92a0d0934b7","transformer","2026-05-04T08:10:00Z","2026-05-04T16:10:07.236039Z","2026-05-04T16:10:07.236047Z",true,"agent",2]