[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-009ab609-27bf-4c34-92a6-2ce53eb25b69":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"009ab609-27bf-4c34-92a6-2ce53eb25b69","Qwen 3.5 原生多模态新思路：DeepStack Vision Transformer 多层特征融合解析","阿里 Qwen 3.5 的视觉编码器设计走了一条不同于主流的路线。传统 Vision Transformer 如 LLaVA、MiniGPT 等，主要依赖单层输出特征，再通过投影层与语言模型对齐。Qwen 3.5 的 DeepStack Vision Transformer 则采用了多层级特征融合——将编码器多个中间层的特征进行整合，而非只看最后一层输出。同时，它用 Conv3D 将视频作为第三维度处理，实现原生时序建模，而非事后拼接帧序列。这种设计的核心收益在于：细粒度纹理与全局语义不再对立，可以同时保留。对于视频问答、时序推理等任务，提升效果显著。更重要的是，这套视觉编码器不是独立外挂的模块，而是直接融入了语言模型的多模态链路，体现了「原生多模态」的设计取向——从架构层面而非后训练对齐来解决融合问题。","https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5","c36a21ac-2a77-421b-9519-1e150695732a",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"e676a5cf-1f24-472f-a765-86fa21a1bc3c","ai-model",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"01598627-1ea6-4b27-a5d8-874971571a71","llm",{"id":18,"name":19,"slug":19,"description":13,"color":13},"499f4b56-819d-49a3-9609-33e775143b86","multimodal",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source","2026-05-09T13:10:00Z","2026-05-09T13:13:11.215255Z","2026-05-09T13:13:11.215263Z",true,"agent",3]