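[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"news-8482714d-e5fa-4a04-a810-199d2582e7b0":3},{"id":4,"title":5,"summary":6,"original_url":7,"source_id":8,"tags":9,"published_at":23,"created_at":24,"modified_at":25,"is_published":26,"publish_type":27,"image_url":13,"view_count":28},"8482714d-e5fa-4a04-a810-199d2582e7b0","VLX-Seek swaps 'coordinate generation' for 'region reference': a 3B VLM goes head-to-head with Gemini 3.1 Pro on fine-grained perception","VLX-Seek, open-sourced by Om AI Lab on Hugging Face, targets the weakness of general-purpose VLMs in fine-grained localization. It replaces the conventional approach of having the LLM emit [x1,y1,x2,y2] directly with a Region Reference mechanism that performs semantic retrieval among candidate regions.\n\nThe core component, HFRE (Hybrid Fine-grained Region Encoder), uses a dual-path structure, a semantically aligned primary encoder plus a high-resolution detail auxiliary encoder, so that each region token carries both global semantics and local detail. It is paired with an Omni Proposal Network that generates candidate regions, two-stage training (region-language alignment followed by perception instruction tuning), and 'target absent' rejection samples.\n\nVLX-Seek-3B overtakes rivals of the same or larger size on several benchmarks: 45.3 mAP on COCO object detection (Gemini 3.1 Pro 41.4, Qwen2.5-VL-7B 17.7); 43.7 on the open-vocabulary OVDEval benchmark; an 88.7 average on RefCOCO (Gemini 3 Pro 84.1, Qwen3-VL-8B 88.2); and 85.0 on PixMo-Count counting (Gemini 2.5 Pro 73.8).\n\nThe real significance of this paradigm lies not in the benchmark scores but in promoting the region to a 'vision-language entity': detection, referring, counting, and region question answering are unified within a single token framework for the first time. Shorter region indices also mean lower decoding overhead, letting robots and edge devices keep running fine-grained perception under constrained compute. Code and weights are open-sourced at github.com\u002Fom-ai-lab\u002FVLX-Seek.","https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fomlab\u002Fvlx-seek","24d5c6c5-6573-4180-a1fd-f1459842d1af",[10,14,17,20],{"id":11,"name":12,"slug":12,"description":13,"color":13},"5e628969-6d2a-437f-998a-104e4b16cfb1","ai-progress",null,{"id":15,"name":16,"slug":16,"description":13,"color":13},"120fa59a-ff6f-4537-9bf5-f818df636a0e","benchmark",{"id":18,"name":19,"slug":19,"description":13,"color":13},"499f4b56-819d-49a3-9609-33e775143b86","multimodal",{"id":21,"name":22,"slug":22,"description":13,"color":13},"b9bd9039-fcdb-41a8-b85b-fc1587def2b9","open-source","2026-06-28T06:01:00Z","2026-06-28T06:15:54.778819Z","2026-06-28T06:15:54.778827Z",true,"agent",3]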