Molmo 2 Unlocks Spatial and Temporal Understanding for Video and Images
- Molmo 2 introduces spatial and temporal reasoning across video, image, and multi-image inputs.
- New capabilities include video pointing, object tracking, and frame-level grounding with open access to models and tools.
Ai2, a nonprofit AI research institute focused on open and transparent models, has released Molmo 2, its most advanced multimodal system yet. Designed to process video, image, and multi-image inputs, Molmo 2 adds spatial and temporal awareness, allowing systems to understand not just what is happening, but precisely where and when it occurs.
Source: YouTube / Ai2
Molmo 2 introduces new capabilities including video pointing, multi-frame reasoning, and advanced object tracking. Ai2 reports that it uses fewer parameters than last year’s model yet delivers stronger performance, and that it outperforms even proprietary systems like Gemini 3 on several real-world tasks. According to the institute, Molmo 2 ranks at the top of public benchmarks like MVBench and NExT-QA, which measure how well models understand short videos and handle visual tasks. Even the smaller version outperforms much larger open models when tested on image-based reasoning. The model can also generate dense, long-form video captions and flag unusual events in extended footage. For every frame, Molmo 2 can return exact object positions and timestamps, allowing it to track what’s happening, where, and when, even in fast or complex scenes.
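To make the idea of per-frame positions and timestamps concrete, here is a minimal Python sketch of how a downstream application might organize that kind of output into object tracks. This is not Molmo 2's actual API or output format, which the announcement does not describe: the `PointObservation` record, its field names, the normalized coordinates, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PointObservation:
    """One pointed-at object in one frame (hypothetical record layout)."""
    object_id: str      # hypothetical label linking the same object across frames
    timestamp_s: float  # time of the frame in seconds
    x: float            # horizontal position, assumed normalized to [0, 1]
    y: float            # vertical position, assumed normalized to [0, 1]


def build_tracks(observations):
    """Group observations by object and order each group by time."""
    tracks = defaultdict(list)
    for obs in observations:
        tracks[obs.object_id].append(obs)
    return {oid: sorted(points, key=lambda p: p.timestamp_s)
            for oid, points in tracks.items()}


def time_span(track):
    """Return (first_seen, last_seen) in seconds for one time-ordered track."""
    return track[0].timestamp_s, track[-1].timestamp_s


# Made-up sample data for illustration only, not real model output.
observations = [
    PointObservation("ball", 1.0, 0.40, 0.55),
    PointObservation("dog", 0.5, 0.20, 0.70),
    PointObservation("ball", 0.5, 0.35, 0.60),
    PointObservation("dog", 1.0, 0.25, 0.68),
]

tracks = build_tracks(observations)
```

Grouping by object and sorting by time gives each object a "where and when" history, which is the kind of structure a robotics or video-analysis pipeline would likely need before reasoning about motion.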
Ai2 has designed the models for real-world applications in robotics, scientific research, assistive tech, and automation where video understanding and accuracy matter. It is releasing Molmo 2 in three variants, including one built entirely on its open Olmo architecture. All models, datasets, and evaluation tools are freely available on GitHub, Hugging Face, and the Ai2 Playground, with training code coming soon.
🌀 Tom’s Take:
Spatial and temporal understanding is critical for AI systems that operate in the real world. It enables tracking and reasoning over time, essential for safe, reliable performance in robotics, automation, and scientific work.
Source: Businesswire / Ai2