Fresh off releasing the latest version of its Olmo foundation model, the Allen Institute for AI (Ai2) launched its open-source video model, Molmo 2, on Tuesday, aiming to show that smaller, open models can be viable options for enterprises focused on video understanding and analysis.
In a press release, the company said Molmo 2 “takes Molmo’s strengths in grounded vision and expands them to video and multi-image understanding,” a capability that has largely been dominated by larger proprietary models.
Ai2 released three variants of Molmo 2:
- Molmo 2 8B, a Qwen-3–based model that Ai2 describes as its “best overall model for video grounding and QA”
- Molmo 2 4B, designed for more efficient deployments
- Molmo 2-O 7B, built on the Olmo model
Molmo 2 supports single-image and multi-image inputs, as well as video clips of different lengths, enabling tasks such as video grounding, tracking, and question answering.
“One of our core design goals was to close a major gap in open models: grounding,” Ai2 said in its press release.
The company first introduced the Molmo family of open multimodal models last year, beginning with images. Ai2 said Molmo 2 surpasses previous versions in accuracy, temporal understanding, and pixel-level grounding, and in some cases performs competitively with larger models such as Google’s Gemini 3.
How Molmo 2 compares
Despite their smaller size, the Molmo 2 models outperformed Gemini 3 Pro and other open-weight competitors on video tracking benchmarks.
For image and multi-image reasoning, Ai2 said Molmo 2 8B “leads all open-weight models, with the 4B variant close behind.” The 8B and 4B models also showed strong performance in the open-weight Elo human preference evaluation, though Ai2 noted that larger proprietary models continue to lead that benchmark overall.
But Molmo 2’s biggest gains are in video grounding and video counting, where it outscores similar open-weight models.
“These results highlight both progress and remaining headroom — video grounding is still hard, and no model yet reaches 40% accuracy,” Ai2 said, referring to current benchmarks.
Prominent video models such as Google’s Veo 3.1 and OpenAI’s Sora are very large and built for generation. Molmo 2 targets a different tradeoff: smaller, open models optimized for grounding and analysis rather than video generation.
