Ai2 Launches Open-Source Model That Turns Images Into 3D Robot Plans

- The Allen Institute for AI (Ai2) has released MolmoAct, an open-source AI model that plans robot actions by converting 2D images into 3D spatial movement plans.
- Trained on 12,000 robot demonstrations, it uses less compute than many systems and is fully released with code, data, and benchmarks.
The Allen Institute for AI (Ai2) has introduced MolmoAct 7B, a new “Action Reasoning Model” designed for controlling robots in real-world environments. Instead of just following written or spoken commands, it looks through the robot’s camera, understands the space around it, and makes step-by-step plans to get the job done. To do this, the model emits “visual reasoning tokens” that lift 2D camera images into 3D spatial plans, showing the robot exactly how to move.
MolmoAct was trained entirely on open data, including about 12,000 “robot episodes” recorded in environments such as kitchens and bedrooms. Each episode shows a robot performing a task, like sorting items or putting away laundry, and is broken down into a sequence of spatially grounded decisions. Using 3D perception, visual waypoint planning, and action decoding, the model can turn a command like “Sort this trash pile” into smaller steps: finding items, grouping them, and moving them one by one. Users can see the robot’s planned movements before it acts and adjust them in real time using natural language or quick sketches on a touchscreen. This provides fine-grained control and helps maintain safety in environments like homes, hospitals, and warehouses.
The model is smaller than many commercial systems and was trained with relatively modest compute: 18 million samples on 256 NVIDIA H100 GPUs for 24 hours, followed by two hours of fine-tuning on 64 GPUs. Despite that, it achieved a 71.9% success rate on the SimPLER benchmark, outperforming some larger models.
MolmoAct is fully open source. Ai2 has published everything needed to use, reproduce, and improve the system, including training code, datasets, model checkpoints, and evaluation tools, so developers and researchers can adapt it for new robots, environments, and tasks.
Source: YouTube / Ai2
🌀 Tom’s Take:
If you can draw it, you can teach it. MolmoAct turns visual input, whether from a camera or a quick sketch, into a full 3D plan for robots. Making it open source allows that capability to be adapted to many different robot platforms.
Source: Business Wire / Ai2