Google DeepMind Pushes Robots Toward Real-World Autonomy With Embodied Reasoning Upgrade
- The model improves how robots interpret physical environments by strengthening spatial reasoning, multi-view understanding, and task planning.
- It introduces instrument reading and enhances core functions like pointing, counting, and task completion detection, with measured safety gains over earlier models and Gemini 3.0 Flash.
Google DeepMind has released Gemini Robotics-ER 1.6, an upgrade to its embodied reasoning model for robotics. The system is designed to help robots move beyond simple instruction-following by enabling a deeper understanding of physical environments. It operates as a high-level reasoning layer, capable of coordinating tools like search, vision-language-action systems, and external functions to carry out tasks.
The update expands core capabilities across pointing, counting, and detecting task completion. For pointing, the model can mark multiple objects in a scene, support comparisons, and express relationships such as moving an item from one location to another, while declining to mark objects that are not present. For counting, it reuses those spatial markers, identifying each item in an image and tallying the markers to produce a quantity. For task completion, it judges whether a task is finished or needs another attempt by interpreting visual inputs, including fusing information from multiple camera views.
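On the client side, pointing and counting reduce to post-processing the spatial markers the model returns. The sketch below is illustrative only: the JSON shape, the `point`/`label` field names, and the 0–1000 normalized `[y, x]` coordinate convention are assumptions based on common Gemini API output conventions, not details confirmed by the article.

```python
import json

# Hypothetical model output for a pointing query such as
# "Point to every mug on the table."
raw_response = """
[
  {"point": [412, 180], "label": "mug"},
  {"point": [455, 630], "label": "mug"},
  {"point": [710, 345], "label": "mug"}
]
"""

def to_pixels(points, width, height):
    """Convert normalized [y, x] points (0-1000 grid) to (x, y) pixels."""
    return [
        (round(p["point"][1] / 1000 * width),
         round(p["point"][0] / 1000 * height))
        for p in points
    ]

points = json.loads(raw_response)
pixel_coords = to_pixels(points, width=1280, height=720)

# Counting falls out of the same markers: one point per detected object.
mug_count = sum(1 for p in points if p["label"] == "mug")
```

Keeping coordinates normalized until the last step means the same response works regardless of the camera resolution the robot happens to use.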
The update also adds instrument reading, allowing robots to interpret gauges, liquid levels, and digital displays. This capability was developed through work with Boston Dynamics, where robots capture images of industrial instruments and extract readings from them.
Source: YouTube / Boston Dynamics
The DeepMind team says the model improves on earlier versions in following safety constraints and shows stronger hazard recognition than Gemini 3.0 Flash in both text and visual evaluations. It is now available to developers through the Gemini API and Google AI Studio, with supporting examples for implementation.
🌀 Tom’s Take:
The ability to point, count, and decide if a task is done, especially across messy, real-world inputs like gauges and multiple camera views, is what turns a system from reactive to autonomous.
Source: Google DeepMind