Google Unveils Vision-to-Action AI Models to Power Next-Gen Robots

- Google introduced Gemini Robotics 1.5 (VLA) and Gemini Robotics-ER 1.5 (VLM) to combine high-level reasoning with vision-guided physical action in real-world robotic tasks.
- The models coordinate planning and action across different robot types, with Gemini Robotics-ER 1.5 now available via the Gemini API and Gemini Robotics 1.5 offered to select partners.

Google has introduced two robotics-focused models, Gemini Robotics 1.5 and Gemini Robotics-ER 1.5, designed to bring high-level reasoning and vision-guided physical action into real-world environments. Built on the Gemini architecture, the models form an agentic system that links perception, planning, and action.
Gemini Robotics-ER 1.5, a vision-language model (VLM), handles high-level decision-making such as generating step sequences and retrieving online information. It passes instructions to Gemini Robotics 1.5, a vision-language-action model (VLA), which interprets visual input to perform those steps through motor control. Working together, the models support tasks that require both semantic reasoning and precise physical execution, such as object sorting and tool use.
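The announcement does not include code, but the division of labor it describes maps onto a simple orchestration loop: a reasoning model turns an instruction into ordered steps, and an action model executes each step against the robot's camera and motors. The sketch below is purely illustrative; `plan_steps`, `act`, and `RobotInterface` are hypothetical stand-ins, not part of Google's API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class RobotInterface(Protocol):
    """Hypothetical robot abstraction: camera frames in, motor commands out."""

    def capture_frame(self) -> bytes: ...
    def send_motor_commands(self, commands: list[float]) -> None: ...


@dataclass
class Orchestrator:
    """Illustrative VLM -> VLA handoff, mirroring the described agentic loop."""

    planner: Any  # stands in for the reasoning model (an ER-style VLM)
    actor: Any    # stands in for the action model (a VLA driving the robot)

    def run(self, instruction: str, robot: RobotInterface) -> None:
        # 1. High-level reasoning: break the instruction into ordered steps.
        steps: list[str] = self.planner.plan_steps(instruction)

        # 2. Vision-guided execution: for each step, the action model reads the
        #    current camera frame and emits motor commands to carry it out.
        for step in steps:
            frame = robot.capture_frame()
            commands = self.actor.act(step, frame)
            robot.send_motor_commands(commands)
```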
Both models can operate across different robotic platforms. Gemini Robotics 1.5 can transfer learned behaviors between robots without requiring a specialized model for each one, supporting embodiments as varied as the ALOHA 2 robot, Apptronik’s humanoid Apollo, and the bi-arm Franka robot.
Gemini Robotics-ER 1.5 is now accessible to developers through the Gemini API in Google AI Studio. Gemini Robotics 1.5 is currently available to select partners.
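For developers trying ER 1.5 through the Gemini API, a minimal request could look like the following sketch using the `google-genai` Python SDK. The model ID string, image path, and prompt are assumptions for illustration; check Google AI Studio for the exact identifier and supported inputs.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Read a workspace image for the model to reason over (path is illustrative).
with open("workspace.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    # Model ID assumed; confirm the exact identifier in Google AI Studio.
    model="gemini-robotics-er-1.5-preview",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "List the steps needed to sort the blocks on the table by color.",
    ],
)

print(response.text)
```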
Source: YouTube / Google DeepMind
🌀 Tom’s Take:
Google’s latest release focuses on foundational models that aim to let robots of all types act without much human intervention or hard-coding.
Source: Google DeepMind