Meta AI’s Locate 3D Locates Objects from Natural Language Prompts

Source: Meta AI / FAIR at Meta
  • Meta AI's new model turns language prompts into 3D object locations.
  • Localizes objects directly from camera and depth-sensor data, with no 3D models or human labels needed.

Meta AI’s Locate 3D helps computers understand simple instructions like “the dresser in the hallway” and pinpoint where that object is in 3D space. Instead of relying on detailed 3D models or human markup, it directly interprets what a camera and depth sensor observe, making it suitable for use in robotics and AR devices.
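
In practice, that amounts to a simple input/output contract: posed RGB-D frames plus a free-form text query go in, and a 3D location comes out. The Python sketch below illustrates only that contract; every name and type here is an illustrative assumption, not Meta's released API.

from dataclasses import dataclass
import numpy as np

@dataclass
class RGBDFrame:
    rgb: np.ndarray      # (H, W, 3) color image
    depth: np.ndarray    # (H, W) depth map in meters
    pose: np.ndarray     # (4, 4) camera-to-world transform

@dataclass
class Localization:
    center: np.ndarray   # (3,) object center in world coordinates
    extent: np.ndarray   # (3,) bounding-box size along each axis
    score: float         # model confidence

def locate(frames: list[RGBDFrame], query: str) -> Localization:
    """Stand-in for the model call: posed RGB-D observations plus a
    natural-language query in, a 3D localization out. Note that no meshes
    or human annotations are part of the input."""
    raise NotImplementedError("placeholder, not the released model")

# Intended call shape:
# box = locate(scan_frames, "the dresser in the hallway")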

At the heart of Locate 3D is a system called 3D-JEPA, which learns to understand 3D scenes by filling in deliberately hidden parts of them. It starts with visual features lifted from 2D models like CLIP and DINO and then learns how different parts of a space relate to each other by predicting the hidden regions. Once trained, the model connects this scene understanding with language to highlight the right object; no pre-made 3D models or human labels are needed.
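
To make the "filling in hidden parts" idea concrete, here is a toy sketch in PyTorch of JEPA-style masked prediction over a featurized point cloud. It illustrates the general technique, not Meta's code: the point features are random stand-ins for features lifted from CLIP/DINO, the target encoder is a frozen projection rather than the EMA copy a real JEPA setup would use, and the network sizes are arbitrary.

import torch
import torch.nn as nn

N_POINTS, FEAT_DIM, LATENT_DIM = 512, 64, 128

class MaskedFeaturePredictor(nn.Module):
    """Predicts latent features of masked points from the visible ones."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(3 + FEAT_DIM, LATENT_DIM), nn.ReLU(),
            nn.Linear(LATENT_DIM, LATENT_DIM))
        self.attn = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)
        # Stand-in for the target encoder; JEPA-style training would use an
        # EMA copy of the context encoder instead of a fixed projection.
        self.target_proj = nn.Linear(3 + FEAT_DIM, LATENT_DIM)

    def forward(self, xyz, feats, mask):
        # Masked points keep their coordinates but lose their lifted features.
        visible = feats * (~mask).unsqueeze(-1)
        tokens = self.encode(torch.cat([xyz, visible], dim=-1))         # (B, N, D)
        pred, _ = self.attn(tokens, tokens, tokens)                      # gather context per point
        with torch.no_grad():
            target = self.target_proj(torch.cat([xyz, feats], dim=-1))  # latent target, full features
        return pred, target

# One training step on synthetic data: random positions and random "lifted" features.
xyz = torch.rand(2, N_POINTS, 3)
feats = torch.rand(2, N_POINTS, FEAT_DIM)   # placeholder for CLIP/DINO-derived point features
mask = torch.rand(2, N_POINTS) < 0.3        # hide roughly 30% of the points
model = MaskedFeaturePredictor()
pred, target = model(xyz, feats, mask)
loss = ((pred - target) ** 2)[mask].mean()  # reconstruct only the masked points, in latent space
loss.backward()
print(f"masked-prediction loss: {loss.item():.4f}")

In Locate 3D itself, this self-supervised stage is followed by a language-conditioned decoder that uses the learned scene representation to highlight the queried object; the sketch covers only the pretraining idea.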

According to the paper, Locate 3D was tested on mobile robots and benchmark datasets and consistently outperformed other systems. In real-world trials, it successfully located and guided the retrieval of objects across unfamiliar indoor environments.


🌀 Tom's Take:

Semantic spatial understanding is a key ingredient for next-level mixed reality experiences and a necessary skill for robots. Identifying parts of our world enables computers not only to find things but, more importantly, to act on them.


Disclosure: Tom Emrich has previously worked with or holds interests in companies mentioned. His commentary is based solely on public information and reflects his personal views.
