Meta Unveils Locate 3D, a Self-Supervised AI That Understands Object References in the Real World

- New model understands phrases like “the coffee table between the sofa and the lamp” and finds the object in 3D space.
- Powered by a self-supervised 3D learning method, enabling real-world use in robots and AR devices.
Meta AI has introduced Locate 3D, a cutting-edge model designed to identify real-world objects in 3D space using natural language prompts. Given a phrase like “the small coffee table between the sofa and the lamp,” the system pinpoints the referenced object in a scene captured by RGB-D sensors, with no manual labels required for its self-supervised pretraining.
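
To make the input side concrete, here is a small Python sketch of the kind of data such a system consumes: a depth frame back-projected into a colored point cloud that a language query can then be run against. The back-projection math is standard pinhole-camera geometry; the `locate_3d.localize` call at the end is a hypothetical stand-in for illustration, not Meta's released interface.

```python
# Sketch of the RGB-D input pipeline feeding a model like this. The
# back-projection is standard; the commented-out model call is hypothetical.
import numpy as np

def rgbd_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Convert a depth map (meters) + RGB image into an (N, 6) xyz+rgb cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    valid = z > 0                                   # drop pixels with missing depth
    xyz = np.stack([x, y, z], axis=-1)[valid]
    colors = rgb[valid] / 255.0
    return np.concatenate([xyz, colors], axis=-1)

# Toy frame standing in for a real RGB-D capture
depth = np.full((480, 640), 2.0)
rgb = np.random.randint(0, 255, (480, 640, 3))
cloud = rgbd_to_point_cloud(depth, rgb, fx=525.0, fy=525.0, cx=319.5, cy=239.5)

# Hypothetical query call: returns a per-point mask and a 3D box for the target
# mask, box = locate_3d.localize(cloud, "the small coffee table between the sofa and the lamp")
```
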
At the core of Locate 3D is 3D-JEPA, a new self-supervised learning (SSL) algorithm for 3D point clouds. It featurizes point clouds using 2D foundation models such as CLIP and DINO, then applies masked prediction in latent space to learn rich contextual representations. Once pretrained, the model is finetuned with a language-conditioned decoder that outputs both 3D masks and bounding boxes.
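
For readers who want the intuition, the sketch below shows the general masked-latent-prediction recipe behind JEPA-style training: a context encoder sees a partially masked, featurized point cloud, a small predictor fills in the latents of the hidden points, and an EMA target encoder supplies the prediction targets. This is an illustrative PyTorch toy, not Meta's 3D-JEPA code; the per-point MLP, dimensions, and masking strategy are all assumptions.

```python
# Toy example of masked prediction in latent space (JEPA-style), not 3D-JEPA itself.
import torch
import torch.nn as nn

D_IN, D_LAT = 768, 256   # assumed lifted CLIP/DINO feature size -> latent size

class PointEncoder(nn.Module):
    """Per-point MLP standing in for a real transformer-style 3D encoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_IN, 512), nn.GELU(), nn.Linear(512, D_LAT))
    def forward(self, feats):                  # feats: (B, N, D_IN)
        return self.net(feats)                 # -> (B, N, D_LAT)

context_enc = PointEncoder()                   # sees only the unmasked points
target_enc = PointEncoder()                    # frozen EMA copy providing targets
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(D_LAT, D_LAT)            # predicts masked latents from context

opt = torch.optim.AdamW(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-4)

feats = torch.randn(2, 1024, D_IN)             # 2 scenes x 1024 points, already featurized
mask = torch.rand(2, 1024) < 0.5               # half the points are hidden from the context

with torch.no_grad():
    targets = target_enc(feats)                # latent targets for every point

ctx_feats = feats.clone()
ctx_feats[mask] = 0.0                          # crude masking of the hidden points' features
pred = predictor(context_enc(ctx_feats))       # predict latents for all points

opt.zero_grad()
loss = nn.functional.mse_loss(pred[mask], targets[mask])  # score only the masked points
loss.backward()
opt.step()

# The target encoder is then updated as an exponential moving average of the
# context encoder after each step.
tau = 0.99
with torch.no_grad():
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.mul_(tau).add_(p_c, alpha=1 - tau)
```
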
To support development and benchmarking, Meta also introduced the Locate 3D Dataset, which includes over 130,000 human-annotated expressions across diverse capture setups — enabling rigorous testing of generalization across environments.
Locate 3D achieves state-of-the-art results on referential grounding benchmarks and is designed to run directly on sensor streams, making it practical for real-time deployment in AR headsets and robotic systems.
🌀 Remix Reality Take:
Locate 3D is what happens when foundation models meet the physical world. By linking natural language to live 3D scenes, Meta’s model turns perception into interaction — and static spaces into responsive environments.
Source: Meta AI