Apple’s SHARP Turns a Single Image into a 3D Scene in Under a Second

  • SHARP creates a detailed 3D scene from one photo using a fast neural network that runs in real time on standard hardware.
  • The method delivers sharper, faster, and more accurate results than previous systems, with no extra training needed.

Apple has released SHARP, a new neural network that turns a single photo into a detailed 3D scene. It builds a representation made of 3D Gaussian elements that capture the scene’s surfaces, depth, and lighting at real-world scale. The scene is predicted in under a second and renders at over 100 frames per second on standard hardware, producing high-resolution, photorealistic views from nearby viewpoints.
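
For context, a 3D-Gaussian scene representation is essentially a large array of per-primitive parameters. Here is a minimal sketch of the attribute layout used by standard 3D Gaussian splatting; the field names are illustrative, and SHARP’s exact output schema may differ:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianScene:
    """Per-primitive attributes in a standard 3D Gaussian splatting scene."""
    means: np.ndarray      # (N, 3) world-space centers, at metric (real-world) scale
    scales: np.ndarray     # (N, 3) per-axis extent of each Gaussian
    rotations: np.ndarray  # (N, 4) unit quaternions orienting each Gaussian
    opacities: np.ndarray  # (N,)   alpha values controlling transparency
    colors: np.ndarray     # (N, 3) base (DC) color per Gaussian
```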

SHARP works by running the input image through a neural network that predicts the full 3D structure in one step. The model outputs everything needed to build and render the scene, with no optimization or fine-tuning required. The output maintains fine details and sharp structures, and because it’s built at real-world scale, users can move a virtual camera through the scene as if navigating a physical space.
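
In code, that single-pass design looks roughly like the following. This is a minimal sketch assuming a PyTorch checkpoint; the file names and the direct call signature are assumptions for illustration, not Apple’s actual API:

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Assumed checkpoint name; Apple's repo defines its own model class and loader.
model = torch.load("sharp_pretrained.pt", weights_only=False)
model.eval()

# Prepare the single input photo as a (1, 3, H, W) tensor.
image = to_tensor(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # One forward pass predicts every Gaussian attribute at once;
    # there is no per-scene optimization or fine-tuning step.
    gaussians = model(image)
```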

In testing, SHARP outperformed other leading methods across multiple datasets without any additional training. It lowered the key perceptual error metrics, LPIPS and DISTS, by up to 43% while running about 1,000 times faster than previous approaches, according to the research. Apple has released both the code and pre-trained model weights on GitHub, along with a command-line tool that outputs 3D Gaussian splats as .ply files compatible with public 3DGS renderers.
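
Because the output is a standard splat .ply, it can be inspected with generic tooling. A small sketch using the plyfile package, assuming the property names established by the original 3DGS implementation (SHARP’s files may organize attributes differently):

```python
import numpy as np
from plyfile import PlyData  # pip install plyfile

# Assumed output filename from the command-line tool.
ply = PlyData.read("scene.ply")
verts = ply["vertex"]

# The standard 3DGS layout stores one Gaussian per "vertex" record.
means = np.stack([verts["x"], verts["y"], verts["z"]], axis=-1)
opacities = np.asarray(verts["opacity"])  # stored pre-activation in the 3DGS convention

print(f"{len(means)} Gaussians; scene extent: {np.ptp(means, axis=0)}")
```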


🌀 Tom’s Take:

SHARP shows how Apple could move beyond stereo capture toward full 3D scene reconstruction, potentially unlocking spatial photos from any image, not just ones taken with depth-enabled cameras.


Source: arXiv / GitHub