SpatialLM is a new open-source language model that takes raw 3D point clouds as input and outputs structured, machine-readable descriptions of indoor spaces. Feed it a messy 3D scan, and the model identifies exactly where the walls, doors, windows, and furniture are, outputting their precise metric coordinates and bounding boxes.

The system addresses a persistent bottleneck in spatial computing and 3D digitization workflows: extracting semantic meaning from raw capture data.

The Problem With Point Clouds

A point cloud is just a collection of 3D coordinates captured by LiDAR scanners, depth cameras, or photogrammetry software. While it accurately maps a physical space, the data is entirely unstructured: nothing in it inherently indicates that a vertical cluster of points forms a wall or that a gap between surfaces is a doorway.
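To make the point concrete, here is a minimal sketch (the coordinates are invented for illustration) of what raw capture data actually looks like: just an N×3 array of positions, with no labels or relationships attached.

```python
import numpy as np

# A point cloud carries only geometry -- no semantics. Whether a point
# belongs to a wall, a door frame, or a chair leg is not in the data.
cloud = np.array([
    [0.00, 0.00, 0.00],
    [0.00, 0.00, 2.70],   # vertically above the first point: a wall? a pipe?
    [1.52, 3.10, 0.45],
])
print(cloud.shape)  # (3, 3): three points, three coordinates each
```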

Extracting that structured information typically requires manual modeling or specialized reconstruction software that struggles with noisy, incomplete scans. SpatialLM changes the equation by applying language model reasoning directly to the raw spatial data.

How It Works

The model architecture follows the pattern of multimodal LLMs, but adapted for 3D input. A point cloud encoder processes the raw spatial data into tokens. A projection layer maps those tokens into the language model's embedding space. Then a standard LLM (either Llama 3.2 1B or Qwen 2.5 0.5B) generates structured output autoregressively.
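The encoder → projection → LLM pipeline can be sketched as below. This is a toy illustration of the data flow only: the dimensions, random linear layers, and function names are all placeholders, whereas the real system uses a Point Transformer encoder and a Llama/Qwen backbone.

```python
import numpy as np

POINT_FEATURES = 6   # e.g. x, y, z plus color per point (assumed)
ENCODER_DIM = 256    # point-token width (placeholder)
LLM_DIM = 2048       # language-model embedding width (placeholder)

def encode_point_cloud(points):
    """Stand-in for the point cloud encoder: raw points -> spatial tokens."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((POINT_FEATURES, ENCODER_DIM))
    return points @ W  # one toy token per point; real encoders pool/downsample

def project_to_llm(point_tokens):
    """Stand-in for the projection layer into the LLM's embedding space."""
    rng = np.random.default_rng(1)
    P = rng.standard_normal((ENCODER_DIM, LLM_DIM))
    return point_tokens @ P

points = np.random.default_rng(2).random((1024, POINT_FEATURES))
embeddings = project_to_llm(encode_point_cloud(points))
print(embeddings.shape)  # (1024, 2048) -- ready to prepend to text tokens
```

From here, the LLM treats the projected spatial tokens like any other prefix and decodes the structured scene description token by token.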

The output format is highly legible: Python-like dataclass instantiations, with one element per line. A wall gets defined by two 3D endpoints, height, and thickness. Doors and windows reference their parent wall with position and dimensions. Furniture gets a semantic label, 3D position, rotation angle, and bounding box scale. All coordinates are quantized into 1,280 bins across a 0-to-32-meter range, giving roughly 2.5cm resolution.
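The coordinate quantization is simple to verify: 32 m divided into 1,280 bins gives a 2.5 cm bin size. A minimal sketch (function name and clamping behavior are assumptions for illustration):

```python
def quantize(coord_m, range_m=32.0, num_bins=1280):
    """Map a metric coordinate onto one of 1,280 discrete bins.

    32 m / 1280 bins = 0.025 m, i.e. ~2.5 cm resolution.
    Clamping out-of-range values is an assumption of this sketch.
    """
    bin_size = range_m / num_bins          # 0.025 m
    return min(num_bins - 1, max(0, int(coord_m / bin_size)))

print(quantize(4.37))   # 174: the bin covering 4.350-4.375 m
print(quantize(0.0))    # 0
print(quantize(32.0))   # 1279 (clamped to the last bin)
```

Quantizing coordinates into a fixed vocabulary of bins is what lets a standard autoregressive LLM emit geometry as ordinary tokens.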

Critically, SpatialLM accepts point clouds from diverse sources, including LiDAR scans, RGB-D cameras, and even monocular RGB video reconstructed through SLAM.

Performance and Availability

On the Structured3D dataset, SpatialLM achieves a 94.3% F1 score for layout estimation, outperforming prior methods like RoomFormer (83.4%) and Meta's SceneScript (90.4%). On ScanNet's 3D object detection benchmark, it matches the specialist V-DETR model at 65.6% F1.

The models are small enough (0.5B and 1B parameters) to run on standard consumer hardware. The code, model weights, and training dataset are all open source. The GitHub repository hosts both versions 1.0 and 1.1, the latter using an improved Point Transformer encoder that doubles spatial resolution. Model weights are available on Hugging Face.

SpatialLM was developed by Manycore Research in collaboration with the Hong Kong University of Science and Technology.
