Meet SpatialLM: The AI That Understands Space Like Never Before, the 3D LLM Many Have Been Waiting For!
Say hello to SpatialLM, the game-changing 3D large language model that’s now live on Hugging Face, open-source and ready to rock your world. This isn’t just another AI; it’s a powerhouse for spatial reasoning, turning messy 3D data into crystal-clear, structured insights.
Think of it as giving machines the ability to “see” and “understand” the physical world in ways we’ve only dreamed of.
What Can SpatialLM Do?

SpatialLM eats unstructured 3D point clouds for breakfast and spits out detailed scene understanding: walls, doors, windows, and even objects with precise dimensions and semantic labels (yes, it knows the difference between a chair and a table). And here’s the kicker: it doesn’t demand dedicated capture hardware. The point clouds can come from plain phone videos, from RGBD images, or, if you have the gear, from LiDAR scans.
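To make “structured insights” concrete, here’s a minimal Python sketch of the kind of scene representation this implies: architectural elements plus semantically labeled, oriented boxes. The class and field names below are illustrative assumptions, not SpatialLM’s exact output grammar; consult the GitHub repo for the real format.

```python
# Illustrative sketch only: these dataclasses mimic the *kind* of structured
# scene SpatialLM emits (walls plus labeled, oriented bounding boxes). The
# names and field layout are assumptions, not the project's actual schema.
from dataclasses import dataclass

Point = tuple[float, float, float]

@dataclass
class Wall:
    start: Point        # wall segment start (meters)
    end: Point          # wall segment end (meters)
    height: float       # wall height (meters)
    thickness: float    # wall thickness (meters)

@dataclass
class Bbox:
    label: str          # semantic class, e.g. "chair" or "table"
    center: Point       # box center (meters)
    yaw: float          # rotation around the vertical axis (radians)
    scale: Point        # width, depth, height (meters)

scene = [
    Wall(start=(0.0, 0.0, 0.0), end=(4.2, 0.0, 0.0), height=2.6, thickness=0.1),
    Bbox(label="chair", center=(1.3, 0.8, 0.45), yaw=1.57, scale=(0.5, 0.5, 0.9)),
]
for element in scene:
    print(element)
```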
For example, using MASt3R-SLAM, SpatialLM can reconstruct an entire 3D layout from a simple monocular RGB video, aligning the result with the ground-truth camera trajectory for jaw-dropping accuracy.
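If you already have a reconstruction (from MASt3R-SLAM, an RGBD capture, or a LiDAR scan), the raw point cloud usually benefits from a quick cleanup pass before any downstream model sees it. Here’s a minimal sketch with Open3D; the file names, voxel size, and outlier parameters are illustrative assumptions, not values from the SpatialLM pipeline:

```python
# Generic point-cloud cleanup with Open3D. This is NOT SpatialLM's own
# preprocessing; the file names and parameter values are illustrative.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.ply")       # e.g. a MASt3R-SLAM export
pcd = pcd.voxel_down_sample(voxel_size=0.02)     # merge redundant points
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
o3d.io.write_point_cloud("scene_clean.ply", pcd)
print(f"{len(pcd.points)} points after cleanup")
```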
SpatialLM is trained on a large-scale, photo-realistic dataset in which walls and objects are realistically placed, accurately reflecting real-world scenarios and ensuring physical correctness.
Why Should You Care?
SpatialLM is more than just cool tech—it’s a tool that opens doors to endless possibilities:
- Smarter Robots: Imagine robots that can navigate cluttered rooms or warehouses by understanding every nook and cranny.
- Autonomous Vehicles: Cars that don’t just “see” obstacles but understand their spatial context—like spotting a parked car versus an open garage door.
- Immersive AR/VR: Build hyper-realistic augmented reality experiences or detailed indoor maps with ease.
- Scene Analysis Made Easy: From urban planning to construction, extract actionable insights from complex environments.
Features That Make SpatialLM Shine
Here’s why everyone’s talking about SpatialLM:
- Handles Any Input: Works with monocular videos, RGBD images, and LiDAR scans alike; no specialized rig required to get started.
- Detailed Outputs: Generates architectural elements (walls, doors, windows) and object bounding boxes with semantic labels.
- Semantic Smarts: Understands relationships between objects and spaces, not just raw geometry.
- Reconstructs from Videos: Turns simple phone videos into accurate 3D layouts using MASt3R-SLAM.
- Open Source + Accessible: Grab it on Hugging Face and start experimenting today (see the download sketch just below).
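Getting the weights locally takes a couple of lines with the standard huggingface_hub client. The repo id below is inferred from the model names in the License section and should be double-checked against the actual Hub page:

```python
# Download the model weights with the standard Hugging Face Hub client.
# The repo id is inferred from the model name and may need adjusting.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="manycore-research/SpatialLM-Llama-1B")
print(f"Model files are in {local_dir}")
```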
The Future Starts Now
With SpatialLM, we’re stepping into a world where machines don’t just process data—they understand it. Whether you’re building smarter robots, designing immersive AR worlds, or analyzing complex scenes, SpatialLM has your back.
License
SpatialLM-Llama-1B is derived from Llama3.2-1B-Instruct, which is licensed under the Llama3.2 license. SpatialLM-Qwen-0.5B is derived from the Qwen-2.5 series, originally licensed under the Apache 2.0 License.
All models are built upon the SceneScript point cloud encoder, licensed under the CC-BY-NC-4.0 License. TorchSparse, utilized in this project, is licensed under the MIT License.
Citation
@misc{spatiallm,
  title        = {SpatialLM: Large Language Model for Spatial Understanding},
  author       = {ManyCore Research Team},
  howpublished = {\url{https://github.com/manycore-research/SpatialLM}},
  year         = {2025}
}
Resources
- GitHub repository: https://github.com/manycore-research/SpatialLM
- Models on Hugging Face: SpatialLM-Llama-1B and SpatialLM-Qwen-0.5B