Spatial intelligence is older than language.
Long before humans drew symbols or shaped tools, we evolved the ability to form internal models of the world — imagining how a rock might slip underfoot, how a branch might bend, how an animal might move if startled. This capacity to simulate and anticipate is the foundation of physical reasoning and, arguably, of intelligence itself.
And for tens of thousands of years, humans have tried to externalize those inner worlds.
Cave drawings, Renaissance paintings, photography, cinematography, CAD software, modern game engines — each step brought us closer to projecting our internal models outward with increasing fidelity. A game designer today can invent a world and share it with billions, complete with physics, lighting, and interactive behavior. These synthetic worlds have even become training grounds for the next generation of intelligent agents, as seen in DeepMind's SIMA 2 announcement, where an agent learns to act and reason inside rich, simulated 3D environments.
Machines are now approaching their own version of this ability.
Modern Spatial AI systems — and soon full-fledged Large World Models — learn geometry, materials, light, movement. They infer the structure behind images and videos, predict how scenes evolve, and reason about what lies outside the frame. But unlike humans, their spatial understanding is latent. Hidden inside learned weights.
For us to interact with that understanding — to inspect it, navigate it, test it, or simply see what the model believes — we need a representation: a concrete, manipulable encoding of a 3D world that externalizes an AI's internal hypotheses.
That representation must feel continuous. It must carry the richness of real materials. It must be efficient enough to run on everyday devices. And it must be general enough to serve as a shared medium between humans and machines.
This is where the long evolution of digital 3D — from meshes to neural fields to splats — becomes central to the future of Spatial AI.
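To make "concrete, manipulable encoding" slightly more tangible, here is a minimal sketch of one candidate representation from that lineage: a scene stored as explicit 3D Gaussian splat primitives. This is an illustration under assumptions, not the API of any real library; the class and field names (GaussianPrimitive, position, scale, rotation, color, opacity) are hypothetical, and production systems pack these attributes into flat arrays with many more details.

```python
# Illustrative sketch only: one way to hold a 3D scene as explicit, editable data,
# loosely modeled on 3D Gaussian splatting. All names here are hypothetical.
from dataclasses import dataclass

import numpy as np


@dataclass
class GaussianPrimitive:
    position: np.ndarray  # (3,) world-space center of the splat
    scale: np.ndarray     # (3,) per-axis extent; with rotation, defines the splat's footprint
    rotation: np.ndarray  # (4,) unit quaternion orienting that footprint
    color: np.ndarray     # (3,) RGB (real systems often use spherical-harmonic coefficients)
    opacity: float        # alpha used when splats are blended into an image


# A whole scene is just a collection of such primitives.
scene = [
    GaussianPrimitive(
        position=np.array([0.0, 1.0, -2.0]),
        scale=np.array([0.05, 0.05, 0.02]),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),
        color=np.array([0.8, 0.3, 0.2]),
        opacity=0.9,
    )
]

# Because the encoding is explicit, inspecting or editing the "world" is ordinary
# data manipulation: for example, shift every primitive one meter along x.
for g in scene:
    g.position[0] += 1.0
```

The specific fields matter less than the property at stake: the model's spatial hypotheses become ordinary data that a person, or another program, can read, edit, and render.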