From Meshes to Splats: How Spatial AI is Redefining 3D Scene Representation
Spatial intelligence is older than language.
Long before humans drew symbols or shaped tools, we evolved the ability to form internal models of the world — imagining how a rock might slip underfoot, how a branch might bend, how an animal might move if startled. This capacity to simulate and anticipate is the foundation of physical reasoning and, arguably, of intelligence itself.

And for tens of thousands of years, humans have tried to externalize those inner worlds.
Cave drawings, Renaissance paintings, photography, cinematography, CAD software, modern game engines — each step brought us closer to projecting our internal models outward with increasing fidelity. A game designer today can invent a world and share it with billions, complete with physics, lighting, and interactive behavior. These synthetic worlds have even become training grounds for the next generation of intelligent agents, as seen in DeepMind’s SIMA 2 announcement, where AI agents learn to act and reason inside rich, simulated 3D environments.

Machines are now approaching their own version of this ability.

Modern Spatial-AI systems — and soon full-fledged Large World Models — learn geometry, materials, light, movement. They infer the structure behind images and videos, predict how scenes evolve, and reason about what lies outside the frame. But unlike humans, their spatial understanding is latent. Hidden inside learned weights.

For us to interact with that understanding — to inspect it, navigate it, test it, or simply see what the model believes — we need a representation.
A concrete, manipulable encoding of a 3D world that externalizes an AI’s internal hypotheses.

That representation must feel continuous.
It must carry the richness of real materials.
It must be efficient enough to run on everyday devices.
And it must be general enough to serve as a shared medium between humans and machines.

This is where the long evolution of digital 3D — from meshes to neural fields to splats — becomes central to the future of Spatial AI.
Meshes and Textures — The First Language of Digital Worlds
Before AI started reconstructing worlds from data, humans built them by hand. The backbone of that craft — for films, games, architecture, engineering — was the polygon mesh. Artists sculpted surfaces from triangles and quads, then wrapped them with texture maps that encoded color, fine detail, and sometimes even approximations of lighting. It was a remarkable achievement: a compact, GPU-friendly approximation of the physical world that could be rendered anywhere, from a movie pipeline to a PlayStation.
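
To make the abstraction concrete, here is a minimal sketch of what a textured mesh boils down to as data; the arrays are illustrative rather than any particular engine's format.

```python
import numpy as np

# A textured mesh is, at its core, a few flat arrays: where the vertices sit,
# how they are stitched into triangles, and where each vertex samples the texture.
vertices = np.array([[0.0, 0.0, 0.0],                  # 3D vertex positions
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]], dtype=np.float32)
faces = np.array([[0, 1, 2]], dtype=np.int32)          # one triangle, as indices into vertices
uvs = np.array([[0.0, 0.0],                            # 2D texture coordinates per vertex
                [1.0, 0.0],
                [0.0, 1.0]], dtype=np.float32)
texture = np.zeros((256, 256, 3), dtype=np.uint8)      # a painted image glued onto the surface

# Everything the renderer knows about appearance lives in that static image:
# it samples the texture at interpolated UVs, regardless of viewpoint, lighting, or time.
```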

But meshes weren’t designed to be true representations of reality. They were designed to be workable illusions — efficient abstractions created for human consumption. Their surfaces are discretized, their materials painted, their lighting baked in. They function beautifully for visual storytelling, where the goal is to be convincing rather than physically complete.

Seen through the eyes of a spatial-AI system, their limitations become obvious.
A mesh has no concept of uncertainty, no smooth gradients of geometry, no natural way to express how a material looks when light grazes it or reflects off it. A texture map is a static picture glued to a surface, blind to context, viewpoint, or time. Even advanced tricks like normal maps and shaders are ultimately patches on top of a fundamentally 2D illusion masquerading as 3D.

And perhaps most importantly:
meshes capture what humans can model — not what the world actually is.
They encode our handcrafted interpretation of surfaces, not the raw, continuous behavior of real materials and light.

Meshes excel at displaying worlds — at painting our imagination onto a screen.
But Spatial AI needs something different: a representation that is learned from data, that reflects the ambiguities of sensors, that respects the continuous nature of the physical world, and that can scale from real environments to imagined ones.

Traditional meshes were the first language of digital 3D.
But they were never meant to be the language of machine understanding.
Neural Rendering — Learning How the World Appears
The next leap in 3D representation arrived not from artists or graphics engineers, but from machine learning. Neural radiance fields — NeRFs — changed the mindset entirely. Instead of explicitly modeling surfaces, NeRFs learned how a scene looks from data. You feed the model a set of posed photos, and it infers a smooth, continuous function that describes how light should appear from any viewpoint.

It was a profound shift:
geometry, color, shading, and reflectance were no longer handcrafted approximations — they were emergent. Reflections appeared naturally. Subtle translucency and soft shadows emerged from the data. The world felt continuous rather than stitched together from polygons and texture patches.

For the first time, we had a representation that behaved a little like our own perception:
it interpolated, it generalized, it filled in the gaps.

But the breakthrough came with a catch. NeRFs were slow to train, heavy to store, and expensive to render. Their beauty lived inside neural networks that required hundreds of network evaluations per pixel. They were breathtakingly realistic, yet stubbornly impractical — especially for real-time applications or mobile devices. A representation that was perfect for still images or offline rendering simply couldn’t keep up with interactive spatial-AI systems.
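
To see where that cost comes from, here is a minimal sketch of NeRF-style volume rendering along one camera ray, with a toy closed-form function standing in for the trained network; the sample count and the toy field are illustrative.

```python
import numpy as np

def toy_radiance_field(points, view_dir):
    """Stand-in for the neural network: returns (density, rgb) for each sample point."""
    density = np.exp(-4.0 * np.linalg.norm(points - 0.5, axis=-1))  # a soft blob near (0.5, 0.5, 0.5)
    rgb = np.tile(np.array([0.8, 0.3, 0.2]), (points.shape[0], 1))  # constant color for simplicity
    return density, rgb

def render_ray(origin, direction, n_samples=192, near=0.0, far=2.0):
    """NeRF-style quadrature: query many samples along the ray and composite them."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction                   # one field query per sample
    density, rgb = toy_radiance_field(points, direction)
    delta = np.diff(t, append=far)                              # spacing between samples
    alpha = 1.0 - np.exp(-density * delta)                      # opacity contributed by each sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # light surviving to each sample
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)                 # final pixel color

color = render_ray(np.zeros(3), np.array([0.35, 0.35, 0.87]))
# With a hundred or more samples per ray and one ray per pixel, a single 1080p frame
# already requires hundreds of millions of network evaluations.
```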

NeRFs raised a tantalizing question:
Could we capture the richness of learned appearance while achieving the speed and efficiency needed for worlds that must be navigated, simulated, or shared?

The search for that balance — neural fidelity with real-time performance — is what ultimately led to Gaussian splatting.
Gaussian Splats — Fidelity Meets Efficiency
Gaussian splatting arrived as the unlikely bridge between the fidelity of neural fields and the practicality of classical graphics. Instead of describing a scene with triangles or burying it inside a neural network, splats represent the world as a cloud of tiny 3D Gaussians — soft, elliptical volumes that each carry position, scale, color, opacity, and even view-dependent shading when needed.

It turns out this simple idea is remarkably powerful.
Gaussians blend seamlessly into one another, so edges stay soft, glossy materials behave naturally, and fine details emerge without the brittle artifacts of polygons. And because each Gaussian is an explicit primitive — not the output of a dense neural function — GPUs can render them extremely quickly.

The result is a representation with near-NeRF richness but real-time speed, even on modest hardware.

What makes splats especially compelling is how accessible they’ve become. We’ve moved from a world where high-quality 3D capture required a structured-light scanner or a photogrammetry rig, to one where a modern smartphone can generate a surprisingly faithful splat reconstruction. The barrier to 3D capture is collapsing: anyone can point a camera, move a little, and produce an immersive, navigable world.

And the technology scales. Entire environments — tens of millions of splats — can now stream and render interactively in a mobile browser. For Spatial AI, this combination of fidelity, continuity, and portability isn’t just convenient; it’s foundational. It provides a representation that is realistic enough for humans, efficient enough for devices, and expressive enough for world models to reveal what they understand.

Gaussian splats showed that a digital world could be learned like a neural field, rendered like a traditional scene, and shared like a lightweight asset.
It’s the first representation that begins to feel universal.
Compression and Streaming — From Raw Splats to SOG, SOGS, and .spz
As Gaussian splats gained traction, one practical issue appeared almost immediately: they were big.
A single splat carries far more information than a point in a point cloud — not just a position and color, but a full 3D covariance encoding its shape and orientation, an opacity, and often a set of low-order spherical harmonics coefficients describing view-dependent appearance. Multiply that by millions of splats and you end up with scenes that are gorgeous… and massive.
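
A back-of-the-envelope sketch of the uncompressed payload, assuming the common position + scale + rotation + opacity + spherical-harmonics layout (exact parameterizations vary between implementations):

```python
# Rough float count per splat in the common uncompressed parameterization.
FLOATS_PER_SPLAT = {
    "position": 3,   # x, y, z
    "scale": 3,      # ellipsoid extent along its principal axes
    "rotation": 4,   # orientation quaternion (scale + rotation together define the covariance)
    "opacity": 1,
    "sh_dc": 3,      # base RGB color (degree-0 spherical harmonics)
    "sh_rest": 45,   # view-dependent color: 15 higher-order coefficients per channel
}

floats = sum(FLOATS_PER_SPLAT.values())        # 59 floats
bytes_per_splat = floats * 4                   # float32
scene_mb = 5_000_000 * bytes_per_splat / 1e6   # a five-million-splat scene
print(f"{bytes_per_splat} B per splat, ~{scene_mb:.0f} MB for 5M splats before compression")
# -> 236 B per splat, ~1180 MB for 5M splats before compression
```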

The first wave of splat scenes was shared in raw .ply files, simply extending the classic point-cloud format to include these extra attributes. It worked — .ply is simple, portable, and well-understood — but it wasn’t designed for scenes where every point is a tiny ellipsoid with shading coefficients. .ply files ballooned into hundreds of megabytes or even multiple gigabytes. Great for research. Terrible for the web.

From there came the first attempts at purpose-built splat formats:
.splat, .ksplat, .plets, .gsl, and a dozen community experiments. Each tried to make splats more efficient — by quantizing attributes, reorganizing data, or breaking scenes into chunks. They helped, but these formats were stepping stones. None fully solved the dual challenge of compression and streamability needed for real-time Spatial AI applications.

The real jump came with Self-Organizing Gaussians (SOGS).
SOGS reorganized the splat attributes into smooth 2D grids, so nearby Gaussians in space tended to map to neighboring cells. Once that spatial regularity existed, standard image/video codecs could be applied, bringing eye-opening compression ratios — often 20×–40× smaller than raw point-based formats, with surprisingly little degradation.
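
The core trick can be sketched in a few lines. This is a simplified stand-in for the actual SOGS algorithm: pack one attribute into a 2D grid whose neighboring cells hold similar values, then let an ordinary image codec exploit the smoothness. The sort-by-x ordering and lossless PNG output below are illustrative placeholders for the learned 2D arrangement and the lossy codecs used in practice.

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
n = 256 * 256
positions = rng.random((n, 3)).astype(np.float32)        # toy splat centers in a unit cube
opacity = np.clip(positions[:, 0] + 0.1 * rng.standard_normal(n), 0.0, 1.0)  # attribute correlated with position

# "Self-organize": order the splats so that similar ones end up next to each other
# (a plain sort on x stands in for the learned 2D arrangement in SOGS).
order = np.argsort(positions[:, 0])
grid = opacity[order].reshape(256, 256)                   # one attribute becomes one smooth image

# Quantize to 8 bits and hand the grid to a standard image codec.
Image.fromarray((grid * 255).astype(np.uint8)).save("opacity.png")
# Smooth grids compress far better than the same values in arbitrary order,
# which is where the large compression ratios come from.
```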

But 2025 brought an even cleaner solution: Spatially Ordered Gaussians (SOG), the successor and refinement of SOGS.
SOG kept the principle of spatial coherence but implemented it in a way tailored for modern GPU pipelines. Instead of grids, it arranges Gaussians along a Morton curve — a Z-shaped traversal that interleaves the binary bits of x, y, and z. The result is not better compression per se, but dramatically better GPU locality: splats that are close in 3D are close in memory. That means fewer random memory fetches, better caching, and the ability to render tens of millions of splats in real time on mobile and in the browser.
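
A minimal sketch of the bit interleaving behind a Morton (Z-order) key; sorting splats by such a key is what produces the memory locality described above. The 10-bit quantization and the commented usage lines are illustrative assumptions, not a specific implementation.

```python
def part1by2(v: int) -> int:
    """Spread the 10 low bits of v so there are two zero bits between each."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0xFF0000FF
    v = (v | (v << 8))  & 0x0300F00F
    v = (v | (v << 4))  & 0x030C30C3
    v = (v | (v << 2))  & 0x09249249
    return v

def morton3(x: int, y: int, z: int) -> int:
    """Interleave the bits of quantized x, y, z into a single Z-order key."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Quantize splat centers to a 1024^3 grid, compute their keys, and sort:
# splats that are close in 3D now sit close together in the array (and in GPU memory).
#   keys = [morton3(*q) for q in quantized_positions]
#   order = sorted(range(len(keys)), key=keys.__getitem__)
```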

In parallel, Niantic’s .spz format embraced the same philosophy — spatial coherence, aggressive quantization, and a single-stream container optimized for mobile AR workloads. It became the production-ready counterpart of SOG: tiny, fast to load, and engineered specifically for WebXR and phone-based AR capture.

Together, SOGS, SOG, and .spz represent the state of the art in 2025.
They turn splats into portable 3D worlds — small enough to transmit like a JPEG, structured enough to stream, and efficient enough to render anywhere.

This is what finally makes splats not just a beautiful representation for AI training but a deployable medium for Spatial AI, robotics, AR/VR, and world models that need to share what they understand.

Church of Saints Peter and Paul: a WebGL application made with PlayCanvas (https://playcanvas.com)

Structure and Physics — When Splats Meet the Real World
Gaussian splats excel at capturing how a scene looks — its colors, textures, soft boundaries, reflections, even the subtle cues that reveal material properties. And that alone already gives Spatial-AI systems more to work with than traditional meshes: appearance is not just decoration. It is information. The sheen of metal, the grain of wood, the softness of fabric — all of these visual signatures correlate with how objects behave.

This is why splats have begun to show surprising promise not just for rendering, but for robot learning and simulation.
Recent “Real-to-Sim” work demonstrated that a high-fidelity splat reconstruction, with appropriate segmentation and alignment, can support soft-body simulation and robot policy evaluation. The remarkable part wasn’t the physics engine — it was how well the robot’s behavior in simulation correlated with real-world outcomes (r > 0.9). Simply put: when the appearance is closer to reality, the agent’s understanding of the scene improves, even before explicit geometry is added.

But splats alone stop short of telling you exactly what the world is.
A splat cloud has no native notion of a surface, a volume, a solid boundary, or a doorway. It encodes the radiance of the world, not its topology.

And agents — virtual or embodied — need topology.
They need to know what can be walked on, grasped, pushed, or avoided.
They need the world to push back.

This is where a layered approach emerges as the most natural solution.

Keep Gaussian splats as the visual layer — the rich, continuous representation that preserves appearance and all the subtle cues that drive perception and material inference. On top of that, add a lightweight structural layer: a collision mesh, voxel field, or geometric proxy that provides just enough solidity for physical interaction.
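
One way to picture the layering, with illustrative names rather than any particular engine's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SplatLayer:
    """Visual layer: the Gaussian splats the renderer draws."""
    centers: np.ndarray        # (N, 3) positions
    covariances: np.ndarray    # (N, 3, 3) shape and orientation of each Gaussian
    colors: np.ndarray         # (N, 3) base RGB (plus SH coefficients in practice)
    opacities: np.ndarray      # (N,)

@dataclass
class CollisionProxy:
    """Structural layer: coarse geometry a physics engine can query."""
    vertices: np.ndarray       # (V, 3) simplified surface, e.g. extracted from the splats
    triangles: np.ndarray      # (T, 3) indices into vertices

@dataclass
class HybridScene:
    visual: SplatLayer         # what agents (and humans) see
    structure: CollisionProxy  # what agents can stand on, grasp, or bump into
    # Both layers live in one world coordinate frame, so a contact point computed
    # on the proxy can be looked up against nearby splats for appearance cues.
```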

It doesn’t need to be beautiful.
It just needs to be right enough.

The result is a hybrid world where:
  • splats provide perceptual realism,
  • proxies provide affordances and boundaries,
  • physics engines provide dynamics,
  • and Spatial-AI systems bind everything together into actionable understanding.
In practice, this mirrors how humans see and act.
We perceive the world through appearance, but we reason about it through an internal model of structure — our own mental “collision mesh.” Splats provide the former with unprecedented fidelity; lightweight geometry supplies the latter with minimal cost.

Together, they form worlds that are both true to look at and true to act in — and that combination is precisely what next-generation Spatial-AI needs.

Historic Sites Navigator: interactive Gaussian splat based exploration of historic sites (dfattal.github.io)

Gaussian in the Wild — Learning Order from Chaos
Once splats are paired with lightweight geometry, you get a world that both looks real and behaves real — a representation rich enough for perception and structured enough for action. But that still leaves a practical question hanging in the air:

Where do these high-quality splat worlds come from?

Traditionally, from careful captures.
Multi-camera rigs, controlled lighting, slow camera sweeps, or specialized scanners. Splats improved the fidelity of the output, but the input still needed to be clean, deliberate, and well-planned — not something you could expect at planetary scale.

Recent work changed that completely.

The family of techniques often referred to as Gaussian in the Wild demonstrated that you can reconstruct detailed 3D Gaussian-splat scenes from the messiest possible data: crowdsourced tourist photos scraped from the internet. Different cameras, different viewpoints, different times of day. People blocking half the frame. Cars, umbrellas, reflections, shadows. Images that any classical reconstruction pipeline would choke on.

And yet, these methods extract coherent, high-fidelity 3D worlds from the chaos.

The breakthrough is twofold. First, robust optimization and appearance modeling allow the system to reconcile wildly inconsistent viewpoints into a stable splat scene. Second, “transient filtering” steps identify and remove anything that doesn’t belong to the permanent structure of the environment — pedestrians, buses, animals, moving shadows, flags, even passing weather artifacts.
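
As a rough illustration of the transient-filtering idea (a simplified stand-in, not the method of any specific paper): pixels that the static reconstruction consistently fails to explain are flagged as transient and dropped from the photometric loss, so moving objects stop pulling the static splats toward them.

```python
import numpy as np

def transient_mask(rendered: np.ndarray, observed: np.ndarray, k: float = 2.5) -> np.ndarray:
    """Flag pixels the static scene cannot explain (likely pedestrians, cars, shadows).

    rendered, observed: (H, W, 3) float images in [0, 1] for one training view.
    Returns a boolean (H, W) mask that is True where the pixel looks transient.
    """
    residual = np.abs(rendered - observed).mean(axis=-1)   # per-pixel photometric error
    threshold = residual.mean() + k * residual.std()       # simple outlier threshold
    return residual > threshold

def masked_l1_loss(rendered: np.ndarray, observed: np.ndarray) -> float:
    """Photometric loss that ignores transient pixels during optimization."""
    keep = ~transient_mask(rendered, observed)
    return float(np.abs(rendered - observed).mean(axis=-1)[keep].mean())
```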

What remains is a clean, static, photorealistic reconstruction that often surpasses classical photogrammetry in cluttered environments.

This matters deeply for Spatial AI.

It shows that splat-based representations aren’t just for controlled scans or polished datasets. They can learn from the same ragged, impulsive imagery humans produce when traveling — the collective, uncurated memory of the world. And because splats encode appearance first, not just geometry, the reconstructed scenes retain texture, shading, and material cues that are essential for downstream reasoning.

Gaussian-in-the-Wild techniques turn the visual noise of the internet into structured worlds.

And that brings us closer to the core promise of Spatial AI: systems that can build, refine, and express their understanding of the real world — even when the inputs are incomplete, inconsistent, or messy.
Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting (NeurIPS 2025 Spotlight)
Spatial-AI Generation — When Models Begin to Reveal Their Inner Worlds
If reconstruction shows how machines can recover the world from data, a new frontier is emerging that shows how they can begin to imagine it. The last two years have brought a wave of models that take sparse input — a few photos, a single image, or even a short text prompt — and expand it into a fully navigable 3D scene, often represented as a Gaussian splat.

These aren’t just compression tricks or geometry estimates.
They are early examples of AI systems revealing the worlds they carry in their heads.

The first clear signs came from models like CAT3D and SEVA.
Google DeepMind’s CAT3D takes a single photograph and expands it into a consistent 3D splat representation, inferring geometry and appearance that were never directly observed. Stability AI’s SEVA works in a similar space: starting from one or a few images, it reconstructs an immersive Gaussian-splat scene, filling in plausible structure and extending the visible world.

CAT4D then pushed the idea further into time. Instead of beginning with a still image, it takes an input video and infers a dynamic 4D environment — something navigable not just in space but through time as well. The result feels much less like a sequence of frames, and more like a small, coherent world rebuilt from a moving camera.

Most recently, Marble from World Labs pulled these ideas into a consumer-friendly tool. From one or more photos — and increasingly even from text prompts — Marble hallucinates a plausible 3D world around the input, generating new geometry, new surfaces, and new viewpoints far beyond the original frame. What begins as a flat picture becomes a place you can explore.

Marble | World Labs: create and share 3D worlds with Marble

It’s important to note that these models are not full-blown Large World Models — at least not yet.
They are derived from diffusion video models that clearly encode some 3D structure implicitly, but it remains unclear whether they hold a true, stable internal world model or are simply leveraging vast internet priors to “fake” spatial consistency. The line between genuine 3D understanding and learned visual pattern-completion is still blurry.

Genie 3, also recently released by DeepMind, feels like a step closer to a genuine world model.
It doesn’t just render plausible views — it simulates. Characters can climb stairs, fall off edges, push objects, and interact with their environment in ways that suggest an internal physics engine rather than simple visual interpolation. The system behaves as if it has learned the rules of a world, not just its appearance.

It is still early. These generated worlds are not always consistent or globally accurate. But the friction is evaporating fast. We’re entering an era where any photo — or any idea — can become a navigable environment. Where machines can begin to show us not just what they see, but what they believe the world could look like.
Beyond Fidelity — Representation as Expression
We opened with the idea that humans express their internal models of the world through art, maps, stories, and now fully interactive digital environments. Spatial-AI systems are beginning to form their own internal worlds — geometric, physical, contextual — but those worlds are invisible unless we give them a language.

That language is the representation.

Meshes were our first shared vocabulary.
Neural fields expanded what could be expressed.
Gaussian splats — especially in their modern compressed, streamable forms — are the first representation flexible enough to bridge both sides: faithful for human perception, light enough for real devices, and expressive enough for an AI to reveal what it thinks a world looks like.

What matters isn’t that splats look pretty.
What matters is that they allow an AI to externalize a belief.

Every generated splat scene — from CAT3D expansions to SEVA hallucinations to Marble’s one-photo worlds — is the model answering a simple question:

“Given this sparse input, here is the world I think could exist.”

It is the AI’s internal 3D reasoning made visible, navigable, and debuggable.

For reconstruction models, splats serve as a transparent hypothesis about what was really there.
For generative world models, splats serve as the canvas onto which imagination is projected.
For future LWMs, splats may become the medium through which agents communicate spatial understanding with us — and with each other.

The arc of the story circles back:
Representation is not decoration.
It’s interpretation.
It’s expression.

As sensors, models, and runtime engines converge, we move toward systems that not only reason in 3D, but can reconstruct, invent, and share their internal worlds as fluidly as humans sketch, film, or simulate ours.

If AI is to gain spatial intelligence — and if we’re to understand it — this mutual language of representation is the bridge.
Gaussian splats just happen to be the first one that feels ready for both sides.