A natural question arises: can we train Spatial AI using ordinary videos alone? After all, moving phones capture multiple viewpoints over time, and OpenAI has even described video diffusion models as
"world simulators" systems that learn to predict the evolution of a scene from frame to frame. This is the same line of thinking driving RunwayML’s
recent pivot, where its video generation models are now being fine-tuned for robotics training. The appeal is clear: simulated rollouts are cheaper and faster than collecting endless real-world examples. But here’s the catch: these models are fundamentally guessing what comes next, not knowing the true 3D structure behind the pixels. They churn out plausible frames, not grounded geometry.