Embodied AI World Model Architecture: Video Backbone + Dreamer-Style Latent Control
How to combine video generative world models with Dreamer-style latent dynamics for closed-loop robot control.
Core Insight
Video generative models ≠ closed-loop controllers. A video backbone can learn rich world priors, but action-conditioned, causal, low-latency latent dynamics must be separated out for real feedback control. These are two distinct roles that must be architecturally decoupled.
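One way to picture the decoupling is a two-rate loop: the video backbone refreshes its context occasionally, while the latent controller runs on every control tick. The sketch below is only a scheduling illustration under that assumption. `TwoRateLoop`, `backbone_period`, and the event tuples are hypothetical names, and both model calls are replaced by stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class TwoRateLoop:
    """Hypothetical scheduler: slow video backbone, fast latent controller."""

    backbone_period: int = 10  # control ticks per backbone refresh (assumed value)
    events: list = field(default_factory=list)

    def run(self, num_ticks: int) -> list:
        context_tick = None
        for t in range(num_ticks):
            if t % self.backbone_period == 0:
                # Stand-in for an expensive video-backbone call that refreshes priors.
                context_tick = t
                self.events.append(("backbone", t))
            # Stand-in for the cheap per-tick latent update and action selection,
            # which reuses the most recent backbone context.
            self.events.append(("control", t, context_tick))
        return self.events
```

Calling `TwoRateLoop(backbone_period=5).run(12)` records a backbone refresh at ticks 0, 5, and 10 and a control step at every tick, so the control path never waits on the backbone.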
---
Why Video World Models Can't Directly Drive Closed-Loop Control
- Not trained for action-effect discriminability: video models optimize visual plausibility, not a precise mapping from actions to consequences
- Pixel space is redundant for control: robot control depends on pose, contact, and force, not raw pixel fidelity
- Open-loop rollout drifts fast: without correction from real observations, prediction errors compound and the imagined video diverges from reality
- Inference latency: video diffusion models are too slow for control-rate inference by default; running a 14B video diffusion model at 7 Hz closed-loop (e.g., DreamZero) is a hard systems problem, not a given
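The drift and latency points can be made concrete with a toy calculation. The sketch below assumes a scalar linear system whose learned model has a 2% gain error; `rollout_errors`, `control_budget_ms`, and all parameter values are illustrative, not taken from any real robot or model.

```python
def rollout_errors(a_true=1.00, a_model=1.02, x0=1.0, steps=50):
    """Compare open-loop imagination with one-step prediction from real observations."""
    x_true = x0
    x_open = x0
    open_err, closed_err = [], []
    for _ in range(steps):
        x_pred_closed = a_model * x_true  # predict one step from the real observation
        x_open = a_model * x_open         # predict from the model's own last prediction
        x_true = a_true * x_true          # the real system advances
        open_err.append(abs(x_open - x_true))
        closed_err.append(abs(x_pred_closed - x_true))
    return open_err, closed_err


def control_budget_ms(rate_hz):
    """Wall-clock time available per control step at a given loop rate."""
    return 1000.0 / rate_hz


if __name__ == "__main__":
    open_err, closed_err = rollout_errors()
    print("open-loop error after 50 steps:", open_err[-1])
    print("one-step error with correction:", closed_err[-1])
    print("budget per step at 7 Hz (ms):", control_budget_ms(7))
```

In this toy setup the open-loop error grows at every step because the model error is applied to its own output, while the corrected one-step error stays near 0.02. At 7 Hz the whole perceive, predict, act cycle has to fit in roughly 143 ms.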
---
Dreamer's Actual Base Model
Dreamer (v1/v2/v3) uses RSSM (Recurrent State-Space Model) — a control-oriented latent dynamics model, NOT a video foundation model.
Structure:
- Deterministic hidden state — acts as memory/belief
- Stochastic latent state — captures uncertainty/scene state
- Transition: s_{t+1} ~ p(s_{t+1} | s_t, a_t) (action-conditioned)
- Policy/value trained entirely on imagined latent trajectories, not pixel rollouts
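The structure above can be sketched as a toy latent rollout. This is a minimal illustration, not Dreamer's implementation: the weights are random and untrained, a single tanh layer stands in for the GRU, there is no encoder or posterior update, and `TinyRSSM`, `imagine_step`, and `imagine` are hypothetical names. It uses plain Python lists to stay dependency-free.

```python
import math
import random


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TinyRSSM:
    """Toy RSSM prior with random, untrained weights (illustrative only)."""

    def __init__(self, stoch=3, deter=5, action=2, seed=0):
        self.rng = random.Random(seed)
        self.stoch, self.deter = stoch, deter
        in_dim = deter + stoch + action
        self.W_h = self._init(deter, in_dim)
        self.W_mu = self._init(stoch, deter)
        self.W_sig = self._init(stoch, deter)

    def _init(self, rows, cols):
        return [[self.rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

    def initial(self):
        return [0.0] * self.deter, [0.0] * self.stoch

    def imagine_step(self, h, s, a):
        # Deterministic memory update: h' = tanh(W_h [h; s; a]) (GRU stand-in).
        h_next = [math.tanh(v) for v in matvec(self.W_h, h + s + a)]
        # Stochastic state: s' ~ Normal(mu(h'), sigma(h')), sampled with no observation.
        mu = matvec(self.W_mu, h_next)
        sigma = [math.log1p(math.exp(v)) + 0.1 for v in matvec(self.W_sig, h_next)]
        s_next = [m + sd * self.rng.gauss(0.0, 1.0) for m, sd in zip(mu, sigma)]
        return h_next, s_next


def imagine(model, policy, horizon):
    """Roll the latent state forward under a policy without decoding any pixels."""
    h, s = model.initial()
    trajectory = []
    for _ in range(horizon):
        a = policy(h, s)
        h, s = model.imagine_step(h, s, a)
        trajectory.append((h, s, a))
    return trajectory
```

For example, `imagine(TinyRSSM(seed=0), lambda h, s: [1.0, 0.0], horizon=5)` returns five `(h, s, a)` tuples. In a Dreamer-style setup, trajectories like these (from a trained model) are what the policy and value networks would be trained on; the training step itself is not shown here.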