Inferensys

Recurrent State-Space Model (RSSM)

A Recurrent State-Space Model (RSSM) is a latent dynamics model architecture that combines a deterministic recurrent network with a stochastic latent variable to model temporal dependencies.
MODEL-BASED REINFORCEMENT LEARNING

What is a Recurrent State-Space Model (RSSM)?

A core architecture for learning world models from high-dimensional observations, enabling agents to plan via imagination.

A Recurrent State-Space Model (RSSM) is a latent dynamics model architecture that combines a deterministic recurrent neural network (RNN) with a stochastic latent variable to model temporal dependencies in sequential data. It forms the predictive core of world models in algorithms like Dreamer, enabling agents to learn a compressed representation of the environment and simulate future trajectories for planning. The model operates in a learned latent space rather than on raw pixels, which keeps prediction and planning computationally cheap.

The RSSM's hybrid design separates deterministic memory (the RNN state) from stochastic uncertainty (the latent variable), allowing it to model partially observable environments effectively. This architecture is trained via variational inference to predict future latent states and reconstruct observations. Its primary use is generating imagined rollouts for training policies and value functions entirely within the model's internal simulation, a process known as latent imagination, which is highly sample-efficient.

ARCHITECTURAL BREAKDOWN

Key Components of an RSSM

A Recurrent State-Space Model (RSSM) is a latent dynamics model that combines deterministic recurrence with stochastic latent variables to model temporal dependencies. It is the core world model in algorithms like Dreamer, enabling agents to learn and plan in a compressed, abstract state space.

01

Deterministic Recurrent State

The deterministic recurrent state (often denoted h_t) is a hidden vector updated by a recurrent neural network (RNN), typically a Gated Recurrent Unit (GRU) or LSTM. It functions as a compressed memory of all past observations and actions, providing a stable, predictable pathway for temporal information flow.

  • Purpose: Maintains a deterministic, evolving context for the stochastic latent variables.
  • Mechanism: Updated as h_t = f_RNN(h_{t-1}, s_{t-1}, a_{t-1}), where s is the stochastic latent and a is the action.
  • Role in Planning: Provides the consistent "backbone" for multi-step imagination rollouts, allowing gradients to flow effectively through time via backpropagation.
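The update rule above can be sketched with a plain GRU cell. This is a minimal illustration under stated assumptions, not the implementation used by any particular Dreamer release: the helper names `init_gru` and `gru_step`, the layer sizes, and the random weights are all hypothetical, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(rng, in_dim, hid_dim, scale=0.1):
    """Random weights for one GRU cell (hypothetical helper, no biases)."""
    shape = (in_dim + hid_dim, hid_dim)
    return {name: rng.normal(0.0, scale, shape) for name in ("Wz", "Wr", "Wc")}

def gru_step(params, h_prev, s_prev, a_prev):
    """h_t = f_RNN(h_{t-1}, s_{t-1}, a_{t-1}) using a standard GRU update."""
    x = np.concatenate([s_prev, a_prev])      # input: previous latent and action
    xh = np.concatenate([x, h_prev])
    z = sigmoid(xh @ params["Wz"])            # update gate
    r = sigmoid(xh @ params["Wr"])            # reset gate
    cand = np.tanh(np.concatenate([x, r * h_prev]) @ params["Wc"])  # candidate state
    return (1.0 - z) * h_prev + z * cand      # blend old memory with the candidate

rng = np.random.default_rng(0)
S, A, H = 4, 2, 8                             # latent, action, hidden sizes (arbitrary)
params = init_gru(rng, S + A, H)
h = np.zeros(H)                               # initial deterministic state
h_next = gru_step(params, h, rng.normal(size=S), rng.normal(size=A))
```

Note that h_next depends only on the previous state, latent, and action, so the same function can be reused unchanged when no observation is available.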
02

Stochastic Latent State

The stochastic latent state (often denoted s_t or z_t) is a random variable, typically modeled as a diagonal Gaussian (later Dreamer versions use categorical distributions instead), that represents the uncertain aspects of the environment at time t. During training and state estimation it is sampled from a distribution conditioned on the current observation and the deterministic state.

  • Purpose: Captures the partial observability and inherent randomness of the environment that cannot be perfectly memorized.
  • Mechanism: s_t ~ q(s_t | h_t, o_t), the posterior, where o_t is the observation. The sampled latent is then used to predict the observation and reward.
  • Role in Imagination: During planning ("dreaming"), future s_t values are sampled from a prior p(s_t | h_t) without new observations, enabling the agent to simulate possible futures.
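The difference between sampling with and without an observation can be sketched as follows. This is an illustrative example assuming diagonal Gaussian latents; the function `gaussian_params`, the single linear layers standing in for the prior and posterior networks, and all sizes are assumptions made for the sketch.

```python
import numpy as np

def gaussian_params(x, W, min_std=0.1):
    """Map features to the mean and std of a diagonal Gaussian."""
    mean, raw_std = np.split(x @ W, 2)
    std = np.log1p(np.exp(raw_std)) + min_std   # softplus keeps std positive
    return mean, std

rng = np.random.default_rng(0)
H, E, S = 8, 6, 4   # recurrent state, observation embedding, latent sizes (arbitrary)
W_prior = rng.normal(0.0, 0.1, (H, 2 * S))      # stand-in for the prior network
W_post = rng.normal(0.0, 0.1, (H + E, 2 * S))   # stand-in for the posterior network

h_t = rng.normal(size=H)   # deterministic state
e_t = rng.normal(size=E)   # encoder features of observation o_t

# Posterior q(s_t | h_t, o_t): used while real observations are available.
post_mean, post_std = gaussian_params(np.concatenate([h_t, e_t]), W_post)
s_post = post_mean + post_std * rng.standard_normal(S)

# Prior p(s_t | h_t): used during imagination, when no observation exists.
prior_mean, prior_std = gaussian_params(h_t, W_prior)
s_prior = prior_mean + prior_std * rng.standard_normal(S)
```

Both samples use the reparameterization mean + std * noise, which is what lets gradients pass through the sampling step in an autodiff framework.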
03

Observation Encoder & Decoder

These components bridge the high-dimensional observation space (e.g., images) and the low-dimensional latent space.

  • Encoder (Recognition Model): A neural network (e.g., CNN) that compresses a raw observation o_t into parameters (mean and variance) for the posterior distribution of the stochastic latent: q(s_t | h_t, o_t).
  • Decoder (Observation Model): A neural network that reconstructs or predicts the observation from the latent state: p(o_t | h_t, s_t). Its reconstruction loss is a key part of the RSSM's training objective, encouraging the latent state to retain the information needed to reproduce the observation.
  • Function: Enables the model to work with pixels directly, learning a semantically meaningful latent state without hand-engineered features.
04

Reward & Continue Predictors

These are simple MLP heads attached to the latent state that allow the model to simulate the outcomes of imagined actions.

  • Reward Model: Predicts the expected immediate reward for a given latent state: r_t ~ p(r_t | h_t, s_t). This turns the RSSM into a reward-predictive world model.
  • Terminal (Continue) Model: Predicts the probability of an episode terminating (or continuing) in a given latent state: c_t ~ p(c_t | h_t, s_t). This is crucial for accurate long-horizon value estimation in episodic environments.
  • Critical for Planning: During imagination, these predictors allow the agent to evaluate the desirability of simulated trajectories without interacting with the real environment.
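One common way to combine the two heads, shown here as an assumption rather than a description of any specific implementation, is to treat the predicted continue probability as an extra per-step discount. The function name `imagined_return` and the default `gamma` are illustrative.

```python
def imagined_return(rewards, continues, gamma=0.99):
    """Discounted return of an imagined trajectory.

    rewards[t] is the predicted reward at step t and continues[t] is the
    predicted probability that the episode continues after step t.
    """
    total, weight = 0.0, 1.0
    for r, c in zip(rewards, continues):
        total += weight * r       # reward counts only if the episode is still running
        weight *= gamma * c       # later steps are discounted by gamma and by c
    return total

# A predicted termination after the second step removes the third reward.
full = imagined_return([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5)
cut = imagined_return([1.0, 1.0, 1.0], [1.0, 0.0, 1.0], gamma=0.5)
```

Here `full` evaluates to 1.75 and `cut` to 1.5, which shows how a predicted termination truncates the value of a simulated trajectory.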
05

Training via ELBO & KL Balancing

The RSSM is trained end-to-end to maximize a variational lower bound (ELBO) on the log-likelihood of observed sequences. The loss function has several key terms:

  • Reconstruction Term: log p(o_t | h_t, s_t) - Maximized (its negative is the reconstruction loss) so that the latent state contains the information needed to decode the observation.
  • KL Divergence Loss: D_KL( q(s_t | h_t, o_t) || p(s_t | h_t) ) - Regularizes the posterior (from encoder) to stay close to the prior (from dynamics).
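  • KL Balancing: A practical technique that treats the two sides of the KL term separately, using stop-gradients so that one copy trains the prior toward the posterior and the other regularizes the posterior toward the prior, with the prior side typically weighted more heavily. Related tricks scale the whole KL term (e.g., by 0.1 to 0.5) or leave it unpenalized below a threshold ("free bits" or "free nats"). Together these guard against the model ignoring the latent variables (posterior collapse) or the observations (over-regularization).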
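The terms above can be made concrete for a single time step. The sketch below assumes diagonal Gaussian prior and posterior and uses a free-nats threshold, below which the KL term is not penalized further; the names `kl_diag_gaussian` and `step_loss` and the default values of `kl_scale` and `free_nats` are illustrative assumptions. The stop-gradient part of KL balancing needs an autodiff framework and is not shown.

```python
import math

def kl_diag_gaussian(mu_q, std_q, mu_p, std_p):
    """D_KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5
    return total

def step_loss(recon_log_prob, kl, kl_scale=0.1, free_nats=1.0):
    """Negative ELBO for one step with a scaled, clipped KL term."""
    # max(...) means KL values below free_nats contribute a constant, so the
    # optimizer has no incentive to push the posterior all the way onto the prior.
    return -recon_log_prob + kl_scale * max(kl, free_nats)

kl_same = kl_diag_gaussian([0.0], [1.0], [0.0], [1.0])      # identical distributions
kl_shifted = kl_diag_gaussian([1.0], [1.0], [0.0], [1.0])   # means differ by 1
loss_small_kl = step_loss(recon_log_prob=-2.0, kl=kl_shifted)
loss_large_kl = step_loss(recon_log_prob=-2.0, kl=3.0)
```

With these inputs the KL values are 0 and 0.5, and the two losses are 2.1 (KL clipped up to the 1.0 threshold) and 2.3 (KL above the threshold, scaled by 0.1).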
06

Latent Imagination for Planning

This is the primary use case of the trained RSSM. Instead of planning in pixel space, the agent's policy and value functions are trained entirely on imagined trajectories within the latent space.

  • Process: Start from a real encoded state (h_t, s_t). For K steps, sample an action from the policy, predict the next deterministic state h_{t+1}, sample a prior latent s_{t+1} ~ p(s_{t+1} | h_{t+1}), and predict reward/continue. This creates a latent trajectory.
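  • Efficiency: Because the rollouts are compact and differentiable, gradients can be backpropagated through time (BPTT) along imagined trajectories. No additional environment interaction is needed for these updates, which makes policy optimization highly sample-efficient.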
  • Algorithm Example: The Dreamer series of algorithms uses this exact paradigm: learn an RSSM (world model), then train an actor-critic agent using gradients derived from imagined latent rollouts.
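The rollout process described above can be sketched as a short loop. This is a toy illustration, not Dreamer's code: the random matrices stand in for trained networks, a plain tanh recurrence replaces the GRU for brevity, and every name (`imagine`, `W_rnn`, `W_prior`, `W_policy`, `w_reward`, `w_cont`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 8   # latent, action, and hidden sizes (arbitrary)

# Randomly initialised stand-ins for trained networks (all hypothetical).
W_rnn = rng.normal(0.0, 0.1, (H + S + A, H))
W_prior = rng.normal(0.0, 0.1, (H, 2 * S))
W_policy = rng.normal(0.0, 0.1, (H + S, A))
w_reward = rng.normal(0.0, 0.1, H + S)
w_cont = rng.normal(0.0, 0.1, H + S)

def imagine(h, s, horizon):
    """Roll the model forward for `horizon` steps without any observations."""
    trajectory = []
    for _ in range(horizon):
        feat = np.concatenate([h, s])
        # Sample an action from a noisy tanh policy.
        a = np.tanh(feat @ W_policy + 0.3 * rng.standard_normal(A))
        # Next deterministic state from (h, s, a).
        h = np.tanh(np.concatenate([h, s, a]) @ W_rnn)
        # Sample the next latent from the prior p(s | h).
        mean, raw_std = np.split(h @ W_prior, 2)
        std = np.log1p(np.exp(raw_std)) + 0.1
        s = mean + std * rng.standard_normal(S)
        # Predict reward and continue probability from the new state.
        feat = np.concatenate([h, s])
        reward = float(feat @ w_reward)
        cont = float(1.0 / (1.0 + np.exp(-(feat @ w_cont))))
        trajectory.append((h, s, a, reward, cont))
    return trajectory

# In practice the start state would come from encoding a real observation.
trajectory = imagine(np.zeros(H), np.zeros(S), horizon=5)
```

Each entry of `trajectory` holds the latent state, the action taken, and the predicted reward and continue probability, which is the information an actor-critic learner would consume.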
TRAINING AND INFERENCE

How RSSM Works: The Training and Inference Process

The Recurrent State-Space Model (RSSM) is trained to compress high-dimensional observations into a latent representation and predict future states, enabling efficient planning through latent imagination.

Training an RSSM involves learning several components jointly from environment interaction data: an encoder that maps raw observations to a stochastic latent state, a deterministic recurrent model (like a GRU) that tracks temporal dependencies, a prior that predicts the stochastic latent from the recurrent state alone, and a decoder that reconstructs observations, along with the reward and continue predictors. The model is trained via variational inference to maximize the evidence lower bound (ELBO), jointly optimizing for accurate reconstruction and predictive consistency of future latent states.
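A training pass over one recorded sequence can be sketched as a filtering loop that, at every step, updates the recurrent state, forms the prior and the posterior, samples from the posterior, and records the KL between the two. The sketch assumes diagonal Gaussian latents and precomputed encoder features; the names `observe`, `gaussian`, and `kl`, the tanh recurrence, and the random weights are illustrative, and the gradient step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
E, S, A, H = 6, 4, 2, 8   # embedding, latent, action, hidden sizes (arbitrary)

# Random stand-ins for the recurrent, prior, and posterior networks.
W_rnn = rng.normal(0.0, 0.1, (H + S + A, H))
W_prior = rng.normal(0.0, 0.1, (H, 2 * S))
W_post = rng.normal(0.0, 0.1, (H + E, 2 * S))

def gaussian(x, W):
    mean, raw_std = np.split(x @ W, 2)
    return mean, np.log1p(np.exp(raw_std)) + 0.1   # softplus plus a minimum std

def kl(mq, sq, mp, sp):
    """D_KL(q || p) for diagonal Gaussians."""
    return float(np.sum(np.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5))

def observe(embeds, prev_actions):
    """Filter a sequence: returns per-step (h_t, s_t) and KL(posterior || prior)."""
    h, s = np.zeros(H), np.zeros(S)
    states, kls = [], []
    for e, a_prev in zip(embeds, prev_actions):
        h = np.tanh(np.concatenate([h, s, a_prev]) @ W_rnn)   # deterministic update
        mp, sp = gaussian(h, W_prior)                         # prior p(s_t | h_t)
        mq, sq = gaussian(np.concatenate([h, e]), W_post)     # posterior q(s_t | h_t, o_t)
        s = mq + sq * rng.standard_normal(S)                  # posterior sample
        kls.append(kl(mq, sq, mp, sp))
        states.append((h, s))
    return states, kls

T = 5
states, kls = observe(rng.normal(size=(T, E)), rng.normal(size=(T, A)))
```

In a full training step, the states would be decoded to score the reconstruction, and the summed reconstruction and KL terms would be differentiated with respect to all weights.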

During inference, the trained RSSM serves as a latent dynamics model for planning. Algorithms like Dreamer use it to generate imagined rollouts entirely in latent space. Starting from an encoded state, the RSSM is unrolled by sampling actions from a policy network to produce sequences of future latent states and predicted rewards, which are then used to train the policy and value functions via backpropagation through time (BPTT), all without further real environment interaction.

RECURRENT STATE-SPACE MODEL (RSSM)

Frequently Asked Questions

A Recurrent State-Space Model (RSSM) is a core architecture for learning world models in model-based reinforcement learning. These questions address its mechanics, role, and practical applications.

A Recurrent State-Space Model (RSSM) is a latent dynamics model architecture that combines a deterministic recurrent neural network with a stochastic latent variable to model temporal dependencies in sequential data. It works by processing a sequence of observations (e.g., images from a robot's camera) and actions to infer a compact latent state. This latent state is decomposed into a deterministic component, maintained by a Gated Recurrent Unit (GRU) or LSTM, which tracks information from the entire history, and a stochastic component, which represents the unpredictable aspects of each time step. The model learns to predict the next latent state and the corresponding observation, enabling an agent to "imagine" or plan future trajectories in this learned abstract space rather than the high-dimensional raw observation space.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. For more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.