A latent diffusion model is a generative model that applies the diffusion process—a Markov chain that gradually adds noise to data and then learns to reverse it—within a compressed latent space. This space is typically learned by an autoencoder, such as a Variational Autoencoder (VAE) or Vector-Quantized VAE (VQ-VAE), which encodes inputs into a lower-dimensional representation. Operating in this efficient latent space drastically reduces computational cost compared to pixel-space diffusion, making high-resolution generation feasible. The core denoising model, often a U-Net, is trained to predict and remove noise from these latent representations.
