Stable Diffusion is a latent diffusion model that generates high-resolution images from text prompts by running the denoising process in a compressed latent space rather than directly on pixels. A variational autoencoder (VAE) encodes images into a lower-dimensional latent representation, the diffusion process iteratively denoises in that space, and the VAE decoder maps the final latent back to an image. Because each denoising step operates on far fewer values than a full-resolution image contains, the process is substantially cheaper than pixel-space diffusion. Text conditioning enters through cross-attention layers in the U-Net denoiser, which attend to an encoded representation of the prompt at every step, allowing the model to interpret and render complex semantic concepts.
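
The efficiency argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names (`LatentShape`, `latent_shape_for`, `compression_ratio`) are hypothetical, and the default values (a spatial downsampling factor of 8 and 4 latent channels) are assumptions that match the commonly cited configuration of the first public Stable Diffusion releases rather than figures stated in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Shape of a latent tensor as (channels, height, width)."""
    channels: int
    height: int
    width: int

    @property
    def num_values(self) -> int:
        # Total number of scalars the denoiser processes per step.
        return self.channels * self.height * self.width


def latent_shape_for(image_height: int, image_width: int,
                     downsample_factor: int = 8,   # assumed VAE stride
                     latent_channels: int = 4      # assumed latent depth
                     ) -> LatentShape:
    """Return the latent shape a VAE with the given stride would produce."""
    if image_height % downsample_factor or image_width % downsample_factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return LatentShape(latent_channels,
                       image_height // downsample_factor,
                       image_width // downsample_factor)


def compression_ratio(image_height: int, image_width: int,
                      image_channels: int = 3,
                      downsample_factor: int = 8,
                      latent_channels: int = 4) -> float:
    """Ratio of pixel-space values to latent-space values for one image."""
    pixel_values = image_channels * image_height * image_width
    latent = latent_shape_for(image_height, image_width,
                              downsample_factor, latent_channels)
    return pixel_values / latent.num_values


if __name__ == "__main__":
    shape = latent_shape_for(512, 512)
    print(shape, shape.num_values, compression_ratio(512, 512))
```

Under these assumptions, a 512×512 RGB image (786,432 values) maps to a 4×64×64 latent (16,384 values), so each denoising step touches 48 times fewer values. The value count is only a rough proxy for compute, since actual cost also depends on the U-Net's depth and attention layers.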
