Latent diffusion models (LDMs) (Rombach et al. 2022, the basis of Stable Diffusion) run the diffusion process in a compressed latent space learned by a pretrained VAE rather than in pixel space. The encoder E maps x → z (e.g., 512×512×3 → 64×64×4), shrinking the data by roughly 48× and making each denoising step correspondingly cheaper. A U-Net then iteratively denoises z_t → z_{t−1}, conditioned on text via cross-attention, and the decoder D reconstructs x̂ = D(z). Classifier-free guidance (Ho & Salimans 2022) blends the conditional and unconditional noise predictions, ε̃_θ = ε_uncond + w·(ε_cond − ε_uncond), trading sample diversity for fidelity as the guidance scale w increases. The full pipeline is: encode, denoise in latent space, decode.
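The ~48× figure follows directly from the shapes quoted above; a quick check:

```python
# Dimensionality reduction from pixel space to latent space,
# using the shapes quoted in the text (f=8 VAE): 512x512x3 -> 64x64x4.
pixel_dims = 512 * 512 * 3    # 786,432 values per image
latent_dims = 64 * 64 * 4     # 16,384 values per latent
ratio = pixel_dims / latent_dims
print(ratio)  # -> 48.0
```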
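The guidance blend is a one-line elementwise operation over the two noise predictions; a minimal sketch (the function name and toy inputs are illustrative, not a library API):

```python
def classifier_free_guidance(eps_cond, eps_uncond, w):
    # Elementwise blend: eps_uncond + w * (eps_cond - eps_uncond).
    # w = 1 recovers the purely conditional prediction; larger w
    # pushes samples toward the prompt at the cost of diversity.
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Toy 1-D example with made-up values (not real model outputs):
print(classifier_free_guidance([1.0, 2.0], [0.5, 1.0], 7.5))  # -> [4.25, 8.5]
```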
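The encode–denoise–decode control flow can be sketched with stub networks; `unet`, `decode`, and the single-line update rule below are hypothetical placeholders standing in for the real models and a proper DDPM/DDIM sampler, not the Stable Diffusion API:

```python
import numpy as np

rng = np.random.default_rng(0)

def unet(z, t, cond):
    """Stub noise predictor (returns zeros); a real U-Net would
    attend to the text condition via cross-attention."""
    return np.zeros_like(z)

def decode(z):
    """Stub VAE decoder: pretend mapping 64x64x4 latents -> 512x512x3 pixels."""
    return np.zeros((512, 512, 3))

def sample(steps=50, w=7.5, cond="a photo of a cat"):
    z = rng.standard_normal((64, 64, 4))   # start from Gaussian noise in latent space
    for t in reversed(range(steps)):
        eps_c = unet(z, t, cond)           # conditional noise prediction
        eps_u = unet(z, t, None)           # unconditional prediction
        eps = eps_u + w * (eps_c - eps_u)  # classifier-free guidance blend
        z = z - eps / steps                # toy update; real samplers use a DDPM/DDIM rule
    return decode(z)                       # map the final latent back to pixel space

img = sample()
print(img.shape)  # -> (512, 512, 3)
```

The point of the sketch is the structure: all iteration happens on the small latent tensor, and the expensive pixel-resolution decoder runs exactly once.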