
How Variational Autoencoders (VAE) Work: ELBO, Latent Space, and the Reparameterization Trick

10 March 2026 · 24 min read

Variational Autoencoders (VAEs) represent a pivotal advancement in the field of artificial intelligence, specifically within the realm of Generative AI. The term "Generative AI" is often associated narrowly with tools like large language models, obscuring its much broader scope. At its core, Generative AI distinguishes itself from traditional AI by its ability to create novel data from scratch, rather than merely processing or analyzing existing information. VAEs are a foundational technique in this creative process: they can generate entirely new content, such as never-before-seen images, and serve as integral components of advanced generative architectures like Stable Diffusion.


The Challenge with Standard Autoencoders for Generation

Autoencoder Fundamentals

An autoencoder is a type of neural network for unsupervised learning that learns an efficient, compressed encoding of its input data. It comprises three main parts:

Encoder: Takes high-dimensional input data (e.g. an image) and compresses it into a low-dimensional latent representation z - a vector of numbers capturing the most salient features of the input.

Latent Space: The conceptual space where latent representations reside, with each point corresponding to a potential data instance.

Decoder: Takes the latent representation and reconstructs the original input, aiming for an output as close as possible to the original.

graph TD A["Input Data (e.g. Image)"] --> B["Encoder"] B --> C["Latent Representation z"] C --> D["Decoder"] D --> E["Reconstructed Data"] style B fill:#f9c,stroke:#c66,stroke-width:2px style D fill:#f9c,stroke:#c66,stroke-width:2px style C fill:#ccf,stroke:#66b,stroke-width:2px

Limitations for Generative Tasks

While standard autoencoders excel at compression, they fall short for generating novel content due to several inherent issues:

  • Disorganized Latent Space: No constraints ensure that nearby latent points decode to semantically similar data. Large regions may produce meaningless outputs.
  • Poor Generalization: Randomly sampling from a trained autoencoder's latent space typically yields noisy, distorted, or unrecognizable outputs.
  • Lack of Meaningful Variation: Sampling near a known latent point usually just reconstructs the same image rather than generating meaningfully different yet coherent images.

Conclusion: Standard autoencoders are effective for reconstruction, but their latent spaces are not structured for reliable generation of new and coherent data.


The Solution: Organizing the Latent Space with Variational Autoencoders

The primary goal of Variational Autoencoders (VAEs) is to create a nicely organized and continuous latent space, ensuring that:

  1. Points sampled from it consistently yield coherent and plausible new data instances.
  2. Smooth transitions between latent points produce smooth, semantically meaningful transitions in the generated data.

This regularization transforms an autoencoder from a mere data compressor into a powerful generative model.


The Origins of VAEs: Merging Deep Learning and Bayesian Statistics

VAEs were introduced by Diederik P. Kingma and Max Welling in the seminal 2013 paper "Auto-Encoding Variational Bayes." Kingma is also known for co-developing the Adam optimizer.

VAEs merge two powerful paradigms - Deep Learning and Bayesian Statistics - leveraging neural network representation learning while grounding the generative process in probabilistic principles.


Fundamental Bayesian Statistics for VAEs

Random Variable X: A variable whose value is subject to random variations due to chance.

Probability Density Function p(x): Describes the relative likelihood for X to take on a given value x.

Expectation 𝔼[X]: The average value expected when sampling from a probability distribution.

Joint Probability p(x, z): Probability of two variables X and Z occurring together simultaneously.

Marginal Distribution: Distribution of a single variable obtained by integrating the joint over all values of the other:

p(x) = ∫ p(x, z) dz

Likelihood p(x|z): Probability of observing data x given latent variable z. The decoder learns to model this.

Posterior p(z|x): Probability of latent variable z given observed data x. The encoder approximates this.
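
As a toy illustration of expectation and marginalization (a hedged example, not from the article): if p(z) = 𝒩(0, 1) and p(x|z) = 𝒩(z, 1), the marginal p(x) = ∫ p(x|z) p(z) dz works out to 𝒩(0, 2), which we can verify by Monte Carlo averaging over samples of z:

```python
# Monte Carlo check that p(x) = E_{p(z)}[p(x|z)] for a simple Gaussian model.
# The model p(z) = N(0,1), p(x|z) = N(z,1) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 1.0
z = rng.standard_normal(100_000)           # samples z ~ p(z) = N(0, 1)
mc_estimate = gauss_pdf(x, z, 1.0).mean()  # E_{p(z)}[p(x|z)]
exact = gauss_pdf(x, 0.0, 2.0)             # closed form: N(0, 1 + 1)
print(mc_estimate, exact)                  # both ≈ 0.219
```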


The General Principle of VAEs: A Probabilistic Approach

The Generative Goal

Generate new data x that appears to come from the true (unknown) data distribution p(x).

Introducing a Latent Distribution

VAEs hypothesize that observed data x is generated from a simpler latent distribution p(z), typically a standard normal (Gaussian):

p(z) = 𝒩(0, I)

By learning to transform samples from p(z) into realistic x's, we can generate new data.

graph TD A["Sample z from p(z) = Normal(0,I)"] --> B["Decoder (Neural Network)"] B --> C["Generated new data x"] style A fill:#ccf,stroke:#66b,stroke-width:2px style B fill:#f9c,stroke:#c66,stroke-width:2px

The Intractable Posterior Problem

To connect observed data x back to latent space z, we need the posterior p(z|x). However, it is almost always intractable to compute directly for complex, high-dimensional data.

Variational Bayes to the Rescue

Instead of computing exact p(z|x), VAEs approximate it with a learnable distribution qφ(z|x) - typically a Gaussian with learnable mean μ and standard deviation σ:

q_φ(z|x) = 𝒩(μ(x), σ²(x) · I)

The Autoencoder Structure in VAEs

Encoder (Inference Network): Takes input x and outputs the parameters μ and σ that define qφ(z|x). Maps images to their approximate latent distributions.

Decoder (Generative Network): Takes a sample z ~ qφ(z|x) and reconstructs x'. Learns the likelihood pθ(x|z).

graph TD A["Input Data x"] --> B["Encoder (Neural Network)"] B --> C1["Mean mu"] B --> C2["Log-Variance log(sigma squared)"] C1 --> D["Reparameterization Trick"] C2 --> D D --> E["Sampled Latent Vector z"] E --> F["Decoder (Neural Network)"] F --> G["Reconstructed Data x-prime"] style B fill:#f9c,stroke:#c66,stroke-width:2px style F fill:#f9c,stroke:#c66,stroke-width:2px style D fill:#ccf,stroke:#66b,stroke-width:2px style E fill:#ccf,stroke:#66b,stroke-width:2px

How VAEs are Trained

VAE training uses a dual objective captured by the Evidence Lower Bound (ELBO):

  1. Latent Space Regularization: KL divergence forces qφ(z|x) to resemble the prior p(z), producing a continuous, well-structured latent space.
  2. Reconstruction Accuracy: Trains the decoder to accurately reconstruct input x from sampled z (using a loss like Mean Squared Error).

The Evidence Lower Bound (ELBO) - The VAE Training Objective

The ELBO provides a tractable lower bound on the intractable log-likelihood log p(x):

ℒ(x) = 𝔼_{qφ(z|x)}[log pθ(x|z)] − DKL(qφ(z|x) ‖ p(z))

i.e., ℒ(x) = Reconstruction Term − Regularization Term

Where:

  • ℒ(x) – The Evidence Lower Bound (ELBO) for data point x
  • 𝔼_{qφ(z|x)} – Expectation over the approximate posterior distribution
  • pθ(x|z) – Decoder likelihood: probability of data x given latent z
  • DKL – Kullback-Leibler Divergence measuring distributional distance
  • qφ(z|x) – Approximate posterior distribution (encoder output)
  • p(z) – Prior distribution over latent variables, typically 𝒩(0, I)

1. Data Consistency Term (Reconstruction Loss)

𝔼_{qφ(z|x)}[log pθ(x|z)]
  • Measures how well the decoder can reconstruct x from z ~ qφ(z|x).
  • When pθ(x|z) is Gaussian (continuous data like image pixels), this simplifies to Mean Squared Error (L2 loss).
  • Purpose: Ensures the VAE encodes and decodes information faithfully.

2. Regularization Term (KL Divergence)

DKL(qφ(z|x) ‖ p(z))
  • Measures the statistical distance between the approximate posterior qφ(z|x) and the prior p(z).
  • Purpose: Forces the encoder to produce a latent space that is well-structured and continuous, resembling the prior 𝒩(0, I).
| Term | Mathematical Expression | Function in VAE | Impact on Latent Space |
| --- | --- | --- | --- |
| Reconstruction Loss | 𝔼_{qφ(z∣x)}[log pθ(x∣z)] | Decoder accurately reconstructs the input data. | Drives output similarity to input. |
| Regularization (KL) | DKL(qφ(z∣x) ‖ p(z)) | Encoded distributions conform to a predefined prior. | Promotes continuity and organization, enabling generation. |

Overall Interpretation: The ELBO acts as a regularized reconstruction loss - demanding accurate reconstruction while compelling latent representations to adhere to a smooth probabilistic structure.
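
Putting the two terms together, a hedged sketch of the training loss (the negative ELBO, which we minimize) might look like this; it assumes a Gaussian decoder, so the reconstruction term reduces to a squared error, and it uses the closed-form Gaussian KL given in the training section below:

```python
# A minimal negative-ELBO loss sketch: minimizing this maximizes the ELBO.
import torch

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)] up to constants -> sum of squares
    recon = ((x - x_hat) ** 2).sum(dim=1)
    # Regularization term: D_KL( N(mu, sigma^2) || N(0, I) ), closed form
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (recon + kl).mean()  # average over the batch
```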


The Reparameterization Trick: Enabling Differentiability

Directly sampling z ~ qφ(z|x) is a non-differentiable operation - gradients cannot flow back to update encoder parameters μ and σ. The reparameterization trick solves this.

The Problem

Computing gradients through a stochastic sampling step breaks the computational graph because the sampling itself has no well-defined gradient.

The Solution

Re-express z as a deterministic function of μ, σ, and an external noise variable ε:

  1. Sample noise from a fixed standard normal:
    ε ~ 𝒩(0, I)
  2. Compute z deterministically:
    z = μ + σ ⊙ ε
graph TD A["Encoder outputs mu, sigma"] --> D["z = mu + sigma times epsilon"] B["Sample epsilon from Normal(0,I)"] --> D D --> E["Sampled Latent Vector z"] style A fill:#ccf,stroke:#66b,stroke-width:2px style B fill:#cfc,stroke:#6a6,stroke-width:2px style D fill:#f9c,stroke:#c66,stroke-width:2px

The Benefit

This makes z a deterministic function of the learnable parameters μ and σ, plus the parameter-free noise ε. Gradients can now flow back from the decoder loss to update the encoder - end-to-end training via backpropagation is restored.


VAE Training and Inference in Practice

Training Process

  1. Encoder Input & Output: Input x is encoded to produce μ (mean) and log(σ²) (log-variance). The standard deviation is recovered as:
    σ = exp(½ · log(σ²))
  2. Latent Space Sampling: Sample z via the reparameterization trick:
    z = μ + σ ⊙ ε,  where ε ~ 𝒩(0, I)
  3. Decoder Reconstruction: Sampled z is passed to the decoder to produce x'.
  4. Loss Calculation:
    • KL Divergence Loss (closed form for Gaussians):
      DKL(𝒩(μ, σ²) ‖ 𝒩(0, 1)) = −½ · Σⱼ(1 + log(σⱼ²) − μⱼ² − σⱼ²)
    • Reconstruction Loss:
      ℒ_recon = ‖x − x'‖²
  5. Optimization: The combined loss is backpropagated to update both encoder and decoder weights via an optimizer such as Adam.
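
The following sketch ties these five steps into a single training iteration. The network sizes, learning rate, and random stand-in batch are assumptions for illustration:

```python
# One full VAE training step: encode, reparameterize, decode, loss, update.
import torch
import torch.nn as nn

latent_dim = 32
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

x = torch.rand(64, 784)                       # stand-in for a data batch

# 1. Encode: split the encoder output into mu and log(sigma^2)
mu, logvar = encoder(x).chunk(2, dim=1)
# 2. Sample z with the reparameterization trick
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
# 3. Decode
x_hat = decoder(z)
# 4. Loss = reconstruction + KL divergence (closed form for Gaussians)
recon = ((x - x_hat) ** 2).sum(dim=1)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
loss = (recon + kl).mean()
# 5. Backpropagate and update encoder and decoder jointly
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```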

Inference (Generation) Process

  1. Sample from Prior: Draw z ~ 𝒩(0, I).
  2. Decode: Pass z through the trained decoder.
  3. Generate: The decoder outputs a completely new, coherent data instance.
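
A corresponding generation sketch (the decoder here is an untrained stand-in with the shapes used above; in practice you would use the trained one):

```python
# Generation: sample z from the prior N(0, I) and decode it.
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

z = torch.randn(16, 32)   # 1. sample z ~ N(0, I)
samples = decoder(z)      # 2.-3. decode into new data instances
print(samples.shape)      # torch.Size([16, 784])
```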

Latent Space Evolution and Generation Capabilities

Latent Space Evolution

  • Initially: Encoder output distributions qφ(z|x) are tightly clustered; the latent space is largely undifferentiated.
  • During Training: The KL divergence term acts as a “spreading force,” pushing representations of different inputs apart and conforming them to 𝒩(0, I).
  • At the End of Training: The latent space becomes compact, continuous, and well-organized. Semantically similar data points map to nearby regions.
graph LR A["Initial: Clustered Latent Points"] --> B["KL Divergence Term"] B --> C["Pushes distributions apart"] C --> D["Conforms to Prior Normal(0,I)"] D --> E["Final: Organized and Continuous"]

VAEs' Generative Capabilities

  • Reconstruction: Accurately encodes inputs to latent representations and decodes them back.
  • Random Sampling (Novel Generation): Sampling random z ~ p(z) and passing through the decoder generates entirely new images resembling the training distribution.
  • Interpolation (Latent Space Arithmetic): The continuity of the latent space allows smooth interpolation between two images:
    1. Encode images x1 and x2 to obtain latent means μ1 and μ2.
    2. Interpolate between μ1 and μ2 to get intermediate latent vectors.
    3. Decode these vectors to obtain a smooth morphing sequence from x1 to x2.
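
A minimal sketch of this interpolation procedure, using untrained stand-in networks with the shapes assumed throughout:

```python
# Latent interpolation: encode two inputs to their latent means, blend
# linearly, and decode each intermediate point into a morphing frame.
import torch
import torch.nn as nn

latent_dim = 32
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

x1, x2 = torch.rand(1, 784), torch.rand(1, 784)
mu1, _ = encoder(x1).chunk(2, dim=1)        # latent mean of x1
mu2, _ = encoder(x2).chunk(2, dim=1)        # latent mean of x2

frames = [decoder((1 - t) * mu1 + t * mu2)  # linear blend in latent space
          for t in torch.linspace(0, 1, steps=8)]
print(len(frames), frames[0].shape)         # 8 torch.Size([1, 784])
```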

Limitations of Basic Variational Autoencoders

  • Blurry Image Generation:
    • VAEs often produce blurry or smooth images, especially for complex datasets.
    • Attributed to the L2 reconstruction loss, which encourages predicting the average of possible pixel values.
    • Compared to GANs and Diffusion Models, VAE outputs lack sharpness and fine details.
  • Lack of Control (Unconditional Generation):
    • Basic VAEs perform unconditional generation - it is difficult to generate a specific type of image or one with desired attributes without modifications.

Advanced VAE Models to Overcome Limitations

CVAE (Conditional VAE): Introduces a conditioning variable (e.g. a class label) into both encoder and decoder, enabling generation of outputs with a specified class or attribute.

β-VAE (Beta-VAE): Introduces a tunable parameter β scaling the KL divergence:

ℒ(x) = 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − β · DKL(q_φ(z|x) ‖ p(z))

Higher β yields a more disentangled latent space where individual dimensions correspond to distinct interpretable features, at some cost to reconstruction fidelity.
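
In code, β-VAE is essentially a one-line change to the loss sketched earlier (β = 4 below is an arbitrary illustrative choice):

```python
# beta-VAE loss: identical to the VAE loss except the KL term is scaled
# by beta; beta > 1 trades reconstruction fidelity for disentanglement.
import torch

def beta_vae_loss(x, x_hat, mu, logvar, beta: float = 4.0):
    recon = ((x - x_hat) ** 2).sum(dim=1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (recon + beta * kl).mean()
```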

VQ-VAE (Vector Quantized VAE): Uses a discrete latent space - each continuous encoder output is snapped to the nearest vector in a learned codebook. Combined with a learned autoregressive prior over the discrete codes, this produces sharper, higher-quality outputs, addressing the blurriness issue.
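
The heart of the discrete bottleneck can be sketched as a nearest-neighbor lookup into the codebook (this omits the straight-through gradient estimator and the codebook/commitment losses of the full method):

```python
# Vector quantization step: snap encoder outputs to nearest codebook vectors.
import torch

num_codes, latent_dim = 512, 32
codebook = torch.randn(num_codes, latent_dim)  # learned in a real VQ-VAE

z_e = torch.randn(8, latent_dim)               # continuous encoder output
distances = torch.cdist(z_e, codebook)         # pairwise L2 distances
indices = distances.argmin(dim=1)              # index of nearest code
z_q = codebook[indices]                        # quantized latent vectors
print(indices.shape, z_q.shape)                # torch.Size([8]) torch.Size([8, 32])
```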


Understanding the ELBO Derivation

The goal is to maximize log p(x), the log-probability of the data.

  1. Start with the Log-Likelihood:
    log p(x) = log ∫ p(x, z) dz
  2. Introduce the Approximate Posterior - multiply and divide by qφ(z|x):
    log p(x) = log ∫ q_φ(z|x) · [p(x, z) / q_φ(z|x)] dz
  3. Rewrite as an Expectation:
    log p(x) = log 𝔼_{q_φ(z|x)}[p(x, z) / q_φ(z|x)]
  4. Apply Jensen’s Inequality (log is concave, so log 𝔼[Y] ≥ 𝔼[log Y]):
    log p(x) ≥ 𝔼_{q_φ(z|x)}[log(p(x, z) / q_φ(z|x))] =: ℒ(x)
  5. Define the ELBO: The right-hand side is ℒ(x). Maximizing it maximizes a lower bound on log p(x).
  6. Decompose using log(A/B) = log A − log B:
    ℒ(x) = 𝔼_{q_φ(z|x)}[log p(x, z) − log q_φ(z|x)]
  7. Factor the Joint with the product rule p(x, z) = pθ(x|z) · p(z):
    ℒ(x) = 𝔼_{q_φ(z|x)}[log p_θ(x|z) + log p(z) − log q_φ(z|x)]
  8. Rearrange - noting that 𝔼_{q_φ(z|x)}[log p(z) − log q_φ(z|x)] = −DKL(q_φ(z|x) ‖ p(z)):
    ℒ(x) = 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − DKL(q_φ(z|x) ‖ p(z))

This final form clearly shows the two components: the reconstruction term encouraging data fidelity, and the regularization term enforcing a structured latent space.


Conclusion

Variational Autoencoders are a cornerstone of modern Generative AI, offering a probabilistic framework to learn rich, continuous, and semantically meaningful latent representations of data. By addressing the disorganized latent space of standard autoencoders through a blend of deep learning and Bayesian statistics, VAEs enable generation of novel, coherent data instances, smooth interpolation between existing ones, and exploration of a smooth generative manifold. While basic VAEs may struggle with sharp image generation compared to GANs or Diffusion Models, their foundational principles - the ELBO and the reparameterization trick - remain invaluable and have paved the way for more advanced generative architectures.


Further Reading

  • Probabilistic Graphical Models
  • Deep Generative Models
  • Kullback-Leibler Divergence
  • Generative Adversarial Networks (GANs)
  • Diffusion Models
  • Information Theory in Machine Learning