Latent Variable Models - Variational Autoencoders (VAEs)

Published on Tuesday, 02-09-2025

#Tutorials

(Adapted from MIT 6.S191)


Tutorial: Understanding Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) are one of the most important foundations of modern Generative AI. They combine the strengths of classical autoencoders with probabilistic modeling to create smooth, meaningful latent spaces from which new data samples can be generated.

This tutorial builds upon your knowledge of autoencoders and introduces why they fail as generative models, how VAEs solve the problem, and what key ideas like KL divergence and reparameterization mean in practice.


1. Recap: Autoencoders

An autoencoder (AE) is a neural network that learns to:

  • Encode data $x$ into a low-dimensional latent representation $z$.
  • Decode $z$ back into a reconstruction $\hat{x}$.

The training objective is to minimize reconstruction error:

$$\mathcal{L}_{AE} = \| x - \hat{x} \|^2$$

This makes autoencoders excellent at compression and denoising, but not necessarily at generation.
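As a reference point, a minimal autoencoder in PyTorch might look like the following (the layer sizes are illustrative choices, not from the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder: x -> z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Decoder: z -> x_hat
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(8, 784)
x_hat = Autoencoder()(x)
loss = ((x - x_hat) ** 2).mean()  # reconstruction error, as in the equation above
```

Nothing here constrains where the latent codes $z$ end up, which is exactly the problem the next section describes.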


2. Why Autoencoders Fail at Sampling

If you sample a random latent vector $z \sim \mathcal{N}(0, I)$ and pass it through the decoder:

  • The output is typically meaningless noise.
  • Reason: the AE only learns to map inputs to some tangled latent blob, without ensuring that the blob aligns with any known probability distribution you could sample from.

Visual intuition:

  • The data manifold is like a narrow road in latent space.
  • Autoencoder latents cluster somewhere in space, but not in a structured way.
  • A random $z$ lands off-road → the decoder has never seen such a $z$ → nonsense.



3. Variational Autoencoder (VAE)

VAEs fix this by adding probabilistic structure to the latent space. The idea is:

  1. Assume a prior distribution on latents (usually a standard Gaussian, $p(z) = \mathcal{N}(0, I)$).
  2. Force the encoder’s posterior distribution $q_\phi(z|x)$ to be close to this prior.

Objective function (the evidence lower bound, or ELBO, which training maximizes):

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)} \big[ \log p_\theta(x|z) \big] - D_{KL}(q_\phi(z|x) \| p(z))$$

  • First term: reconstruction (like the AE).
  • Second term: regularization, keeps the latent space well-behaved.



4. KL Divergence

The Kullback–Leibler (KL) divergence measures how one probability distribution differs from another:

$$D_{KL}(q(z) \| p(z)) = \sum_z q(z) \log \frac{q(z)}{p(z)}$$

(For continuous $z$, as in VAEs, the sum becomes an integral.)

In VAEs:

  • $q(z|x)$: the encoder’s output distribution.
  • $p(z)$: the prior (e.g., a standard Gaussian).

The KL penalty ensures that latent codes are not arbitrary blobs, but instead form a smooth, continuous space aligned with the Gaussian prior.
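For a diagonal Gaussian posterior $q(z|x) = \mathcal{N}(\mu, \sigma^2 I)$ and the standard Gaussian prior, the KL term has a well-known closed form, $D_{KL} = \frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. A quick sanity check in plain Python:

```python
import math

def kl_diag_gaussian(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions
    return 0.5 * sum(m**2 + math.exp(lv) - 1 - lv for m, lv in zip(mu, logvar))

# When the posterior equals the prior, the divergence is zero
print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))  # → 0.0
# Moving the posterior mean away from zero makes it positive
print(kl_diag_gaussian([1.0, 0.0], [0.0, 0.0]))  # → 0.5
```

This is the same closed form that appears (negated) in the `vae_loss` function of the PyTorch example in Section 10.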


5. Reparameterization Trick

Problem: sampling $z \sim q_\phi(z|x)$ is a non-differentiable operation, which blocks backpropagation through the encoder. Solution: the reparameterization trick.

Instead of sampling directly:

$$z \sim \mathcal{N}(\mu, \sigma^2 I)$$

We sample:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This makes $z$ a differentiable function of $\mu$ and $\sigma$, enabling gradient descent.
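A short PyTorch check (shapes are illustrative) that gradients really do flow through $\mu$ and $\sigma$ once $z$ is written as $\mu + \sigma \odot \epsilon$:

```python
import torch

mu = torch.zeros(4, requires_grad=True)
logvar = torch.zeros(4, requires_grad=True)

# The noise is parameter-free; z is a deterministic function of (mu, logvar)
eps = torch.randn(4)
z = mu + torch.exp(0.5 * logvar) * eps

z.sum().backward()
print(mu.grad)      # all ones: dz/dmu = 1
print(logvar.grad)  # 0.5 * sigma * eps; here 0.5 * eps, since sigma = 1
```

Sampling `z = torch.normal(mu, std)` directly would instead produce a tensor with no gradient path back to the encoder parameters.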



6. VAE Computation Graph

  1. Encoder: maps $x \to (\mu, \sigma)$.
  2. Reparameterization: sample $z = \mu + \sigma \odot \epsilon$.
  3. Decoder: maps $z \to \hat{x}$.
  4. Loss: reconstruction + KL regularization.
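The four steps can be traced end-to-end with raw tensors. This sketch uses single linear maps and made-up dimensions purely to show the shapes flowing through the graph (a real model would use the multi-layer networks from Section 10):

```python
import torch

x = torch.rand(8, 784)                  # input batch
W_mu = 0.01 * torch.randn(784, 20)
W_lv = 0.01 * torch.randn(784, 20)
mu, logvar = x @ W_mu, x @ W_lv         # 1. encoder -> (mu, sigma)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # 2. reparameterized sample
W_dec = 0.01 * torch.randn(20, 784)
x_hat = torch.sigmoid(z @ W_dec)        # 3. decoder -> x_hat
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = ((x - x_hat) ** 2).sum() + kl    # 4. reconstruction + KL regularization

print(mu.shape, z.shape, x_hat.shape)
```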

7. Sampling and Latent Perturbations

  • Once trained, you can sample $z \sim \mathcal{N}(0, I)$ and decode it to get new, realistic samples.
  • You can also perturb $z$ slightly to explore variations of the data.
  • A smooth latent space means meaningful interpolations (e.g., morphing between two images).
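Interpolation is just a convex combination of two latent codes; each intermediate $z$ would then be passed through the trained decoder. A sketch with the decoder omitted (the latent dimension is illustrative):

```python
import torch

z1, z2 = torch.zeros(20), torch.ones(20)  # latent codes of two inputs

# Walk along the straight line between z1 and z2 in latent space
path = [(1 - t) * z1 + t * z2 for t in torch.linspace(0, 1, 5)]

print([z[0].item() for z in path])  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because the KL penalty keeps the latent space smooth and continuous, decoding each point on this line yields a gradual morph rather than abrupt jumps.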



8. AE vs VAE Latent Spaces

📊 Comparison: Autoencoder vs VAE Latent Spaces

  • Autoencoder: latents cluster in a tangled, unstructured blob; a randomly drawn $z$ usually decodes to nonsense.
  • VAE: latents form a smooth, continuous space aligned with the Gaussian prior, so a randomly drawn $z$ decodes to a plausible sample.

9. Disentanglement with β-VAE

  • β-VAE introduces a scaling factor $\beta > 1$ on the KL term:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta \, D_{KL}(q_\phi(z|x) \| p(z))$$

  • Effect: stronger regularization → encourages disentangled features (separate factors of variation).
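Relative to the standard VAE loss, the only change is a multiplier on the KL term. A sketch mirroring the `vae_loss` function from the PyTorch example in Section 10 (the value β = 4 is just a common illustrative choice, and the tensors below are dummies):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon_x, x, mu, logvar, beta=4.0):
    # Same reconstruction term as the plain VAE...
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # ...but the KL penalty is scaled by beta > 1 for stronger regularization
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

x = torch.rand(2, 4)
recon_x = torch.sigmoid(torch.randn(2, 4))
mu, logvar = torch.ones(2, 3), torch.zeros(2, 3)
print(beta_vae_loss(recon_x, x, mu, logvar))  # larger than with beta = 1
```

Tuning β trades reconstruction fidelity against how strongly the latent dimensions are pushed toward independent, disentangled factors.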



10. Python Example: Simple VAE in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc2 = nn.Linear(latent_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc2(z))
        return torch.sigmoid(self.fc3(h))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

# Loss function
def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term; expects inputs scaled to [0, 1] (e.g., MNIST pixels)
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

11. Summary

  • Plain autoencoders fail at sampling because latent space is unstructured.
  • VAEs introduce probabilistic latent variables with a Gaussian prior.
  • The KL divergence aligns the latent posterior with the prior.
  • The reparameterization trick enables differentiable sampling.
  • Extensions like β-VAE improve disentanglement of latent features.