Diff-course Lecture 02 - Variational Perspective From VAEs to DDPMs

In this lecture, we introduce diffusion models from the variational perspective. We will present the ideas of autoencoders (AEs), variational autoencoders (VAEs), denoising diffusion probabilistic models (DDPMs), discrete variational autoencoders (DVAEs), and discrete denoising diffusion probabilistic models (D3PMs).

This blog post explains the main content of Chapter 2 of The Principles of Diffusion Models. We first introduce the original autoencoder architecture and then the development of the variational autoencoder. The limitations of the variational autoencoder lead to the development of the diffusion model, i.e., the denoising diffusion probabilistic model (DDPM). Finally, from the perspective of discrete data modeling, we see how the discrete variational autoencoder (discrete VAE) and the discrete denoising diffusion probabilistic model (D3PM) arise analogously.

1. Autoencoder (AE)

The autoencoder model was originally developed by Hinton and Salakhutdinov (Hinton & Salakhutdinov, 2006). In this paper, an AE model is trained as a deep neural network that maps high-dimensional data points $\mathbf{x}$ to low-dimensional codes $\mathbf{z}$. These latent codes are good representations of the original data points and are used for tasks such as classification, regression, document retrieval, and visualization. The model consists of an encoder $g_{\mathbf{\phi}}$ and a decoder $f_{\mathbf{\theta}}$: the encoder maps a data point $\mathbf{x}$ to its code $\mathbf{z} = g_{\mathbf{\phi}}(\mathbf{x})$, and the decoder maps the code back to a reconstruction $f_{\mathbf{\theta}}(\mathbf{z})$.

We assume the data points come from a continuous probability distribution $p_{\text{data}}(\mathbf{x})$. Given a set of data points $\{\mathbf{x}^{(i)}: i=1,2,\ldots,N\}$ sampled i.i.d. from $p_{\text{data}}$, the training procedure of an AE model minimizes the average squared reconstruction error: \(\mathcal{L}_{\text{Vanilla-AE}}(\mathbf{\theta},\mathbf{\phi}) := \frac{1}{N} \sum_{i=1}^N \left\| \mathbf{x}^{(i)} - f_{\mathbf{\theta}}\big( g_{\mathbf{\phi}} ( \mathbf{x}^{(i)} ) \big)\right\|_2^2\)
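As a toy illustration (not from the lecture itself), the objective above can be minimized with plain gradient descent. The sketch below assumes the simplest possible case: a linear encoder $g_{\mathbf{\phi}}(\mathbf{x}) = W_e \mathbf{x}$ and a linear decoder $f_{\mathbf{\theta}}(\mathbf{z}) = W_d \mathbf{z}$ on synthetic data; the variable names are mine.

```python
import numpy as np

# Minimal sketch of the vanilla-AE objective, assuming a linear
# encoder g_phi(x) = We @ x and decoder f_theta(z) = Wd @ z.
rng = np.random.default_rng(0)
N, D, d = 200, 10, 3                               # samples, data dim, code dim

# Synthetic data lying near a 3-dimensional subspace of R^10.
basis, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal basis of subspace
X = rng.normal(size=(N, d)) @ basis.T + 0.01 * rng.normal(size=(N, D))

We = 0.1 * rng.normal(size=(d, D))                 # encoder weights (phi)
Wd = 0.1 * rng.normal(size=(D, d))                 # decoder weights (theta)

def reconstruction_loss(We, Wd):
    """(1/N) * sum_i || x_i - f_theta(g_phi(x_i)) ||_2^2."""
    R = X - X @ We.T @ Wd.T
    return np.mean(np.sum(R ** 2, axis=1))

lr = 0.1
initial = reconstruction_loss(We, Wd)
for _ in range(1000):                              # plain gradient descent
    Z = X @ We.T                                   # codes z = g_phi(x)
    R = X - Z @ Wd.T                               # residuals x - f_theta(z)
    Wd -= lr * (-2.0 / N) * R.T @ Z                # gradient of the loss w.r.t. Wd
    We -= lr * (-2.0 / N) * (R @ Wd).T @ X         # gradient of the loss w.r.t. We
final = reconstruction_loss(We, Wd)
```

Because both maps are linear, this toy model can do no better than PCA onto the 3-dimensional subspace; the deep nonlinear networks of Hinton & Salakhutdinov (2006) are what make the learned codes more expressive.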

There are variants of vanilla autoencoder:

  1. Denoising autoencoder (Vincent et al., 2008)

    A denoising autoencoder corrupts each input (e.g., with additive Gaussian noise or random masking) and trains the network to reconstruct the clean input from the corrupted one, encouraging representations that are robust to small perturbations of the data.

  2. Sparse autoencoder

    A sparse autoencoder adds a sparsity penalty on the latent code (e.g., an $L_1$ term or a KL-divergence penalty on average activations), encouraging each input to be represented by only a few active latent units.
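The denoising variant is worth singling out, since it anticipates the noise-then-reconstruct structure of DDPMs. The sketch below (my own toy example, again with linear maps) modifies the vanilla-AE loop in one place: the encoder sees a corrupted input $\mathbf{x} + \sigma\boldsymbol{\epsilon}$, but the reconstruction target is the clean $\mathbf{x}$.

```python
import numpy as np

# Toy denoising-autoencoder sketch (in the spirit of Vincent et al., 2008),
# assuming a linear encoder/decoder and additive Gaussian corruption.
rng = np.random.default_rng(1)
N, D, d = 200, 10, 3                               # samples, data dim, code dim
basis, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal 3-dim subspace
X = rng.normal(size=(N, d)) @ basis.T              # clean data

We = 0.1 * rng.normal(size=(d, D))                 # encoder weights
Wd = 0.1 * rng.normal(size=(D, d))                 # decoder weights
lr, sigma = 0.1, 0.3                               # step size, corruption level

for _ in range(1000):
    Xn = X + sigma * rng.normal(size=X.shape)      # corrupt the input ...
    Z = Xn @ We.T                                  # ... encode the noisy version ...
    R = X - Z @ Wd.T                               # ... but reconstruct the CLEAN x
    Wd -= lr * (-2.0 / N) * R.T @ Z
    We -= lr * (-2.0 / N) * (R @ Wd).T @ Xn

# Denoising error on freshly corrupted inputs (baseline: clean-data norm ~ 3).
Xtest = X + sigma * rng.normal(size=X.shape)
denoise_err = np.mean(np.sum((X - Xtest @ We.T @ Wd.T) ** 2, axis=1))
```

The only change from the vanilla AE is which signal enters the encoder versus which signal sits in the loss; DDPMs push this idea further by chaining many such noising steps.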

  1. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
  2. Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning, 1096–1103.