
From Diffusion models to Flow Matching

The pace of new ideas in diffusion models over the past few years has been pretty exciting. Keeping up with fast-moving fields like this is challenging, both because of the ever-growing lexicon and because of the layers of frameworks that build upon each other.

I wrote this to try to trace out the key ideas of the past few years around diffusion models. We’ll start with DDPM, and try to cover SDEs, denoising score matching, and flow matching. It’s probably incorrect to think of these topics as building upon one another - they’re better described by a fully connected bidirectional graph - but that isn’t a very helpful mental model.

DDPM, aka the original diffusion model

Calling anything the “original” diffusion model is probably controversial, but the DDPM paper certainly sparked much of the attention the field has received since it was published.

The overarching idea of most of these generative models is to learn a model that can convert noise from a distribution that’s easy to sample from (usually a Gaussian) into a sample that belongs to the unknown distribution that your data comes from. You’re given some samples from that data distribution with which to learn this model.

DDPM considers a fixed forward process that converts data samples \(x_0\) into noise:

\[q(x_{t+1} | x_t) = N(x_{t+1} ; \sqrt{1 - \beta_t} x_t , \beta_t \cdot I)\]

Here the variance \(\beta_t\) defines the noise schedule. There are lots of choices here, but all essentially start with close to pure signal at the beginning of the schedule (\(\beta_0 \rightarrow 0\)) and end with close to pure noise once the small per-step variances have accumulated over the whole schedule.

During training, we can use the forward process to corrupt data \(x_0\) into noise \(x_t\), whose distribution at \(t=T\) is the standard normal distribution (for common choices of the noise schedule). Importantly, this forward process is fixed, and not learned.
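To make this concrete, here’s a minimal numpy sketch of the forward process. The specific schedule (a linear ramp of \(\beta_t\) from \(10^{-4}\) to \(0.02\) over 1000 steps, as in the DDPM paper) is just one common choice; everything else follows directly from the equation above.

```python
import numpy as np

T = 1000
# A common choice: small, linearly increasing per-step variances.
betas = np.linspace(1e-4, 0.02, T)

def forward_corrupt(x0, n_steps, rng=np.random.default_rng()):
    """Apply the Markov step x_{t+1} = sqrt(1 - beta_t) x_t + sqrt(beta_t) eps, n_steps times."""
    x = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * eps
    return x
```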

During inference, we’d like to reverse this and turn samples \(x_T\) from our noise distribution into ones which resemble our data. For this, we learn the reverse process:

\[p(x_{t-1} | x_t) = N(x_{t-1} ; \mu_\theta(x_{t}, t), \Sigma_\theta(x_t, t))\]

where the mean and covariance matrix are parameterized by neural networks with parameters \(\theta\).

Diffusion models maximize the evidence lower bound (ELBO). In DDPM, the authors showed that if we try to get the learned reverse process to approximate the true posterior (which is tractable once we also condition on \(x_0\)):

\[q(x_{t-1} | x_t, x_0) = N(x_{t-1} ; \mu(x_t,x_0), \beta_t^\prime \cdot I)\]

then the ELBO actually just reduces to minimizing an MSE loss! Here the variance is fixed, though the actual coefficient \(\beta_t^\prime\) and mean \(\mu(x_t,x_0)\) that appear in \(q(x_{t-1} \mid x_t, x_0)\) aren’t important to remember; you can always work them out using Bayes’ rule. The loss becomes:

\[L(\theta) = \sum_t L_t\] \[L_t(\theta) \propto E_{x_0 \sim \text{data}} || x_\theta(x_t,t) - x_0 ||^2\]

where \(x_t\) is the noise-corrupted data sample from the forward process, and \(x_\theta(x_t,t)\) is the network’s prediction of the clean sample \(x_0\).

Further, since the loss is just a sum over time, we can speed up training by sampling a random time \(t\) and minimizing \(L_t\) at each training step.

Two final points:

  1. A popular reparameterization trick lets us work out \(x_t\) from \(x_0\) in a single step:
\[q(x_t | x_0) = N(x_t ; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) \cdot I)\] \[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\]

Here we’ve introduced some new notation for the diffusion schedule:

\[\alpha_t = 1 - \beta_t\] \[\bar{\alpha}_t = \prod_{s=0}^t \alpha_s\]

At the beginning of the schedule, we typically have all signal (\(\bar{\alpha}_0 \rightarrow 1\)), and at the end, all noise (\(\bar{\alpha}_T \rightarrow 0\)).

  2. In the original DDPM paper, the authors show that it’s convenient to rewrite the loss in terms of the noise, such that the MSE loss actually becomes something like:
\[L_t(\theta) = \lambda_t || \epsilon_\theta(x_t, t) - \epsilon ||^2\]

where \(\epsilon\) is the noise that we added to obtain the corrupted data sample. The choice of target matters a lot in practice; in fact, neither of these two choices is that great. If you use the noise as the target, then you’re not learning much toward the end of the diffusion schedule, where \(x_t\) is almost pure noise and predicting \(\epsilon\) is trivial. Similarly, if you use the signal as the target instead, then early times are trivial. A popular choice to do better here is the velocity target:

\[v_t = \sqrt{\bar{\alpha}_t} \epsilon - \sqrt{1-\bar{\alpha}_t} x_0\]

which essentially amounts to predicting the noise \(\epsilon\) at the beginning of the schedule and (minus) the signal \(x_0\) at the end. All of these targets can be unified by absorbing the differences into the \(\lambda_t\) weighting. (A small sketch of these targets in code follows this list.)
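Here’s a small sketch of both points: corrupting a sample in a single step using \(\bar{\alpha}_t\), and computing the three regression targets. The linear schedule is again just an assumed example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, as before
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def corrupt_and_targets(x0, t, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) in one step and return the three common regression targets."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    v = np.sqrt(alpha_bars[t]) * eps - np.sqrt(1.0 - alpha_bars[t]) * x0   # velocity target
    return x_t, {"eps": eps, "x0": x0, "v": v}
```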

The final recipe then is:

  1. (Setup) Choose the noise schedule.
  2. During training, choose a time \(t\) and corrupt a data sample \(x_0\) with some sampled noise to obtain \(x_t\). Try to get the network to predict the noise that was added to the sample.
  3. At inference time, sample some noise, and use the predictions from the model to construct \(x_{t-1}\) from \(x_t\).
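As a sketch of step 3, here’s one possible reverse step, written in terms of a hypothetical noise-prediction network `eps_model` and using the fixed variance choice \(\sigma_t^2 = \beta_t\) from the DDPM paper. Repeating this from \(t = T-1\) down to \(t = 0\) turns pure noise into a sample.

```python
import numpy as np

def ddpm_reverse_step(eps_model, x_t, t, betas, alpha_bars, rng=np.random.default_rng()):
    """One reverse step x_t -> x_{t-1}, using a noise-prediction model eps_model(x, t)."""
    alpha_t = 1.0 - betas[t]
    eps_hat = eps_model(x_t, t)
    # Posterior mean, rewritten in terms of the predicted noise.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean
    # A common fixed choice for the reverse variance is beta_t itself.
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```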

Score-based diffusion models

Let’s take another look at the forward process in DDPM:

\[q(x_{t+1} | x_t) = N(x_{t+1} ; \sqrt{1 - \beta_t} x_t , \beta_t \cdot I)\] \[x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon_t\]

This looks almost like a step from an Euler-Maruyama method for an SDE:

\[x_{t+1} = x_t + f(x_t,t) \Delta t + g(t) \sqrt{\Delta t} \cdot \epsilon_t\]

(here we let \(t \in [0,1]\)).

In fact, if we expanded \(\sqrt{1-\beta_t} = 1 - \beta_t / 2 + O(\beta_t^2)\) for small noise levels:

\[x_{t+1} = x_t - \frac{\beta_t}{2} x_t + \sqrt{\beta_t} \epsilon_t\]

Then we could identify (absorbing \(\Delta t\) here into \(\beta_t\) when moving from continuous to discrete time):

\[f(x_t,t) = - \frac{\beta_t}{2} x_t\] \[g(t) = \sqrt{\beta_t}\]

Neat! We realize we could do all of this in continuous time instead of discrete, and that we can generalize this setup to different choices for the drift and diffusion terms \(f(x_t,t)\) and \(g(t)\) by considering SDEs of the form:

\[dx_t = f(x_t,t) \cdot dt + g(t) \cdot dw_t\]

where \(w_{t+\Delta} - w_t \sim N(0, \Delta \cdot I)\). This SDE takes samples from our data distribution at \(t=0\) and moves them to our noise distribution at \(t=1\).
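As a minimal sketch, here’s how you could simulate this forward SDE with Euler-Maruyama. The drift and diffusion callables use the variance preserving identification from above, with an assumed linear continuous-time schedule \(\beta(t)\).

```python
import numpy as np

def euler_maruyama(x0, f, g, n_steps=1000, rng=np.random.default_rng()):
    """Integrate dx = f(x, t) dt + g(t) dw from t = 0 (data) to t = 1 (noise)."""
    dt = 1.0 / n_steps
    x = np.asarray(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Variance preserving choice from the text, with an assumed linear beta(t) in [0.1, 20]:
beta = lambda t: 0.1 + 19.9 * t
f_vp = lambda x, t: -0.5 * beta(t) * x
g_vp = lambda t: np.sqrt(beta(t))
```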

Ultimately, we really want to solve the reverse-time SDE to do sampling:

\[dx_t = (f(x_t,t) - g^2(t) \nabla_x \ln p_t(x_t) ) dt + g(t) \cdot d \bar{w}_t\]

Since we know neither the log density nor its gradient, we replace the gradient (known as the score function) with a learned model:

\[s_\theta(x_t,t) \approx \nabla_x \ln p_t(x_t)\]

We might be tempted to learn this directly using an objective similar to that of the DDPM:

\[L_t(\theta) \sim || s_\theta(x_t,t) - \nabla_x \ln p_t(x_t) ||^2\]

but again, since we don’t know the true density, we wouldn’t know how to train this. Instead, we can use a popular score function identity:

\[\nabla_x \ln p_t(x_t) = E_{x_0 \sim p(x_0 | x_t)} [ \nabla_x \ln p_t (x_t | x_0) ]\]

The proof of this is straightforward: write the gradient of the log density as \(\nabla_x p_t(x_t) / p_t(x_t)\), substitute the definition of the marginal \(p_t(x_t) = \int p_t(x_t | x_0) \, p(x_0) \, dx_0\), and use Bayes’ theorem to recognize the posterior \(p(x_0 | x_t)\). I’m using \(p\) here to keep the notation more similar to other literature, but this is really the same as \(q\) in DDPM. The loss becomes:

\[L_t(\theta) \sim E_{x_t \sim p_t (x_t | x_0), x_0 \sim \text{data}} || s_\theta(x_t,t) - \nabla_x \ln p_t(x_t | x_0) ||^2\]

This is called denoising score matching.

This has a simple relationship to the noise in DDPM, where we can work out (by substituting the Gaussian for \(q\)):

\[\nabla_x \ln q(x_t | x_0) = - \frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon\]

More commonly, this is written as \(s_\theta(x_t,t) = - \epsilon_\theta(x_t,t) / \sigma_t\) where \(\sigma_t^2\) is just the variance of the forward process.
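Putting this together, a minimal sketch of the denoising score matching loss for the DDPM-style Gaussian looks like the following; `score_model` is a hypothetical network, and any time-dependent loss weighting is omitted.

```python
import numpy as np

def dsm_loss(score_model, x0, t, alpha_bar_t, rng=np.random.default_rng()):
    """Denoising score matching loss for one (x0, t) pair, with q(x_t | x_0) as in DDPM."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    target = -eps / np.sqrt(1.0 - alpha_bar_t)   # grad_x log q(x_t | x_0)
    return np.mean((score_model(x_t, t) - target) ** 2)
```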

A couple last points:

  • Solving the reverse-time SDE is slow, just like DDPM sampling. Instead, we can consider the deterministic probability flow ODE (PF-ODE). This shares the same marginals \(p_t\) as the SDE: its deterministic flow pushes probability mass around in the same way:

    \[\frac{dx_t}{dt} = f(x_t,t) - \frac{1}{2}g^2(t) \nabla_x \ln p_t(x_t)\]

    to which we can apply faster ODE solvers. Heun’s method is a popular choice.

  • The choice we made above for the drift and diffusion terms corresponds to a variance preserving diffusion model. Another popular choice is a variance exploding diffusion model, e.g.

    \[f(x_t,t) = 0\] \[g(t) = \sigma^t \; \text{for} \; \sigma > 1\]

Let’s reflect on the whole recipe:

  1. (Setup) Choose the forms for the drift and diffusion terms, and work out the form of the conditional distribution \(p_t(x_t \mid x_0)\) - i.e. how to corrupt the data with noise.
  2. During training, follow the same recipe as DDPM: choose a time \(t\) and corrupt a data sample \(x_0\) with some sampled noise to obtain \(x_t\). Try to get the network to predict the gradient of the log conditional density.
  3. At inference time, sample some noise \(x_1\), and then choose your favorite ODE solver to solve the PF-ODE backwards in time, replacing the score function with the learned model.
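Here’s a sketch of step 3 with Heun’s method; `f`, `g`, and `score_model` are hypothetical callables for the drift, diffusion, and learned score. Practical samplers add details like carefully chosen time-step schedules, but the core loop is just this:

```python
import numpy as np

def pf_ode_rhs(x, t, f, g, score_model):
    """Right-hand side of the probability flow ODE."""
    return f(x, t) - 0.5 * g(t) ** 2 * score_model(x, t)

def heun_sample(x1, f, g, score_model, n_steps=50):
    """Solve the PF-ODE backwards from t = 1 (noise) to t = 0 (data) with Heun's method."""
    dt = -1.0 / n_steps
    x, t = np.asarray(x1, dtype=float), 1.0
    for _ in range(n_steps):
        d1 = pf_ode_rhs(x, t, f, g, score_model)
        x_euler = x + dt * d1                               # Euler predictor
        d2 = pf_ode_rhs(x_euler, t + dt, f, g, score_model)
        x = x + 0.5 * dt * (d1 + d2)                        # Heun (trapezoidal) corrector
        t = t + dt
    return x
```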

Flow matching

Let’s try to get even more general. We’re now solving ODEs backwards in time, which are of the form:

\[\frac{dx_t}{dt} = f(x_t,t) - \frac{1}{2}g^2(t) \nabla_x \ln p_t(x_t)\]

Can we generalize this? Let’s consider a generic ODE:

\[\frac{dx_t}{dt} = v(x_t, t)\]

where \(v\) is the vector field, and we’re still going to consider the same range of time \(t \in [0,1]\). Unfortunately, most literature introduces a switch in notation here, and we’re going to do the same: why consider solving ODEs backwards in time, when we can just instead call \(x_0\) the sample from the noise distribution and \(x_1\) the data? To generate new samples with this switch, we then just need to solve the ODEs forward in time.

We now have the challenge of coming up with a model \(v_\theta(x_t,t)\) that will move samples from the noise to the data distribution. This setup again looks a lot like denoising score matching: we could try an objective like this:

\[L_t(\theta) \sim || v_\theta(x_t,t) - v(x_t,t) ||^2\]

but we don’t know the true \(v(x_t,t)\) to train it. Instead we will again replace it with a conditional target, conditioning on some random variable \(z\), similar to when we applied the score function identity earlier:

\[L_t(\theta) = E_{x_t \sim p_t(x_t|z), z \sim \text{prior}} || v_\theta(x_t,t) - v(x_t | z, t) ||^2\]

It turns out that this is a good choice - up to a constant that doesn’t depend on \(\theta\), the two losses are equivalent, so they have the same gradients.

This now really looks like we’re generalizing denoising score matching, but we’re free to choose:

  • What conditional random variable \(z\) to use. Previously, we just considered this to be a sample from the data distribution, \(z = x_1\).
  • The prior distribution to sample \(z\) from. Again, previously we just used the data samples we had.
  • The conditional distribution for \(p_t(x_t \mid z)\). Previously, we assumed an isotropic Gaussian for this (the DDPM forward process).

A popular set of choices for the first two is:

\[z = (x_0, x_1)\]

that is, the condition is a pair of noise and data samples, whose prior is taken to be independent:

\[p_\text{prior}(z) = p_\text{prior,0}(x_0) \cdot p_\text{prior,1}(x_1)\]

What about the choice for the conditional distribution \(p_t(x_t \mid z)\)? Gaussians make our life easy, so let’s stick with the same choice as we made for DDPM:

\[p_t(x_t | z) = N(x_t; \mu(z,t), \sigma^2(z,t) \cdot I)\]

In DDPM we had chosen the forms for the mean and standard deviation from the forward process \(q_t(x_t \mid x_0)\). We ended up with an MSE loss that regresses the clean signal \(x_0\) given the corrupted sample \(x_t\) (or, after some more algebra, an MSE for the noise itself).

Now we have generalized this to any desired choice for the form of the mean and standard deviation. Once we do make a choice, we’ll need to work out the conditional velocity field, so that we can use it as the target in the loss during training - from:

\[x_t = \mu(z,t) + \sigma(z,t) \cdot \epsilon\]

it’s easy to derive that:

\[v(x_t | z, t) = \frac{d x_t}{dt} = \frac{\sigma^\prime (z,t)}{\sigma(z,t)} (x_t - \mu(z,t)) + \mu^\prime(z,t)\]
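As a small sketch, the target is easy to compute once you’ve worked out \(\mu\), \(\sigma\) and their time derivatives for your chosen path:

```python
def conditional_velocity(x_t, mu, dmu_dt, sigma, dsigma_dt):
    """Target velocity v(x_t | z, t) for a Gaussian path x_t = mu(z,t) + sigma(z,t) * eps."""
    return dsigma_dt / sigma * (x_t - mu) + dmu_dt

# For the linear path introduced next (mu = t*x1 + (1-t)*x0, sigma constant),
# dmu_dt = x1 - x0 and dsigma_dt = 0, so the target reduces to x1 - x0.
```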

For the choice of \(z\) introduced above, a popular parameterization is:

\[\mu(z,t) = t x_1 + (1-t) x_0\] \[\sigma(z,t) = \text{const.}\]

which has the desired boundary condition that we end up at our data sample at the end of the path, \(\mu(z,t=1) = x_1\). The resulting conditional velocity field is \(v(x_t \mid z,t) = x_1 - x_0\).

Let’s reflect on the recipe, once again:

  1. (Setup) Choose the definition of the conditional random variable \(z\), its distribution (that is easy to sample from), and the forms for \(\mu(z,t)\) and \(\sigma(z,t)\). Work out the conditional velocity field that you will need as a target.
  2. During training, follow the same recipe as DDPM: choose a time \(t\), choose a data sample \(x_1\), and corrupt it by sampling \(x_t \sim p_t(x_t \mid z)\), where \(z\) is obtained by sampling any other required random variables (e.g. the noise sample \(x_0\)). Try to get the network to predict the conditional velocity field.
  3. At inference time, sample some noise \(x_0\), and then choose your favorite ODE solver to solve the ODE forwards in time, replacing the velocity field with the learned model.
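Here’s a minimal PyTorch sketch of this recipe for the linear path, taking the constant \(\sigma\) to zero for simplicity (with a small constant \(\sigma\) you’d just add \(\sigma \epsilon\) to \(x_t\); the target is unchanged since \(\sigma^\prime = 0\)). The `velocity_net` model is a hypothetical stand-in for whatever network you use.

```python
import torch

def cfm_loss(velocity_net, x1):
    """Conditional flow matching loss for a batch of data samples x1 of shape (B, D)."""
    x0 = torch.randn_like(x1)              # noise sample from the prior
    t = torch.rand(x1.shape[0], 1)         # random times in [0, 1]
    x_t = t * x1 + (1.0 - t) * x0          # sample from p_t(x_t | z), with z = (x0, x1)
    target = x1 - x0                       # conditional velocity field
    return ((velocity_net(x_t, t) - target) ** 2).mean()

@torch.no_grad()
def euler_sample(velocity_net, x0, n_steps=100):
    """Generate data by integrating dx/dt = v_theta(x, t) forwards from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_net(x, t)
    return x
```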

Continuous normalizing flows

OK, it’s a lie to call this “from diffusion models to flow matching” when we’re adding another section after flow matching. But there’s another nice connection between flow matching and ODEs that comes through conservation of mass. If you consider \(p\) to describe a cloud of particles, moving according to a velocity field \(v\) that pushes them around, then conservation of mass is formulated as:

\[\frac{\partial p}{\partial t} + \vec{\nabla} \cdot (p \vec{v}) = 0\]

where we’ve dropped the arguments of \(p_t(x_t)\) for brevity. Expanding the divergence with the product rule:

\[\vec{\nabla} \cdot (p \vec{v}) = p (\vec{\nabla} \cdot \vec{v}) + (\vec{\nabla} p) \cdot \vec{v}\]

The first term measures compression or expansion due to the velocity field, and the second term is the advection term, which measures how the density is transported along the flow.

In flow matching, we constrained the conditional path \(p_t(x_t \mid z)\) that transports samples from our noise distribution to our data to be a Gaussian, and then trained a neural network to predict the (conditional) velocity field that transports samples along that path. Instead, could we relax this constraint and allow any marginal, as long as it obeys conservation of mass? This is the main idea behind continuous normalizing flows.

Just from the chain rule, we know that:

\[\frac{d p}{dt} = \frac{\partial p}{\partial t} + \frac{d\vec{x}}{dt} \cdot \vec{\nabla} p = \frac{\partial p}{\partial t} + \vec{v} \cdot \vec{\nabla} p\]

and then using the continuity equation, we obtain

\[\frac{d p}{dt} = - p (\vec{\nabla} \cdot \vec{v})\] \[\frac{d \ln p}{dt} = - \vec{\nabla} \cdot \vec{v}\] \[\ln p(\vec{x}_1) = \ln p(\vec{x}_0) - \int_0^1 dt \; \vec{\nabla} \cdot \vec{v}\]

where we integrated in time in the last step, and use \(x_0\) and \(x_1\) to denote noise and data samples.

This is a great result - if we can do the integral, we can directly compute and optimize the log likelihood! That means: to train an estimate \(\vec{v}_\theta\) of \(\vec{v}\), take the given data samples \(x_1\) and solve backwards in time using:

\[\frac{d \vec{x}_t}{dt} = \vec{v}_\theta(\vec{x}_t,t)\]

Simultaneously, we can integrate \(d \ln p / dt = - \vec{\nabla} \cdot \vec{v}\) along the trajectory to obtain the log likelihood, and optimize it directly! We do still have to assume some simple Gaussian form for \(\ln p(\vec{x}_0)\) so that we can compute that term analytically. A final important trick to make this work in practice is to use Hutchinson’s trace estimator to compute the divergence terms efficiently.
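Here’s a sketch of Hutchinson’s estimator for the divergence using PyTorch autograd; `v` is a hypothetical velocity network mapping a batch of points and a time to a vector field of the same shape. It relies on the identity \(E[\epsilon^T J \epsilon] = \text{tr}(J)\) for \(\epsilon \sim N(0, I)\), so a single random probe already gives an unbiased estimate.

```python
import torch

def divergence_hutchinson(v, x, t, n_samples=1):
    """Unbiased estimate of div_x v(x, t), for x of shape (B, D)."""
    x = x.detach().requires_grad_(True)
    vx = v(x, t)
    est = torch.zeros(x.shape[0])
    for _ in range(n_samples):
        eps = torch.randn_like(x)
        # Vector-Jacobian product eps^T J, then contract with eps to estimate tr(J).
        # Pass create_graph=True here if you need to backpropagate through the estimate.
        vjp = torch.autograd.grad(vx, x, grad_outputs=eps, retain_graph=True)[0]
        est = est + (vjp * eps).sum(dim=1)
    return est / n_samples
```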

Many applications of continuous normalizing flows have been overtaken by flow matching, but they remain relevant when being able to compute the exact log likelihood is beneficial.


Oliver K. Ernst
June 21, 2025