Paper Notes - Denoising Diffusion Probabilistic Models

  • Deep Unsupervised Learning using Nonequilibrium Thermodynamics.

  • DDPM: Denoising Diffusion Probabilistic Models

  • Diffusion models beat gans on image synthesis.

  • Image super-resolution via iterative refinement

  • Cascaded diffusion models for high fidelity image generation

  • Score-based generative modeling in latent space.

  • Discrete diffusion model

    • Structured Denoising Diffusion Models in Discrete State-Spaces.
    • Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models.
    • Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes (arXiv:2111.12701)
  • parallel generation

    • Generative modeling by estimating gradients of the data distribution
    • Score-based generative modeling through stochastic differential equations.
    • Denoising diffusion implicit models.

Denoising Diffusion Probabilistic Models

Forward diffusion process

$x_t$ is obtained from $x_{t-1}$ by adding Gaussian noise whose scale is determined by the variance $\beta_t$; the resulting conditional distribution is $\mathcal{N}(\sqrt{1-\beta_{t}}\,x_{t-1}, \beta_t I)$:
$$
q(x_t|x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_{t}}x_{t-1}, \beta_tI)
$$

$$
q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1})
$$

Define $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$. Then:
$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}x_{t-1} + \sqrt{1 - \alpha_t}z_{t-1} & \text{ ;where } z_{t-1}, z_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \color{blue}{\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t-1}}z_{t-2}) + \sqrt{1 - \alpha_t}z_{t-1}}\\
&= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar z_{t-2} & \text{ ;where } \bar z_{t-2} \text{ merges two Gaussians (*).} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}z \\
q(x_t \vert x_0) &= \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t){I})
\end{aligned}
$$

The step from the second to the third line above merges two Gaussians, $\mathcal{N}(0, \sigma_1^2I)$ and $\mathcal{N}(0, \sigma_2^2I)$: their sum is a new Gaussian $\mathcal{N}(0, (\sigma_1^2+\sigma_2^2)I)$. The merged noise term is therefore:
$$
\begin{aligned}
\sqrt{\alpha_t(1-\alpha_{t-1})}\,z_{t-2} + \sqrt{1-\alpha_t}\,z_{t-1}
&= \sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})}\, \bar z_{t-2} \\
&= \sqrt{1 - \alpha_t\alpha_{t-1}}\,\bar z_{t-2},
\qquad \bar z_{t-2} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\end{aligned}
$$
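As a quick numerical sanity check of this merge (a minimal sketch; the $\alpha$ values are arbitrary):

import torch

alpha_t, alpha_prev = 0.98, 0.97                      # arbitrary example values
z1, z2 = torch.randn(1_000_000), torch.randn(1_000_000)
merged = (alpha_t * (1 - alpha_prev)) ** 0.5 * z1 + (1 - alpha_t) ** 0.5 * z2
print(merged.var().item(), 1 - alpha_t * alpha_prev)  # both should be ≈ 0.0494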
Therefore we can obtain $x_t$ at any time step $t$ directly from $x_0$:
$$
q(x_t|x_0)=\mathcal{N}(x_t;\sqrt{\bar \alpha_t}x_0, (1-\bar{\alpha_t})I)
$$
The code for this forward process is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data
from typing import Tuple, Optional


def gather(consts: torch.Tensor, t: torch.Tensor):
    """Pick the constants for the given time steps and reshape for broadcasting."""
    c = consts.gather(-1, t)
    return c.reshape(-1, 1, 1, 1)


class DenoiseDiffusion(nn.Module):
    def __init__(self, eps_model: nn.Module, n_steps: int, device: torch.device):
        super().__init__()
        self.eps_model = eps_model
        self.beta = torch.linspace(0.0001, 0.02, n_steps).to(device)
        self.alpha = 1. - self.beta
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)
        self.n_steps = n_steps
        self.sigma2 = self.beta

    def q_xt_x0(self, x0: torch.Tensor, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        #### Get $q(x_t|x_0)$ distribution
        \begin{align}
        q(x_t|x_0) &= \mathcal{N} \Big(x_t; \sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) \mathbf{I} \Big)
        \end{align}
        """
        # gather $\bar\alpha_t$ and compute $\sqrt{\bar\alpha_t} x_0$
        mean = gather(self.alpha_bar, t) ** 0.5 * x0
        # $(1-\bar\alpha_t) \mathbf{I}$
        var = 1 - gather(self.alpha_bar, t)
        return mean, var

    def q_sample(self, x0: torch.Tensor, t: torch.Tensor, eps: Optional[torch.Tensor] = None):
        """
        #### Sample from $q(x_t|x_0)$
        \begin{align}
        q(x_t|x_0) &= \mathcal{N} \Big(x_t; \sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) \mathbf{I} \Big)
        \end{align}
        """
        # $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
        if eps is None:
            eps = torch.randn_like(x0)
        # get $q(x_t|x_0)$
        mean, var = self.q_xt_x0(x0, t)
        # Sample from $q(x_t|x_0)$ to get a noised sample.
        # Similar to the decoder input of a VAE, except that a VAE's mean and variance come
        # from a learned encoder, whereas here they are computed in closed form.
        return mean + (var ** 0.5) * eps


if __name__ == "__main__":
    ddpm = DenoiseDiffusion(None, 25, torch.device("cpu"))
    x_0 = torch.randn(5, 3, 224, 224)

    print(ddpm.alpha_bar)
    x_1_mean, x_1_var = ddpm.q_xt_x0(x_0, torch.LongTensor([1]))
    print(x_1_mean, x_1_var)
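As a small illustration of the forward process (a sketch reusing the class above; the step count, shapes, and the constant toy input are arbitrary), noising the same $x_0$ at increasing $t$ shows the signal decaying toward pure noise:

# A toy "image" of all ones; as t grows, the mean shrinks toward 0 and the std grows toward 1.
ddpm = DenoiseDiffusion(None, 1000, torch.device("cpu"))
x0 = torch.ones(1, 3, 64, 64)
for step in [0, 99, 499, 999]:
    t = torch.LongTensor([step])
    xt = ddpm.q_sample(x0, t)
    # mean(xt) ≈ sqrt(alpha_bar_t), std(xt) ≈ sqrt(1 - alpha_bar_t)
    print(step, ddpm.alpha_bar[step].item(), xt.mean().item(), xt.std().item())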

A quick review of the multivariate Gaussian derivation:

Start with the one-dimensional standard normal distribution:

$p(x) = \dfrac{1}{\sqrt{2\pi}}\exp\left(-\dfrac{x^2}{2}\right)$

The two-dimensional standard normal is simply the joint distribution of two independent one-dimensional standard normal random variables:

$p(x,y) = p(x)p(y)=\dfrac{1}{2\pi}\exp\left(-\dfrac{x^2+y^2}{2}\right)$. The Gaussian-merging step in the forward-process derivation above is based on this fact.

Stack the two random variables into a random vector $v=[x\quad y]^T$:

$p(v)=\dfrac{1}{2\pi}\exp\left(-\dfrac{1}{2}v^Tv\right)$. When $x$ and $y$ are independent, this is exactly the two-dimensional standard normal above.

Then generalize from the standard normal to a general normal through a linear change of variables $v=A(x-\mu)$:

$p(x)=\dfrac{|A|}{2\pi}\exp\left[-\dfrac{1}{2}(x-\mu)^TA^TA(x-\mu)\right]$

Note the extra factor $|A|$ (the determinant of $A$) in front.

One can show that this distribution has mean $\mu$ and covariance $(A^TA)^{-1}$. Writing $\Sigma = (A^TA)^{-1}$, we get
$$
p(\mathbf{x}) = \frac{1}{2\pi|\Sigma|^{1/2}} \exp \left[ -\frac{1}{2} (\mathbf{x} - \mu) ^T \Sigma^{-1} (\mathbf{x} - \mu) \right]
$$
Generalizing to $n$ dimensions:
$$
p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp \left[ -\frac{1}{2} (\mathbf{x} - \mu) ^T \Sigma^{-1} (\mathbf{x} - \mu) \right]
$$
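This density can be sanity-checked against PyTorch's built-in multivariate normal (a minimal sketch; the mean and covariance are made up):

import math
import torch
from torch.distributions import MultivariateNormal

n = 3
mu = torch.tensor([1.0, -0.5, 2.0])
A = torch.randn(n, n)
Sigma = A @ A.T + n * torch.eye(n)     # a random positive-definite covariance matrix
x = torch.randn(n)

# Density computed directly from the formula above
diff = x - mu
quad = diff @ torch.linalg.inv(Sigma) @ diff
manual = torch.exp(-0.5 * quad) / ((2 * math.pi) ** (n / 2) * torch.linalg.det(Sigma) ** 0.5)

# PyTorch's built-in multivariate normal
builtin = MultivariateNormal(mu, covariance_matrix=Sigma).log_prob(x).exp()
print(manual.item(), builtin.item())   # the two values should agree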
For a multivariate random variable $X=[X_1,X_2,\dots,X_n]^T$ we often need the covariances between dimensions, which together form an $n\times n$ matrix called the covariance matrix. It is a symmetric matrix: the diagonal entries are the variances of the individual dimensions, and the off-diagonal entries are the covariances between dimensions. Denoting the covariance matrix by $\Sigma$, its entries $\Sigma_{ij}$ are:

$\Sigma_{ij} = cov(X_i,X_j) = E[(X_i - E(X_i)) (X_j - E(X_j))]$


If $X\sim N(\mu,\Sigma)$, then $\mathrm{Cov}(X)=\Sigma$.

An intuition for covariance: for an $n$-dimensional random variable $X$, suppose the first dimension is weight $X_1$ and the second is appearance $X_2$; these two dimensions are clearly related, and $cov(X_1,X_2)$ characterizes how they vary together. To estimate it, take $m$ samples: the first coordinate of all the samples makes up $X_1$, and so on (do not confuse the dimensions $X_i$ with the samples themselves).
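To make this concrete, here is a minimal sketch that estimates the covariance matrix from m samples (the numbers are made up):

import torch

# m = 5 samples of a 2-D variable: column 0 = weight (X1), column 1 = an "appearance" score (X2)
X = torch.tensor([[60., 7.], [72., 6.], [55., 8.], [80., 5.], [65., 7.]])
Xc = X - X.mean(dim=0, keepdim=True)     # center each dimension
cov = Xc.T @ Xc / (X.shape[0] - 1)       # unbiased sample covariance, shape (2, 2)
print(cov)                               # diagonal: variances; off-diagonal: cov(X1, X2)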

Reverse denoising Process

If we could reverse the process above, i.e., sample from $q(x_{t-1}|x_t)$, we could reconstruct a true sample starting from a Gaussian-noise input $x_T\sim \mathcal{N}(0,I)$. Unfortunately, $q(x_{t-1}|x_t)$ cannot be estimated directly, so we learn a model $p_{\theta}$ to approximate these conditional probabilities and use it to run the reverse diffusion process. This is analogous to the variational approximation in a VAE, where another model stands in for the intractable posterior: $p_{\theta}(x_{t-1}|x_t) \approx q(x_{t-1}|x_t)$.
$$
\begin{aligned}
p(x_T) &= \mathcal{N}(x_T;0,I) \\
p_\theta(\mathbf x_{0:T}) &= p(\mathbf x_T) \prod^T_{t=1} p_\theta(\mathbf x_{t-1} \vert \mathbf x_t) \quad \
\end{aligned}
$$
We start from $x_T \sim \mathcal{N}(0, I)$; when $\beta_t$ is small, the reverse conditional $q(x_{t-1}|x_t)$ is also approximately Gaussian.
$$
p_\theta(x_{t-1} \vert x_t) = \mathcal{N}( x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta( x_t, t))
$$
As in a VAE, a neural network is used to produce the mean $\mu_{\theta}(x_t,t)$ and variance $\Sigma_\theta(x_t, t)$.


We can derive the variational lower bound (VLB) with a simple argument:
$$
\begin{aligned}
-\log p_\theta(x_0) &\leq - \log p_\theta(x_0) + D_\text{KL}(q(x_{1:T}\vert x_0) \parallel p_\theta(x_{1:T}\vert x_0) ) \\
&= -\log p_\theta(x_0) + \mathbb E_{x_{1:T}\sim q(x_{1:T}\vert x_0)} \Big[ \log\frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \Big] \\
&= -\log p_\theta(x_0) + \mathbb E_q \Big[ \log\frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0) \Big] \\
&= \mathbb E_q \Big[ \log \frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})} \Big] \\
\text{Let }L_\text{VLB}
&= \mathbb E_{q(x_{0:T})} \Big[ \log \frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})} \Big]
\end{aligned}
$$

The second step above uses the expansion of the KL divergence:

$D_\text{KL}(q\,\|\,p)=H(q,p)-H(q)=\mathbb{E}_{x\sim q(x)}\Big[\ln\dfrac{1}{p(x)} - \ln\dfrac{1}{q(x)}\Big]=\mathbb{E}_{x\sim q(x)}\Big[\ln\dfrac{q(x)}{p(x)}\Big]$

Here $q$ is the true distribution and $p$ is the predicted/learned distribution.
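This identity is easy to verify numerically with two arbitrary one-dimensional Gaussians (a minimal sketch):

import torch
from torch.distributions import Normal, kl_divergence

q = Normal(0.0, 1.0)    # the "true" distribution q
p = Normal(1.0, 2.0)    # the predicted/learned distribution p
x = q.sample((200_000,))
mc_kl = (q.log_prob(x) - p.log_prob(x)).mean()     # E_{x~q}[ln q(x) - ln p(x)]
print(mc_kl.item(), kl_divergence(q, p).item())    # Monte Carlo estimate vs closed form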

$$
\begin{aligned}
L_\text{VLB}
&= \mathbb E_{q(\mathbf x_{0:T})} \Big[ \log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})} \Big] \\
&= \mathbb E_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf x_t\vert\mathbf x_{t-1})}{ p_\theta(\mathbf x_T) \prod_{t=1}^T p_\theta(\mathbf x_{t-1} \vert\mathbf x_t) } \Big] \\
&= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=1}^T \log \frac{q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} \Big] \\
&= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \frac{q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} + \log\frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\
&= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)}\cdot \frac{q(\mathbf x_t \vert \mathbf x_0)}{q(\mathbf x_{t-1}\vert\mathbf x_0)} \Big) + \log \frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\
&= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} + \sum_{t=2}^T \log \frac{q(\mathbf x_t \vert \mathbf x_0)}{q(\mathbf x_{t-1} \vert \mathbf x_0)} + \log\frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\
&= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} + \log\frac{q(\mathbf x_T \vert \mathbf x_0)}{q(\mathbf x_1 \vert \mathbf x_0)} + \log \frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\
&= \mathbb E_q \Big[ \log\frac{q(\mathbf x_T \vert \mathbf x_0)}{p_\theta(\mathbf x_T)} + \sum_{t=2}^T \log \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} - \log p_\theta(\mathbf x_0 \vert \mathbf x_1) \Big] \\
&= \mathbb E_q [D_\text{KL}(q(\mathbf x_T \vert \mathbf x_0) \parallel p_\theta(\mathbf x_T)) + \sum_{t=2}^T D_\text{KL}(q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0) \parallel p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)) - \log p_\theta(\mathbf x_0 \vert \mathbf x_1)]
\end{aligned}
$$
The step above that rewrites $q(x_t \vert x_{t-1})$ as $q(x_t \vert x_{t-1}, x_0)$ follows from the Markov property of the forward chain, i.e. given $z_1$, $z_2$ is independent of $x$:
$$
q(z_1,z_2|x) = q(z_2|z_1,x)\,q(z_1|x) = q(z_2|z_1)\,q(z_1|x)
$$
The VLB can thus be split into the following three parts:
$$
\begin{aligned}
L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\
\text{where } L_T &= D_\text{KL}(q(\mathbf x_T \vert \mathbf x_0) \parallel p_\theta(\mathbf x_T)) \\
L_t &= D_\text{KL}(q(\mathbf x_{t-1} \vert \mathbf x_{t}, \mathbf x_0) \parallel p_\theta(\mathbf x_{t-1} \vert\mathbf x_{t})) \text{ for }2 \leq t \leq T \\
L_0 &= - \log p_\theta(\mathbf x_0 \vert \mathbf x_1)
\end{aligned}
$$
Here $L_T$ is a constant: $q(x_T|x_0)$ has no learnable parameters and $x_T\sim \mathcal{N}(0, I)$.

Next we work out the first argument of the KL term in $L_t$.

By Bayes' rule,
$$
q(x_{t-1}|x_t, x_0)\,q(x_t|x_0)= q(x_{t-1}, x_t|x_0) = q(x_t|x_{t-1}, x_0)\,q(x_{t-1}|x_0)
$$
we obtain:
$$
\begin{aligned}
q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)
&= q(\mathbf x_t \vert \mathbf x_{t-1}, \mathbf x_0) \frac{ q(\mathbf x_{t-1} \vert \mathbf x_0) }{ q(\mathbf x_t \vert \mathbf x_0) } , \color{blue}{\text{(the first factor is the forward step } q(x_t \vert x_{t-1}) \text{; the other two come from the closed form } q(x_t \vert x_0)\text{)}}\\
&\color{blue}{\propto \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, \beta_t I)\, \mathcal{N}(x_{t-1};\sqrt{\bar \alpha_{t-1}}x_0, (1-\bar{\alpha}_{t-1})I) \,/\,\mathcal{N}(x_t;\sqrt{\bar \alpha_t}x_0, (1-\bar{\alpha}_t)I)} \\
&\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf x_t - \sqrt{\alpha_t} \mathbf x_{t-1})^2}{\beta_t} + \frac{(\mathbf x_{t-1} - \sqrt{\bar \alpha_{t-1}} \mathbf x_0)^2}{1-\bar \alpha_{t-1}} - \frac{(\mathbf x_t - \sqrt{\bar \alpha_t} \mathbf x_0)^2}{1-\bar \alpha_t} \big) \Big) \\
&= \exp\Big( -\frac{1}{2} \big( \color{red}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}})} \mathbf x_{t-1}^2 - \color{blue}{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf x_t + \frac{2\sqrt{\bar \alpha_{t-1}}}{1 - \bar \alpha_{t-1}} \mathbf x_0)} \mathbf x_{t-1} + C(\mathbf x_t, \mathbf x_0) \big) \Big)
\end{aligned}
$$
where:
$$
\begin{aligned}
\tilde \beta_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}}) = \frac{1 - \bar \alpha_{t-1}}{1 - \bar \alpha_t} \cdot \beta_t, \text{ which follows by substituting } \alpha_t=1-\beta_t,\ \bar \alpha_t=\prod_{s=1}^{t}\alpha_s \text{ (worked out below)}\\
\tilde \mu_t (\mathbf x_t, \mathbf x_0)
&= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf x_t + \frac{\sqrt{\bar \alpha_{t-1}}}{1 - \bar \alpha_{t-1}} \mathbf x_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}})
= \frac{\sqrt{\alpha_t}(1 - \bar \alpha_{t-1})}{1 - \bar \alpha_t} \mathbf x_t + \frac{\sqrt{\bar \alpha_{t-1}}\beta_t}{1 - \bar \alpha_t} \mathbf x_0
\end{aligned}
$$
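Spelling out the algebra for $\tilde \beta_t$, using $\beta_t = 1-\alpha_t$ and $\bar \alpha_t = \alpha_t \bar \alpha_{t-1}$:
$$
\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}}
= \frac{\alpha_t(1-\bar\alpha_{t-1}) + \beta_t}{\beta_t(1-\bar\alpha_{t-1})}
= \frac{\alpha_t - \bar\alpha_t + 1 - \alpha_t}{\beta_t(1-\bar\alpha_{t-1})}
= \frac{1-\bar\alpha_t}{\beta_t(1-\bar\alpha_{t-1})}
\;\Rightarrow\;
\tilde\beta_t = \frac{1 - \bar \alpha_{t-1}}{1 - \bar \alpha_t}\,\beta_t
$$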
Therefore the first argument of the KL term in $L_t$ can be written as:
$$
q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0) = \mathcal{N}(\mathbf x_{t-1}; \color{blue}{\tilde \mu(\mathbf x_t, \mathbf x_0)}, \color{red}{\tilde \beta_t \mathbf I)}
$$
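In code, these posterior statistics translate into something like the following (a hypothetical q_posterior helper following the formulas above; it is not part of the DenoiseDiffusion class shown earlier and assumes every entry of t is at least 1):

def q_posterior(self, x0: torch.Tensor, xt: torch.Tensor, t: torch.Tensor):
    """Mean and variance of q(x_{t-1} | x_t, x_0); assumes t >= 1 everywhere."""
    alpha = gather(self.alpha, t)
    alpha_bar = gather(self.alpha_bar, t)
    alpha_bar_prev = gather(self.alpha_bar, t - 1)
    beta = gather(self.beta, t)
    # tilde_mu_t = sqrt(alpha_t)(1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * x_t
    #            + sqrt(alpha_bar_{t-1}) * beta_t / (1 - alpha_bar_t) * x_0
    mean = (alpha ** 0.5) * (1 - alpha_bar_prev) / (1 - alpha_bar) * xt \
           + (alpha_bar_prev ** 0.5) * beta / (1 - alpha_bar) * x0
    # tilde_beta_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
    var = (1 - alpha_bar_prev) / (1 - alpha_bar) * beta
    return mean, var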
Next, the second argument of the KL term in $L_t$:
$$
p_\theta(\mathbf x_{t-1} \vert \mathbf x_t) = \mathcal{N}(\mathbf x_{t-1}; \mu_\theta(\mathbf x_t, t), \Sigma_\theta(\mathbf x_t, t))
$$
where:
$$
\begin{aligned}
\mu_\theta(\mathbf x_t, t) &= \color{cyan}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_\theta(\mathbf x_t, t) \Big)} \\
\text{Thus }\mathbf x_{t-1} &\sim \mathcal{N}(\mathbf x_{t-1}; \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_\theta(\mathbf x_t, t) \Big), \Sigma_\theta(\mathbf x_t, t))
\end{aligned}
$$
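The specific form of $\mu_\theta$ comes from solving the forward closed form for $x_0 = \frac{1}{\sqrt{\bar\alpha_t}}(x_t - \sqrt{1-\bar\alpha_t}\,z)$, substituting it into $\tilde\mu_t(x_t, x_0)$, and letting the network predict $z$:
$$
\begin{aligned}
\tilde \mu_t
&= \frac{\sqrt{\alpha_t}(1 - \bar \alpha_{t-1})}{1 - \bar \alpha_t} x_t
 + \frac{\sqrt{\bar \alpha_{t-1}}\beta_t}{1 - \bar \alpha_t}\cdot\frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - \sqrt{1-\bar\alpha_t}\,z\big) \\
&= \frac{\alpha_t(1 - \bar \alpha_{t-1}) + \beta_t}{\sqrt{\alpha_t}(1 - \bar \alpha_t)} x_t
 - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\, z
 = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, z\Big)
\end{aligned}
$$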

Looking at the code, eps_model(xt, t) does not directly output the reverse-decoded image ($x_t \rightarrow x_{t-1}$); it predicts $z_{\theta}(x_t, t)$. Conveniently, when the loss is computed, $x_t$ cancels out and the loss reduces to the MSE between $z_t\sim \mathcal{N}(0,I)$ and $z_\theta(x_t, t)$.

Therefore $L_t$ can be written as a mean-squared-error between $\color{blue}{\tilde \mu_t(x_t, x_0)}$ and $\color{green}{\mu_\theta(x_t, t)}$:
$$
\begin{aligned}
L_t &= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{1}{2 | \Sigma_\theta(\mathbf x_t, t) |^2_2} | \color{blue}{\tilde \mu_t(\mathbf x_t, \mathbf x_0)} - \color{green}{\mu_\theta(\mathbf x_t, t)} |^2 \Big] \\
&= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{1}{2 |\Sigma_\theta |^2_2} | \color{blue}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_t \Big)} - \color{green}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_\theta(\mathbf x_t, t) \Big)} |^2 \Big] \\
&= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{ \beta_t^2 }{2 \alpha_t (1 - \bar \alpha_t) | \Sigma_\theta |^2_2} |\mathbf z_t - \mathbf z_\theta(\mathbf x_t, t)|^2 \Big] \\
&= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{ \beta_t^2 }{2 \alpha_t (1 - \bar \alpha_t) | \Sigma_\theta |^2_2} |\mathbf z_t - \mathbf z_\theta(\sqrt{\bar \alpha_t}\mathbf x_0 + \sqrt{1 - \bar \alpha_t}\mathbf z_t, t)|^2 \Big]
\end{aligned}
$$
After simplification (dropping the weighting coefficient), this becomes:
$$
L_t^\text{simple} = \mathbb E_{\mathbf x_0, \mathbf z_t} \Big[|\mathbf z_t - \mathbf z_\theta(\sqrt{\bar \alpha_t}\mathbf x_0 + \sqrt{1 - \bar \alpha_t}\mathbf z_t, t)|^2 \Big]
$$
The code for the reverse process is as follows:


class DenoiseDiffusion(nn.Module):
    ...

    def p_sample(self, xt: torch.Tensor, t: torch.Tensor):
        """ The second argument of the KL term $L_t$ above.
        #### Sample from $\textcolor{cyan}{p_\theta}(x_{t-1}|x_t)$
        \begin{align}
        \textcolor{cyan}{p_\theta}(x_{t-1} | x_t) &= \mathcal{N}\big(x_{t-1};
        \textcolor{cyan}{\mu_\theta}(x_t, t), \sigma_t^2 \mathbf{I} \big) \\
        \textcolor{cyan}{\mu_\theta}(x_t, t)
        &= \frac{1}{\sqrt{\alpha_t}} \Big(x_t -
        \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\textcolor{cyan}{\epsilon_\theta}(x_t, t) \Big)
        \end{align}
        """
        # $\textcolor{cyan}{\epsilon_\theta}(x_t, t)$, i.e. $z_\theta(x_t, t)$ in the formulas above
        eps_theta = self.eps_model(xt, t)
        # gather $\bar\alpha_t$
        alpha_bar = gather(self.alpha_bar, t)
        # $\alpha_t$
        alpha = gather(self.alpha, t)
        # $\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}$
        eps_coef = (1 - alpha) / (1 - alpha_bar) ** .5
        # $\frac{1}{\sqrt{\alpha_t}} \Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\textcolor{cyan}{\epsilon_\theta}(x_t, t) \Big)$
        mean = 1 / (alpha ** 0.5) * (xt - eps_coef * eps_theta)
        # $\sigma^2$
        var = gather(self.sigma2, t)

        # $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
        eps = torch.randn(xt.shape, device=xt.device)
        # Sample
        return mean + (var ** .5) * eps

    def loss(self, x0: torch.Tensor, noise: Optional[torch.Tensor] = None):
        """
        #### Simplified Loss
        $$L_\text{simple}(\theta) = \mathbb{E}_{t,x_0, \epsilon} \Bigg[ \bigg\Vert
        \epsilon - \textcolor{cyan}{\epsilon_\theta}(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)
        \bigg\Vert^2 \Bigg]$$
        """
        # Get batch size
        batch_size = x0.shape[0]
        # Get random $t$ for each sample in the batch
        t = torch.randint(0, self.n_steps, (batch_size,), device=x0.device, dtype=torch.long)

        # $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
        if noise is None:
            noise = torch.randn_like(x0)

        # Sample $x_t$ from $q(x_t|x_0)$
        xt = self.q_sample(x0, t, eps=noise)
        # Get $\textcolor{cyan}{\epsilon_\theta}(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)$
        eps_theta = self.eps_model(xt, t)

        # MSE loss, i.e. the simplified loss $L_t^\text{simple}$ above
        return F.mse_loss(noise, eps_theta)

Training and inference code:

During training, each sample in a batch is trained at only one randomly chosen time step, rather than at all $T$ steps.

def sample(self):
    """
    ### Sample images
    """
    with torch.no_grad():
        # $x_T \sim p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})$
        x = torch.randn([self.n_samples, self.image_channels, self.image_size, self.image_size],
                        device=self.device)

        # Remove noise for $T$ steps
        for t_ in monit.iterate('Sample', self.n_steps):
            # $t$
            t = self.n_steps - t_ - 1
            # Sample from $\textcolor{cyan}{p_\theta}(x_{t-1}|x_t)$
            x = self.diffusion.p_sample(x, x.new_full((self.n_samples,), t, dtype=torch.long))

        # Log samples
        tracker.save('sample', x)

def train(self):
    """
    ### Train
    """
    # Iterate through the dataset
    for data in monit.iterate('Train', self.data_loader):
        # Increment global step
        tracker.add_global_step()
        # Move data to device
        data = data.to(self.device)

        # Make the gradients zero
        self.optimizer.zero_grad()
        # Calculate loss
        loss = self.diffusion.loss(data)
        # Compute gradients
        loss.backward()
        # Take an optimization step
        self.optimizer.step()
        # Track the loss
        tracker.save('loss', loss)
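The loop above depends on labml's monit and tracker helpers. A minimal plain-PyTorch equivalent looks roughly like this (a sketch; eps_model, dataset, and all hyperparameters are placeholders):

import torch
from torch.utils.data import DataLoader

# Assumed to exist: eps_model (a noise-prediction network such as a U-Net) and
# dataset (yielding image tensors); DenoiseDiffusion is the class defined above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
eps_model = eps_model.to(device)
diffusion = DenoiseDiffusion(eps_model, n_steps=1000, device=device)
optimizer = torch.optim.Adam(eps_model.parameters(), lr=2e-5)
data_loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(100):
    for x0 in data_loader:
        x0 = x0.to(device)
        optimizer.zero_grad()
        loss = diffusion.loss(x0)   # random t per sample, MSE between noise and prediction
        loss.backward()
        optimizer.step()

# Sampling: start from pure noise and apply p_sample for t = T-1, ..., 0
with torch.no_grad():
    x = torch.randn(16, 3, 32, 32, device=device)
    for step in reversed(range(diffusion.n_steps)):
        t = x.new_full((x.shape[0],), step, dtype=torch.long)
        x = diffusion.p_sample(x, t)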

Improved Denoising Diffusion Probabilistic Models

Reference

Author: Xie Pan

Published: 2022-02-05

Updated: 2022-04-26
