# Paper Notes: Denoising Diffusion Probabilistic Models

• Deep Unsupervised Learning using Nonequilibrium Thermodynamics

• DDPM: Denoising Diffusion Probabilistic Models

• Diffusion Models Beat GANs on Image Synthesis

• Image Super-Resolution via Iterative Refinement

• Cascaded Diffusion Models for High Fidelity Image Generation

• Score-Based Generative Modeling in Latent Space

• Discrete diffusion models

• Structured Denoising Diffusion Models in Discrete State-Spaces
• Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models
• Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes (arXiv:2111.12701)
• parallel generation

• Generative Modeling by Estimating Gradients of the Data Distribution
• Score-Based Generative Modeling through Stochastic Differential Equations
• Denoising Diffusion Implicit Models

### Denoising Diffusion Probabilistic Models

#### Forward diffusion process

$x_t$ is obtained from $x_{t-1}$ by adding Gaussian noise whose scale is set by the variance schedule $\beta_t$:
$$q(x_t|x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_{t}}\,x_{t-1}, \beta_tI)$$

$$q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1})$$

$$\begin{aligned} x_t &= \sqrt{\alpha_t}x_{t-1} + \sqrt{1 - \alpha_t}z_{t-1} & \text{ ;where } z_{t-1}, z_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \color{blue}{\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t-1}}z_{t-2}) + \sqrt{1 - \alpha_t}z_{t-1}}\\ &= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar z_{t-2} & \text{ ;where } \bar z_{t-2} \text{ merges two Gaussians (*).} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}z \\ q(x_t \vert x_0) &= \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t)\mathbf{I}) \end{aligned}$$

$$\begin{aligned} \sqrt{\alpha_t(1-\alpha_{t-1})}z_{t-2} + \sqrt{1-\alpha_t}z_{t-1} &= \sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} \bar z_{t-2} \\ &= \sqrt{1 - \alpha_t\alpha_{t-1}}\bar z_{t-2}, \quad \bar z_{t-2} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \end{aligned}$$
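This variance-merging step can be sanity-checked numerically; the values for $\alpha_t$ and $\alpha_{t-1}$ below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
a_t, a_tm1 = 0.9, 0.8  # illustrative alpha_t, alpha_{t-1}

z1 = rng.standard_normal(1_000_000)
z2 = rng.standard_normal(1_000_000)
# Sum of the two independent scaled Gaussians from the derivation above.
merged = np.sqrt(a_t * (1 - a_tm1)) * z1 + np.sqrt(1 - a_t) * z2

# Its variance should equal 1 - alpha_t * alpha_{t-1}.
print(merged.var(), 1 - a_t * a_tm1)
```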

$$q(x_t|x_0)=\mathcal{N}(x_t;\sqrt{\bar \alpha_t}x_0, (1-\bar{\alpha}_t)I)$$
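The equivalence between iterating $q(x_t \vert x_{t-1})$ step by step and sampling $q(x_t \vert x_0)$ in closed form can be checked by comparing sample moments. A minimal sketch, using the linear $\beta$ schedule from the DDPM paper; the array sizes and timestep are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule as in DDPM
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample_stepwise(x0, t, rng):
    """Sample x_t by applying q(x_t | x_{t-1}) t times."""
    x = x0
    for s in range(t):
        z = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * z
    return x

def q_sample_closed_form(x0, t, rng):
    """Sample x_t directly: q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * z

# Both routes sample the same distribution: compare empirical moments.
x0 = np.ones(50_000)   # many scalar "pixels" with the same starting value
t = 200
xa = q_sample_stepwise(x0, t, rng)
xb = q_sample_closed_form(x0, t, rng)
print(xa.mean(), xb.mean())   # both close to sqrt(abar_t)
print(xa.var(), xb.var())     # both close to 1 - abar_t
```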

$p(x) = \dfrac{1}{\sqrt{2\pi}}\exp\left(\dfrac{-x^2}{2}\right)$

$p(x,y) = p(x)p(y)=\dfrac{1}{2\pi}\exp\left(-\dfrac{x^2+y^2}{2}\right)$. The step that merges the two noise terms in the derivation above is based on this product-of-independent-Gaussians identity.

$p(v)=\dfrac{1}{2\pi}\exp\left(-\dfrac{1}{2}v^Tv\right)\quad$ Clearly, when $x$ and $y$ are independent, this is exactly the two-dimensional standard normal density above.

$p(x)=\dfrac{|A|}{2\pi}exp[-\dfrac{1}{2}(x-\mu)^TA^TA(x-\mu)]$

In two dimensions this becomes

$$p(\mathbf{x}) = \frac{1}{2\pi|\Sigma|^{1/2}} \exp \left[ -\frac{1}{2} (\mathbf{x} - \mu) ^T \Sigma^{-1} (\mathbf{x} - \mu) \right]$$

and for a general $n$-dimensional Gaussian:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp \left[ -\frac{1}{2} (\mathbf{x} - \mu) ^T \Sigma^{-1} (\mathbf{x} - \mu) \right]$$

$\Sigma_{ij} = \mathrm{cov}(X_i,X_j) = \mathbb{E}[(X_i - \mathbb{E}[X_i]) (X_j - \mathbb{E}[X_j])]$
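These definitions can be verified empirically. A small sketch where the mean, covariance matrix, and evaluation point are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2-D Gaussian with a chosen mean and covariance.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)

# Empirical covariance matches Sigma_ij = E[(X_i - E X_i)(X_j - E X_j)].
Sigma_hat = np.cov(X, rowvar=False)
print(Sigma_hat)

def mvn_pdf(x, mu, Sigma):
    """Density via the general n-dimensional Gaussian formula."""
    n = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

# Peak density equals 1 / (2*pi*sqrt(det(Sigma))) in 2-D.
print(mvn_pdf(mu, mu, Sigma))
```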

#### Reverse denoising Process

$$\begin{aligned} p(x_T) &= \mathcal{N}(x_T;0,I) \\ p_\theta(\mathbf x_{0:T}) &= p(\mathbf x_T) \prod^T_{t=1} p_\theta(\mathbf x_{t-1} \vert \mathbf x_t) \end{aligned}$$
$x_T \sim \mathcal{N}(0, I)$; when each $\beta_t$ is small, the reverse conditional $q(x_{t-1} \vert x_t)$ is also approximately Gaussian, so it can be modeled as
$$p_\theta(x_{t-1} \vert x_t) = \mathcal{N}( x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta( x_t, t))$$

$$\begin{aligned} -\log p_\theta(x_0) &\leq - \log p_\theta(x_0) + D_\text{KL}(q(x_{1:T}\vert x_0) \parallel p_\theta(x_{1:T}\vert x_0) ) \\ &= -\log p_\theta(x_0) + \mathbb E_{x_{1:T}\sim q(x_{1:T}\vert x_0)} \Big[ \log\frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \Big] \\ &= -\log p_\theta(x_0) + \mathbb E_q \Big[ \log\frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0) \Big] \\ &= \mathbb E_q \Big[ \log \frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})} \Big] \\ \text{Let }L_\text{VLB} &= \mathbb E_{q(x_{0:T})} \Big[ \log \frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})} \Big] \end{aligned}$$

$D_\text{KL}(q\parallel p)=H(q,p)-H(q)=\mathbb{E}_{x\sim q(x)}\Big[\ln\dfrac{1}{p(x)} - \ln\dfrac{1}{q(x)}\Big]=\mathbb{E}_{x\sim q(x)}\Big[\ln\dfrac{q(x)}{p(x)}\Big]$

Here $q$ is the true distribution and $p$ is the predicted distribution to be learned.
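As a concrete check of $D_\text{KL}(q\parallel p)=\mathbb{E}_{x\sim q}[\ln\frac{q(x)}{p(x)}]$, the Monte Carlo estimate under $q$ matches the known closed form for two 1-D Gaussians; the parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two 1-D Gaussians (illustrative parameters): q = N(0,1), p = N(1,4).
mu_q, s_q = 0.0, 1.0
mu_p, s_p = 1.0, 2.0

def log_pdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu) ** 2 / (2 * s**2)

# Monte Carlo: E_{x~q}[log q(x) - log p(x)].
x = rng.normal(mu_q, s_q, 1_000_000)
kl_mc = np.mean(log_pdf(x, mu_q, s_q) - log_pdf(x, mu_p, s_p))

# Closed form for KL between two univariate Gaussians.
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p) ** 2) / (2 * s_p**2) - 0.5
print(kl_mc, kl_exact)
```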

$$\begin{aligned} L_\text{VLB} &= \mathbb E_{q(\mathbf x_{0:T})} \Big[ \log\frac{q(\mathbf x_{1:T}\vert\mathbf x_0)}{p_\theta(\mathbf x_{0:T})} \Big] \\ &= \mathbb E_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf x_t\vert\mathbf x_{t-1})}{ p_\theta(\mathbf x_T) \prod_{t=1}^T p_\theta(\mathbf x_{t-1} \vert\mathbf x_t) } \Big] \\ &= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=1}^T \log \frac{q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} \Big] \\ &= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \frac{q(\mathbf x_t\vert\mathbf x_{t-1})}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} + \log\frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\ &= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)}\cdot \frac{q(\mathbf x_t \vert \mathbf x_0)}{q(\mathbf x_{t-1}\vert\mathbf x_0)} \Big) + \log \frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\ &= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} + \sum_{t=2}^T \log \frac{q(\mathbf x_t \vert \mathbf x_0)}{q(\mathbf x_{t-1} \vert \mathbf x_0)} + \log\frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\ &= \mathbb E_q \Big[ -\log p_\theta(\mathbf x_T) + \sum_{t=2}^T \log \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} + \log\frac{q(\mathbf x_T \vert \mathbf x_0)}{q(\mathbf x_1 \vert \mathbf x_0)} + \log \frac{q(\mathbf x_1 \vert \mathbf x_0)}{p_\theta(\mathbf x_0 \vert \mathbf x_1)} \Big] \\ &= \mathbb E_q \Big[ \log\frac{q(\mathbf x_T \vert \mathbf x_0)}{p_\theta(\mathbf x_T)} + \sum_{t=2}^T \log \frac{q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)}{p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)} - \log p_\theta(\mathbf x_0 \vert \mathbf x_1) \Big] \\ &= \mathbb E_q \Big[D_\text{KL}(q(\mathbf x_T \vert \mathbf x_0) \parallel p_\theta(\mathbf x_T)) + \sum_{t=2}^T D_\text{KL}(q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0) \parallel p_\theta(\mathbf x_{t-1} \vert\mathbf x_t)) - \log p_\theta(\mathbf x_0 \vert \mathbf x_1)\Big] \end{aligned}$$

$$q(z_1,z_2|x) = q(z_2|z_1,x)\,q(z_1|x) = q(z_2|z_1)\,q(z_1|x)$$
where the last step uses the Markov property. The VLB can be split into the following three kinds of terms:
$$\begin{aligned} L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\ \text{where } L_T &= D_\text{KL}(q(\mathbf x_T \vert \mathbf x_0) \parallel p_\theta(\mathbf x_T)) \\ L_{t-1} &= D_\text{KL}(q(\mathbf x_{t-1} \vert \mathbf x_{t}, \mathbf x_0) \parallel p_\theta(\mathbf x_{t-1} \vert\mathbf x_{t})) \text{ for }2 \leq t \leq T \\ L_0 &= - \log p_\theta(\mathbf x_0 \vert \mathbf x_1) \end{aligned}$$

$$q(x_{t-1}|x_t, x_0)\,q(x_t|x_0)= q(x_{t-1}, x_t|x_0) = q(x_t|x_{t-1}, x_0)\,q(x_{t-1}|x_0)$$

$$\begin{aligned} q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0) &= q(\mathbf x_t \vert \mathbf x_{t-1}, \mathbf x_0) \frac{ q(\mathbf x_{t-1} \vert \mathbf x_0) }{ q(\mathbf x_t \vert \mathbf x_0) } , \color{blue}{\text{the first factor is the forward transition } q(x_t\vert x_{t-1}), \text{ the other two use } q(x_t\vert x_0)}\\ &\color{blue}{\propto \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, \beta_t I)\, \mathcal{N}(x_{t-1};\sqrt{\bar \alpha_{t-1}}x_0, (1-\bar{\alpha}_{t-1})I) /\mathcal{N}(x_t;\sqrt{\bar \alpha_t}x_0, (1-\bar{\alpha}_t)I)} \\ &\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf x_t - \sqrt{\alpha_t} \mathbf x_{t-1})^2}{\beta_t} + \frac{(\mathbf x_{t-1} - \sqrt{\bar \alpha_{t-1}} \mathbf x_0)^2}{1-\bar \alpha_{t-1}} - \frac{(\mathbf x_t - \sqrt{\bar \alpha_t} \mathbf x_0)^2}{1-\bar \alpha_t} \big) \Big) \\ &= \exp\Big( -\frac{1}{2} \big( \color{red}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}})} \mathbf x_{t-1}^2 - \color{blue}{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf x_t + \frac{2\sqrt{\bar \alpha_{t-1}}}{1 - \bar \alpha_{t-1}} \mathbf x_0)} \mathbf x_{t-1} + C(\mathbf x_t, \mathbf x_0) \big) \Big) \end{aligned}$$

$$\begin{aligned} \tilde \beta_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}}) = \frac{1 - \bar \alpha_{t-1}}{1 - \bar \alpha_t} \cdot \beta_t, \quad \text{substituting } \alpha_t=1-\beta_t,\ \bar \alpha_t=\prod_{s=1}^{t}\alpha_s \\ \tilde \mu_t (\mathbf x_t, \mathbf x_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf x_t + \frac{\sqrt{\bar \alpha_{t-1}}}{1 - \bar \alpha_{t-1}} \mathbf x_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar \alpha_{t-1}}) = \frac{\sqrt{\alpha_t}(1 - \bar \alpha_{t-1})}{1 - \bar \alpha_t} \mathbf x_t + \frac{\sqrt{\bar \alpha_{t-1}}\beta_t}{1 - \bar \alpha_t} \mathbf x_0 \end{aligned}$$

$$q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0) = \mathcal{N}(\mathbf x_{t-1}; \color{blue}{\tilde \mu_t(\mathbf x_t, \mathbf x_0)}, \color{red}{\tilde \beta_t \mathbf I})$$
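The completed-square expressions and the simplified closed forms for $\tilde\beta_t$ and the two coefficients in $\tilde\mu_t$ agree numerically. A sketch assuming the linear schedule; the timestep is arbitrary:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 500  # arbitrary timestep
beta_t, alpha_t = betas[t], alphas[t]
abar_t, abar_tm1 = abar[t], abar[t - 1]

# Posterior variance: completed-square form vs simplified closed form.
denom = alpha_t / beta_t + 1.0 / (1 - abar_tm1)
beta_tilde_raw = 1.0 / denom
beta_tilde = (1 - abar_tm1) / (1 - abar_t) * beta_t
print(beta_tilde_raw, beta_tilde)

# Posterior mean: coefficients on x_t and x_0, both forms.
coef_xt_raw = (np.sqrt(alpha_t) / beta_t) / denom
coef_x0_raw = (np.sqrt(abar_tm1) / (1 - abar_tm1)) / denom
coef_xt = np.sqrt(alpha_t) * (1 - abar_tm1) / (1 - abar_t)
coef_x0 = np.sqrt(abar_tm1) * beta_t / (1 - abar_t)
print(coef_xt_raw, coef_xt)
print(coef_x0_raw, coef_x0)
```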

$$p_\theta(\mathbf x_{t-1} \vert \mathbf x_t) = \mathcal{N}(\mathbf x_{t-1}; \mu_\theta(\mathbf x_t, t), \Sigma_\theta(\mathbf x_t, t))$$

$$\begin{aligned} \mu_\theta(\mathbf x_t, t) &= \color{cyan}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_\theta(\mathbf x_t, t) \Big)} \\ \text{Thus }\mathbf x_{t-1} &\sim \mathcal{N}(\mathbf x_{t-1}; \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_\theta(\mathbf x_t, t) \Big), \Sigma_\theta(\mathbf x_t, t)) \end{aligned}$$
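This parameterization comes from substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\,z)/\sqrt{\bar\alpha_t}$ into $\tilde\mu_t(x_t, x_0)$. A numeric check (assumed linear schedule, arbitrary timestep): when the *true* noise is plugged in for $z_\theta$, the resulting mean reproduces the posterior mean exactly.

```python
import numpy as np

rng = np.random.default_rng(4)

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

t = 300
x0 = rng.standard_normal(10)
z = rng.standard_normal(10)
xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * z  # forward sample

# mu_theta with the true noise plugged in ...
mu_from_noise = (xt - betas[t] / np.sqrt(1 - abar[t]) * z) / np.sqrt(alphas[t])

# ... equals the true posterior mean mu_tilde(x_t, x_0).
mu_tilde = (np.sqrt(alphas[t]) * (1 - abar[t - 1]) / (1 - abar[t]) * xt
            + np.sqrt(abar[t - 1]) * betas[t] / (1 - abar[t]) * x0)
print(np.max(np.abs(mu_from_noise - mu_tilde)))
```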

$$\begin{aligned} L_t &= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{1}{2 \| \Sigma_\theta(\mathbf x_t, t) \|^2_2} \| \color{blue}{\tilde \mu_t(\mathbf x_t, \mathbf x_0)} - \color{green}{\mu_\theta(\mathbf x_t, t)} \|^2 \Big] \\ &= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{1}{2 \|\Sigma_\theta \|^2_2} \| \color{blue}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_t \Big)} - \color{green}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar \alpha_t}} \mathbf z_\theta(\mathbf x_t, t) \Big)} \|^2 \Big] \\ &= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{ \beta_t^2 }{2 \alpha_t (1 - \bar \alpha_t) \| \Sigma_\theta \|^2_2} \|\mathbf z_t - \mathbf z_\theta(\mathbf x_t, t)\|^2 \Big] \\ &= \mathbb E_{\mathbf x_0, \mathbf z} \Big[\frac{ \beta_t^2 }{2 \alpha_t (1 - \bar \alpha_t) \| \Sigma_\theta \|^2_2} \|\mathbf z_t - \mathbf z_\theta(\sqrt{\bar \alpha_t}\mathbf x_0 + \sqrt{1 - \bar \alpha_t}\mathbf z_t, t)\|^2 \Big] \end{aligned}$$

$$L_t^\text{simple} = \mathbb E_{\mathbf x_0, \mathbf z_t} \Big[\|\mathbf z_t - \mathbf z_\theta(\sqrt{\bar \alpha_t}\mathbf x_0 + \sqrt{1 - \bar \alpha_t}\mathbf z_t, t)\|^2 \Big]$$
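A minimal sketch of one step of the simplified objective, with a hypothetical stand-in `z_theta` for the learned noise-prediction network (everything here except the loss formula is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)

betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

def z_theta(xt, t):
    """Hypothetical stand-in for the noise-prediction network."""
    return np.zeros_like(xt)  # untrained: always predicts zero noise

def simple_loss(x0, rng):
    """One Monte Carlo sample of L_simple: noise, diffuse, predict, MSE."""
    t = rng.integers(1, len(betas))            # uniform random timestep
    z = rng.standard_normal(x0.shape)          # target noise
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * z
    return np.mean((z - z_theta(xt, t)) ** 2)

x0 = rng.standard_normal(64)
loss = simple_loss(x0, rng)
print(loss)  # expectation 1 for the zero predictor, since E||z||^2 / d = 1
```

In training, `z_theta` would be a neural network and this loss would be averaged over a minibatch of $(x_0, t, z)$ samples and minimized by gradient descent.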

Xie Pan

2022-02-05

2022-04-26