Paper Notes: Discrete Denoising Diffusion Models

  • Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models.

  • Structured Denoising Diffusion Models in Discrete State-Spaces.

  • Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes (arXiv:2111.12701)

  • Vector Quantized Diffusion Model for Text-to-Image Synthesis, CVPR2022

Structured Denoising Diffusion Models in Discrete State-Spaces

Contribution:

  • Most recent work has focused on Gaussian diffusion processes operating in continuous state spaces (e.g., real-valued image and waveform data). This work improves and extends discrete diffusion models by using more structured categorical corruption processes to shape data generation.

  • Corruption processes: three different forward corruption processes are proposed:

    • transition matrices that mimic Gaussian kernels in continuous space

    • matrices based on nearest neighbors in embedding space

    • matrices that introduce absorbing states

  • A new loss is also proposed that combines the variational lower bound with an auxiliary cross-entropy loss.

image-20220420110714319

forward diffusion pass

$$
\begin{aligned}
Q^t_{ij} &= q(x_t=j|x_{t-1}=i) \\
q(x_t|x_{t-1})&=\mathcal{Cat}(x_t;p=x_{t-1}Q_t)
\end{aligned}
$$

The continuous-space counterpart is $q(x_t|x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_{t}}x_{t-1}, \beta_tI)$. In continuous space the noise is controlled by $\beta_t$; how should noise be added in a discrete space?
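As a concrete illustration (my own sketch, not from the paper), one forward step of a discrete diffusion simply samples from the categorical distribution given by the row of $Q_t$ selected by $x_{t-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, Q_t):
    """Sample x_t ~ Cat(x_t; p = x_{t-1} Q_t) for a batch of token indices.

    x_prev: (N,) integer indices for x_{t-1}
    Q_t:    (K, K) row-stochastic matrix, Q_t[i, j] = q(x_t = j | x_{t-1} = i)
    """
    probs = Q_t[x_prev]                      # (N, K): one categorical per token
    cum = probs.cumsum(axis=-1)
    u = rng.random((len(x_prev), 1))
    return (u < cum).argmax(axis=-1)         # inverse-CDF categorical sampling

# toy example with an arbitrary row-stochastic matrix
K = 4
Q_t = rng.random((K, K))
Q_t /= Q_t.sum(axis=1, keepdims=True)
print(forward_step(np.array([0, 2, 3, 3]), Q_t))
```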

The authors propose four types of transition matrices:

  1. Diffusion with an absorbing state. The transition matrix (with $m$ denoting the [MASK]/absorbing state) is

$$
[Q_t]_{ij}=\begin{cases}1 & \text{if } i=j=m\\ 1-\beta_t & \text{if } i=j\ne m\\ \beta_t & \text{if } j=m,\ i\ne m\\ 0 & \text{otherwise}\end{cases}
$$

     In other words, each token is kept with probability $1-\beta_t$ and masked with probability $\beta_t$; once masked it stays masked (a NumPy sketch of the first two matrices follows this list).

  2. Uniform diffusion, $Q_t = (1-\beta_t)I + \dfrac{\beta_t}{K}\mathbb{1}\mathbb{1}^{\top}$:

$$
[Q_t]_{ij}=\begin{cases}1-\dfrac{K-1}{K}\beta_t & \text{if } i=j\\ \dfrac{\beta_t}{K} & \text{if } i\ne j\end{cases}
$$

     That is, a token stays unchanged with probability $1-\dfrac{K-1}{K}\beta_t$ and moves to each of the other $K-1$ tokens with probability $\dfrac{\beta_t}{K}$.

  3. Discretized Gaussian transition matrices.
  4. Structured diffusion in text: using word-embedding distance to introduce locality.
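As referenced above, a minimal NumPy sketch (my own, following the two definitions; the last token index is assumed to be the [MASK]/absorbing state) that builds both matrices and checks that every row is a valid distribution:

```python
import numpy as np

def uniform_Q(K, beta_t):
    """Uniform diffusion: keep a token with prob 1 - (K-1)/K * beta_t,
    move to each of the other K-1 tokens with prob beta_t / K."""
    return (1 - beta_t) * np.eye(K) + beta_t / K * np.ones((K, K))

def absorbing_Q(K, beta_t, mask_id):
    """Absorbing-state diffusion: keep a token with prob 1 - beta_t,
    jump to [MASK] with prob beta_t; [MASK] always stays [MASK]."""
    Q = (1 - beta_t) * np.eye(K)
    Q[:, mask_id] += beta_t
    Q[mask_id, :] = 0.0
    Q[mask_id, mask_id] = 1.0
    return Q

K, beta_t = 6, 0.2
mask_id = K - 1                                   # assume the last token id is [MASK]
for Q in (uniform_Q(K, beta_t), absorbing_Q(K, beta_t, mask_id)):
    assert np.allclose(Q.sum(axis=1), 1.0)        # each row is a distribution
```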

High-Resolution Image Synthesis with Latent Diffusion Models

Perceptual Image Compression

image-20220409135731413

Latent Diffusion Models

image-20220409135426726

This paper does not use discrete tokens; it works directly on continuous latent features.

Unleashing Transformers

Two-stage training: the first stage learns latent codes with a VQ-VAE or VQGAN.

The second stage is BERT-like training using absorbing-state diffusion. Specifically, at each forward time step:

image-20211216111406089

The ELBO is parameterized as:

image-20211216141119468

The reweighted objective is:

image-20211216141852217

Compared with directly maximizing the ELBO above, the authors attach a weight to the KL-divergence term of each timestep.
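A rough sketch of what one training step of this kind looks like (my own simplification, not the paper's code): sample a timestep $t$, mask each token independently so that the marginal masking probability is $t/T$ (which corresponds to the absorbing schedule $\beta_t = 1/(T-t+1)$), and compute a cross-entropy loss on the masked positions; the $1/t$ factor is the usual absorbing-state ELBO weighting, and `toy_logits` is a dummy stand-in for the transformer denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_logits(x_t, K):
    """Dummy stand-in for the transformer p_theta(x_0 | x_t): uniform logits."""
    return np.zeros((len(x_t), K))

def training_step(x_0, K, T, mask_id):
    """One absorbing-diffusion training step (simplified sketch)."""
    t = rng.integers(1, T + 1)                    # sample a timestep
    mask = rng.random(len(x_0)) < t / T           # marginal mask probability at step t
    x_t = np.where(mask, mask_id, x_0)
    logits = toy_logits(x_t, K)
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -log_probs[np.arange(len(x_0)), x_0]    # cross entropy against the clean tokens
    return (nll * mask).sum() / t                 # masked positions only, ELBO-style 1/t weight

K, T = 1024, 256
mask_id = K                                       # the [MASK] id appended after the K codes
x_0 = rng.integers(0, K, size=32)                 # a toy sequence of VQ code indices
print(training_step(x_0, K, T, mask_id))
```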

Generating High-Resolution Images: the method can generate globally consistent images substantially larger than those in the training data.

image-20211216143138974


Improving Code Representations

image-20211216143433013

Vector Quantized Diffusion Model for Text-to-Image Synthesis

image-20220423192804839

Compared with continuous data, discrete data leads to a clearly different forward diffusion process. The previous paper, Unleashing Transformers, simply masks $x_0$ at random to obtain $x_t$, so there is no relation between $x_{t-1}$ and $x_t$ and the corruption is not really a Markov chain. This paper gives a forward diffusion process that is better suited to discrete data.

forward diffusion process

$$
\begin{aligned}
[Q_t]_{mn} &= q(x_t=m|x_{t-1}=n), \qquad Q_t \in R^{K\times K} \\
q(x_t|x_{t-1}) &= v^T(x_t)\,Q_t\,v(x_{t-1}) \in R
\end{aligned}
$$

$v(x) \in \{0,1\}^{K}$ is a one-hot column vector, and the distribution over $x_t$ given $x_{t-1}$ is the vector $Q_t\,v(x_{t-1})$.
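A tiny sanity check (mine, not from the paper): with one-hot column vectors $v(\cdot)$, the quadratic form $v^{T}(x_t)\,Q_t\,v(x_{t-1})$ is simply the matrix entry $Q_t[x_t, x_{t-1}]$.

```python
import numpy as np

K = 5
Q_t = np.random.default_rng(0).random((K, K))
Q_t /= Q_t.sum(axis=0, keepdims=True)   # column n is the distribution q(x_t | x_{t-1}=n)

def v(x, K):
    e = np.zeros((K, 1))
    e[x] = 1.0
    return e

x_t, x_prev = 3, 1
lhs = (v(x_t, K).T @ Q_t @ v(x_prev, K))[0, 0]
assert np.isclose(lhs, Q_t[x_t, x_prev])
```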

So how does $Q_t$ inject noise?

[Structured DDM] proposes adding uniform noise to the categorical distribution, which can be written as:

$$
Q_t = \begin{bmatrix}
\alpha_t+\beta_t & \beta_t & \cdots & \beta_t \\
\beta_t & \alpha_t+\beta_t & \cdots & \beta_t \\
\vdots & \vdots & \ddots & \vdots \\
\beta_t & \beta_t & \cdots & \alpha_t+\beta_t
\end{bmatrix}
$$

where $\alpha_t \in [0,1]$ and $\beta_t = (1-\alpha_t)/K$, so that each column sums to one ($\alpha_t + K\beta_t = 1$); $\alpha_t$ varies continuously with the noise schedule, it is not restricted to 0 or 1.

Suppose $x_{t-1}=[0,1,0,\ldots]$; then the distribution of $x_t$ is $[\beta_t, \beta_t+\alpha_t, \beta_t, \ldots, \beta_t]$. In other words, after one diffusion step the original token is kept with probability $\alpha_t+\beta_t$, and probability $\beta_t$ is assigned to each of the other $K-1$ tokens in the vocabulary.

In this paper, the authors argue that the above approach has two problems:

  • an image token may be replaced to an utterly uncorrelated category, which leads to an abrupt semantic change for that token.
  • the network has to take extra efforts to figure out the tokens that have been replaced prior to fixing them.

To address these two problems, the authors propose a new noising scheme, the mask-and-replace diffusion strategy: an extra [MASK] token is added, so that $Q_t\in R^{(K+1) \times (K+1)}$:

$$
Q_t = \begin{bmatrix}
\alpha_t+\beta_t & \beta_t & \cdots & \beta_t & 0 \\
\beta_t & \alpha_t+\beta_t & \cdots & \beta_t & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\beta_t & \beta_t & \cdots & \alpha_t+\beta_t & 0 \\
\gamma_t & \gamma_t & \cdots & \gamma_t & 1
\end{bmatrix}
$$

Again suppose $x_{t-1}=[0,1,0,\ldots,0]$; then the distribution of $x_t$ is $[\beta_t, \beta_t+\alpha_t, \beta_t, \ldots, \beta_t, \gamma_t]$. That is, after one diffusion step the token is replaced by [MASK] with probability $\gamma_t$, kept unchanged with probability $\alpha_t+\beta_t$, and moved to each of the other $K-1$ tokens with probability $\beta_t$.

If $x_{t-1} = [0,0,\ldots,0,1] = $ [MASK], then $x_t=[0,0,\ldots,0,1]=$ [MASK] as well: once a token is masked it stays masked, so for large enough $T$ the whole sequence becomes [MASK].
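A small sketch (my own) of the mask-and-replace transition matrix described above, with the $(K{+}1)$-th state reserved for [MASK]; it checks that every column is a valid distribution and that [MASK] is absorbing.

```python
import numpy as np

def mask_and_replace_Q(K, alpha_t, gamma_t):
    """(K+1)x(K+1) column-stochastic matrix: column n is q(x_t | x_{t-1}=n)."""
    beta_t = (1 - alpha_t - gamma_t) / K
    Q = np.full((K + 1, K + 1), beta_t)
    Q[np.arange(K), np.arange(K)] += alpha_t   # keep the same token with prob alpha_t + beta_t
    Q[K, :K] = gamma_t                         # a real token jumps to [MASK] with prob gamma_t
    Q[:, K] = 0.0                              # ...and [MASK] never leaves
    Q[K, K] = 1.0
    return Q

K, alpha_t, gamma_t = 8, 0.9, 0.05
Q = mask_and_replace_Q(K, alpha_t, gamma_t)
assert np.allclose(Q.sum(axis=0), 1.0)         # every column sums to one
assert Q[K, K] == 1.0                          # [MASK] is an absorbing state
```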

Advantages of this transition matrix:

  • Corrupted tokens are easy to identify (e.g., those replaced by [MASK]), which eases the reverse process.
  • Compared with mask-only replacement, the authors argue that mixing in a proportion of uniform noise keeps the posterior from becoming trivial.
  • Random replacement forces the model to attend to all tokens, not only the [MASK] ones.

As with continuous DDPMs, we can derive the marginal distribution directly: $q(x_t|x_0)=\bar Q_t\,v(x_0)$, where $\bar Q_t = Q_t\cdots Q_1$.

Working this out gives $\bar Q_t\,v(x_0)=\bar \alpha_t\, v(x_0) + (\bar \gamma_t - \bar\beta_t)\,v(K+1) + \bar \beta_t$,

where $\bar \alpha_t = \prod_{i=1}^t\alpha_i$, $\bar \gamma_t = 1 - \prod_{i=1}^t(1-\gamma_i)$, and $\bar \beta_t = (1- \bar \alpha_t - \bar \gamma_t)/K$.
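A quick numerical check of this closed form (my own sketch, reusing the `mask_and_replace_Q` helper from above with an arbitrary toy schedule): multiply the per-step matrices and compare against $\bar\alpha_t v(x_0) + (\bar\gamma_t - \bar\beta_t)v(K{+}1) + \bar\beta_t$.

```python
import numpy as np

def mask_and_replace_Q(K, alpha_t, gamma_t):       # same helper as in the previous sketch
    beta_t = (1 - alpha_t - gamma_t) / K
    Q = np.full((K + 1, K + 1), beta_t)
    Q[np.arange(K), np.arange(K)] += alpha_t
    Q[K, :K] = gamma_t
    Q[:, K] = 0.0
    Q[K, K] = 1.0
    return Q

K, T = 8, 10
alphas = np.linspace(0.95, 0.6, T)                 # arbitrary schedule, only for the check
gammas = np.linspace(0.01, 0.2, T)

Qbar = np.eye(K + 1)
for a, g in zip(alphas, gammas):
    Qbar = mask_and_replace_Q(K, a, g) @ Qbar      # \bar Q_t = Q_t ... Q_1

x_0 = 3
v0 = np.zeros(K + 1); v0[x_0] = 1.0                # one-hot v(x_0)
vM = np.zeros(K + 1); vM[K] = 1.0                  # one-hot for the [MASK] state

abar = alphas.prod()
gbar = 1 - (1 - gammas).prod()
bbar = (1 - abar - gbar) / K
assert np.allclose(Qbar @ v0, abar * v0 + (gbar - bbar) * vM + bbar)
```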

The authors only sketch the proof in the appendix; here is a detailed derivation of the induction step $\bar Q_{t+1}v(x_0)=Q_{t+1}\bar Q_t v(x_0)$ (of which $\bar Q_2 v(x_0)=Q_2 Q_1 v(x_0)$ is the first instance):

For the entry $x=x_0$, $[\bar Q_{t+1}v(x_0)]_{x} = \bar \alpha_{t+1} + \bar \beta_{t+1}$:

$$
[Q_{t+1}\bar Q_t v(x_0)]_{x_0} = (\alpha_{t+1}+\beta_{t+1})(\bar\alpha_t+\bar\beta_t) + (K-1)\,\beta_{t+1}\bar\beta_t = \bar\alpha_{t+1} + \bigl(\alpha_{t+1}\bar\beta_t + \beta_{t+1}\bar\alpha_t + K\beta_{t+1}\bar\beta_t\bigr) = \bar\alpha_{t+1} + \bar\beta_{t+1},
$$

where the last equality uses $\beta_{t+1}=(1-\alpha_{t+1}-\gamma_{t+1})/K$, $\bar\beta_t=(1-\bar\alpha_t-\bar\gamma_t)/K$ and $\bar\gamma_{t+1}=1-(1-\gamma_{t+1})(1-\bar\gamma_t)$.

For the [MASK] entry $x=K+1$, $[\bar Q_{t+1}v(x_0)]_{x}=\bar \gamma_{t+1}$:

$$
[Q_{t+1}\bar Q_t v(x_0)]_{K+1} = \gamma_{t+1}(\bar\alpha_t+\bar\beta_t) + (K-1)\,\gamma_{t+1}\bar\beta_t + \bar\gamma_t = \gamma_{t+1}(1-\bar\gamma_t) + \bar\gamma_t = \bar\gamma_{t+1}.
$$

For $x\ne x_0$ and $x\ne K+1$, $[\bar Q_{t+1}v(x_0)]_{x}=\bar \beta_{t+1}$:

$$
[Q_{t+1}\bar Q_t v(x_0)]_{x} = \beta_{t+1}(\bar\alpha_t+\bar\beta_t) + (\alpha_{t+1}+\beta_{t+1})\bar\beta_t + (K-2)\,\beta_{t+1}\bar\beta_t = \alpha_{t+1}\bar\beta_t + \beta_{t+1}\bar\alpha_t + K\beta_{t+1}\bar\beta_t = \bar\beta_{t+1}.
$$

reverse denoising process

image-20220425151855685

where $p(x_T)$ is the prior distribution at timestep $T$. For the mask-and-replace diffusion strategy, the prior is:

$$
p(x_T)=\bigl[\bar\beta_T, \bar\beta_T, \ldots, \bar\beta_T, \bar\gamma_T\bigr]^{\top}
$$

Note that since the transition matrix $Q_t$ is fixed during training, $L_T$ is a constant that measures the gap between training and inference and can be ignored during training.

Continuous-data DDPMs train the network to match the posterior $q(\mathbf x_{t-1} \vert \mathbf x_t, \mathbf x_0)$. For discrete data, we can instead let the network directly predict the noiseless token $\tilde{\mathbf x}_0$, i.e. learn $p_\theta(\tilde{\mathbf x}_0 \vert \mathbf x_t)$, and combine it with the fixed posterior:

$$
p_\theta(x_{t-1}\mid x_t)=\sum_{\tilde x_0} q(x_{t-1}\mid x_t, \tilde x_0)\, p_\theta(\tilde x_0\mid x_t)
$$
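A sketch of this $\tilde{x}_0$-parameterized reverse step for a single token (my own illustration; `p_x0` is a dummy stand-in for the network output $p_\theta(\tilde{x}_0|x_t)$). Bayes' rule on the fixed Markov chain gives $q(x_{t-1}{=}k \mid x_t, x_0) = q(x_t \mid x_{t-1}{=}k)\,q(x_{t-1}{=}k \mid x_0)\,/\,q(x_t \mid x_0)$, and the reverse distribution marginalizes this posterior over the predicted $\tilde{x}_0$:

```python
import numpy as np

def mask_and_replace_Q(K, alpha_t, gamma_t):       # same helper as in the sketches above
    beta_t = (1 - alpha_t - gamma_t) / K
    Q = np.full((K + 1, K + 1), beta_t)
    Q[np.arange(K), np.arange(K)] += alpha_t
    Q[K, :K] = gamma_t
    Q[:, K] = 0.0
    Q[K, K] = 1.0
    return Q

def posterior(x_t, x_0, Q_t, Qbar_tm1, Qbar_t):
    """q(x_{t-1} | x_t, x_0) for one token, via Bayes' rule on the fixed chain."""
    return Q_t[x_t, :] * Qbar_tm1[:, x_0] / Qbar_t[x_t, x_0]

K = 8
Q1 = mask_and_replace_Q(K, 0.95, 0.02)
Q2 = mask_and_replace_Q(K, 0.90, 0.05)
Qbar1, Qbar2 = Q1, Q2 @ Q1

x_0, x_t = 3, K                                    # suppose this token is masked at t = 2
assert np.isclose(posterior(x_t, x_0, Q2, Qbar1, Qbar2).sum(), 1.0)

# reverse step: marginalize the fixed posterior over the network's guess of x_0
p_x0 = np.full(K, 1.0 / K)                         # dummy p_theta(x_0 | x_t); never predicts [MASK]
p_prev = sum(p_x0[j] * posterior(x_t, j, Q2, Qbar1, Qbar2) for j in range(K))
assert np.isclose(p_prev.sum(), 1.0)
```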
Author: Xie Pan. Published 2022-04-20, updated 2022-04-26.
