# Paper Notes: Discrete Denoising Diffusion Models

• Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models.

• Structured Denoising Diffusion Models in Discrete State-Spaces.

• Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, arXiv:2111.12701

• Vector Quantized Diffusion Model for Text-to-Image Synthesis, CVPR2022

### Structured Denoising Diffusion Models in Discrete State-Spaces

Contributions:

• Most recent work has focused on Gaussian diffusion processes that operate in continuous state spaces (e.g., for real-valued image and waveform data). This work improves and extends discrete diffusion models by using a more structured categorical corruption process to shape data generation.

• Three different forward corruption processes are proposed:

• transition matrices that mimic Gaussian kernels in continuous space

• matrices based on nearest neighbors in embedding space

• matrices that introduce absorbing states

• A new loss is also proposed that combines the variational lower bound with an auxiliary cross-entropy loss.

#### forward diffusion process

$$
\begin{aligned}
[Q_t]_{ij} &= q(x_t=j \mid x_{t-1}=i) \\
q(x_t \mid x_{t-1}) &= \mathrm{Cat}\!\left(x_t;\, p = x_{t-1} Q_t\right)
\end{aligned}
$$

where $x_{t-1}$ is a one-hot row vector and $Q_t$ is a row-stochastic transition matrix.
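The one-step forward corruption above can be sketched numerically. A toy uniform transition matrix is used here purely for illustration (the choice of $K$ and $\beta$ is not from the paper):

```python
import numpy as np

def forward_step(x_prev, Q_t, rng):
    """Sample x_t ~ Cat(x_t; p = x_{t-1} Q_t).

    x_prev : int array of token indices, shape (N,)
    Q_t    : row-stochastic transition matrix, shape (K, K)
    """
    probs = Q_t[x_prev]                # row x_{t-1} of Q_t = p(x_t | x_{t-1})
    # vectorized categorical sampling via the inverse-CDF trick
    cdf = np.cumsum(probs, axis=-1)
    u = rng.random((len(x_prev), 1))
    return (u < cdf).argmax(axis=-1)

# toy example: K = 4 tokens, uniform transition with beta = 0.1 (illustrative)
K, beta = 4, 0.1
Q = (1 - beta) * np.eye(K) + beta / K * np.ones((K, K))
rng = np.random.default_rng(0)
x0 = np.array([0, 1, 2, 3])
x1 = forward_step(x0, Q, rng)
```

Iterating `forward_step` over $t = 1, \dots, T$ yields the full forward chain.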

1. Diffusion with an absorbing state

   In other words, each token is replaced by the absorbing [MASK] state with probability $\beta_t$, and once masked it stays masked.

2. Uniform diffusion

   That is, a token keeps its value with probability $1-\dfrac{K-1}{K}\beta_t$ and transitions to each of the other $K-1$ tokens with probability $\dfrac{\beta_t}{K}$.

3. Discretized Gaussian transition matrices
4. Structured diffusion in text: using word-embedding distance to introduce locality

### Unleashing Transformers

Two-stage training: the first stage learns discrete latent codes with a VQ-VAE or VQGAN.

The second stage is BERT-like training with absorbing-state diffusion: at each forward time step, a subset of the latent tokens is replaced by a [MASK] token, and a bidirectional transformer is trained to predict the original tokens at the masked positions.
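One training step of this BERT-like absorbing diffusion can be sketched under common simplifying assumptions (each token is masked independently with probability $t/T$ at step $t$; cross-entropy is taken only over masked positions; `denoiser` is a stand-in for the bidirectional transformer, and the $1/t$ loss reweighting is omitted):

```python
import numpy as np

def training_step(x0, T, denoiser, rng, mask_id):
    """One BERT-like denoising step: sample t, mask ~t/T of the tokens,
    and score the model only on the masked positions."""
    N = len(x0)
    t = rng.integers(1, T + 1)
    masked = rng.random(N) < t / T        # each token masked w.p. t/T
    if not masked.any():
        return 0.0                        # nothing to predict this step
    x_t = np.where(masked, mask_id, x0)
    logits = denoiser(x_t)                # (N, K) unnormalized scores
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    # cross-entropy on the masked positions only
    return -logp[masked, x0[masked]].mean()

# toy usage: K = 8 token classes, a dummy "denoiser" returning flat logits
K, mask_id = 8, 8
rng = np.random.default_rng(1)
x0 = rng.integers(0, K, size=16)
loss = training_step(x0, T=10,
                     denoiser=lambda x: np.zeros((len(x), K)),
                     rng=rng, mask_id=mask_id)
```

With flat (all-zero) logits the per-token cross-entropy is exactly $\log K$, which is a handy sanity check.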

The ELBO is parameterized as a reweighted sum of cross-entropy terms over the masked positions, so training reduces to masked-token denoising.

Generating high-resolution images: because the denoising transformer operates on latent codes, globally consistent images substantially larger than those in the training data can be generated.

Improving code representations: the paper also discusses improving the learned vector-quantized code representations.

### Vector Quantized Diffusion Model for Text-to-Image Synthesis

#### forward diffusion process

$$
\begin{aligned}
[Q_t]_{mn} &= q(x_t=m \mid x_{t-1}=n), \qquad Q_t \in \mathbb{R}^{K\times K} \\
q(x_t \mid x_{t-1}) &= v(x_t)^\top Q_t\, v(x_{t-1})
\end{aligned}
$$

$v(x) \in \{0,1\}^{K}$ is the one-hot column vector of $x$; the categorical distribution over $x_t$ is given by $Q_t\, v(x_{t-1})$.
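The one-hot formulation can be checked directly. Note that in this convention $Q_t$ acts on column vectors, so its columns (not its rows) sum to one; a toy uniform $Q_t$ is used for illustration:

```python
import numpy as np

K, beta = 5, 0.2
# toy column-stochastic uniform transition (illustrative, not the paper's schedule)
Q = (1 - beta) * np.eye(K) + beta / K * np.ones((K, K))

def one_hot(i, K):
    v = np.zeros(K)
    v[i] = 1.0
    return v

x_prev, x_t = 2, 4
# scalar probability q(x_t | x_{t-1}) = v(x_t)^T Q_t v(x_{t-1})
p = one_hot(x_t, K) @ Q @ one_hot(x_prev, K)
```

The sandwich product simply picks out the single matrix entry $[Q_t]_{x_t, x_{t-1}}$.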

The uniform noise over the categorical distribution proposed in [Structured DDM] can be written as $Q_t = (1-\beta_t) I + \dfrac{\beta_t}{K}\mathbb{1}\mathbb{1}^\top$, which has two drawbacks:

• an image token may be replaced to an utterly uncorrelated category, which leads to an abrupt semantic change for that token.
• the network has to take extra efforts to figure out the tokens that have been replaced prior to fixing them.

In contrast, the proposed mask-and-replace strategy has two advantages:

• Corrupted tokens can be identified easily (e.g., those replaced by [MASK]), which speeds up the reverse process.
• Compared with replacing tokens by [MASK] only, the authors show that mixing in a proportion of uniform noise yields a non-trivial posterior: with mask-only corruption, unmasked tokens are always unchanged, so the network could simply copy them.
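The mask-and-replace idea combines both corruption types in a single matrix. A sketch using the $\alpha_t + K\beta_t + \gamma_t = 1$ parameterization, column-stochastic with [MASK] at index $K$ (the specific numbers below are illustrative):

```python
import numpy as np

def mask_and_replace_Q(K, alpha, beta, gamma):
    """(K+1)x(K+1) column-stochastic matrix: a token keeps its value w.p.
    alpha + beta, is uniformly replaced by any real token w.p. beta each,
    and jumps to [MASK] (index K) w.p. gamma; [MASK] is absorbing."""
    assert np.isclose(alpha + K * beta + gamma, 1.0)
    Q = np.zeros((K + 1, K + 1))
    Q[:K, :K] = beta                   # uniform replacement among real tokens
    Q[:K, :K] += alpha * np.eye(K)     # extra mass for keeping the token
    Q[K, :K] = gamma                   # transition into [MASK]
    Q[K, K] = 1.0                      # [MASK] never leaves
    return Q

# illustrative schedule values satisfying alpha + K*beta + gamma = 1
Q = mask_and_replace_Q(4, alpha=0.8, beta=0.025, gamma=0.1)
```

Setting `beta=0` recovers pure absorbing-state (mask-only) diffusion; setting `gamma=0` recovers uniform diffusion.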

#### reverse denoising process

Note that since the transition matrix $Q_t$ is fixed during training, $L_T$ is a constant that measures the gap between training and inference and can be ignored during training.

Xie Pan

2022-04-20

2022-04-26