# Why study GANs?

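• High-dimensional probability distributions. Training generative models and sampling from them is a strong test of our ability to represent high-dimensional probability distributions.
• Reinforcement learning. Generative models can be combined with reinforcement learning.
• Missing data. Generative models can make effective use of unlabeled data, which is the setting of semi-supervised learning.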

# How do generative models work?

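## Maximum likelihood estimation

$$\theta^* = \arg\max_\theta \prod_{i=1}^m p_{\text{model}}(x^{(i)}; \theta) = \arg\max_\theta \sum_{i=1}^m \log p_{\text{model}}(x^{(i)}; \theta)$$

Here m is the number of training examples.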
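As a minimal sketch of this principle (not part of the original notes), assume $p_{\text{model}}$ is a one-dimensional Gaussian with unknown mean and fixed unit variance. The toy data and the helper name `log_likelihood` are illustrative assumptions. For this model the log-likelihood is maximized at the sample mean, which a coarse grid search should reproduce:

```python
import math

def log_likelihood(data, mu, sigma=1.0):
    """Sum of log N(x; mu, sigma^2) over the dataset."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

data = [1.2, 0.7, 2.1, 1.6, 0.9]   # toy samples (assumed, not from the notes)
sample_mean = sum(data) / len(data)

# Grid search over candidate means in [0, 3] with step 0.01
candidates = [i / 100 for i in range(0, 301)]
best_mu = max(candidates, key=lambda mu: log_likelihood(data, mu))

print(f"sample mean = {sample_mean:.2f}, grid-search MLE = {best_mu:.2f}")
```

A real generative model has far too many parameters for grid search, so the same objective is optimized by gradient-based methods instead.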

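### Relative entropy

p is the true distribution and q is the approximating (non-true) distribution. The cross-entropy, the entropy, and the relative entropy (KL divergence) are:

$$H(p,q) = \sum_i p(i)\log{\dfrac{1}{q(i)}}$$

$$H(p) = \sum_i p(i)\log{\dfrac{1}{p(i)}}$$

$$D(p\|q) = H(p,q)-H(p) = \sum_i p(i)\log{\dfrac{p(i)}{q(i)}}$$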
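The three definitions above can be checked numerically. The following is a small sketch using made-up discrete distributions; the function names are illustrative choices, not taken from the notes:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = sum_i p(i) * log(1 / q(i))."""
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D(p || q) = sum_i p(i) * log(p(i) / q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # "true" distribution (toy values)
q = [0.4, 0.4, 0.2]   # approximating distribution (toy values)

gap = cross_entropy(p, q) - entropy(p)
print(f"H(p,q) - H(p) = {gap:.6f}, D(p||q) = {kl_divergence(p, q):.6f}")
```

The two printed numbers should agree up to floating-point error, and the divergence is zero only when q equals p.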

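# How do GANs work?

## The GAN framework

• Discriminator

• Generator

Formally, GANs are a structured probabilistic model containing latent variables z and observed variables x: z is the latent (hidden) variable and x is the observed variable.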

### Generator

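The generator is a differentiable function G, in practice a neural network. z is drawn from a simple prior distribution, and G(z) yields a sample x from $p_{\text{model}}$. The inputs to G do not all have to enter at the first layer; some can be fed into the second layer, and so on. In short, the design of the generator is very flexible.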
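To make the idea concrete, here is a minimal sketch of a generator as a tiny two-layer network in pure Python. The layer sizes, the tanh activation, the Gaussian prior, and the random (untrained) weights are all assumptions chosen for illustration; the notes do not prescribe any particular architecture.

```python
import math
import random

random.seed(0)

Z_DIM, HIDDEN, X_DIM = 2, 4, 3   # illustrative sizes (assumed)

def init_layer(n_in, n_out):
    """Random weight matrix (n_out x n_in) and a zero bias vector."""
    weights = [[random.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(weights, bias, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * vi for w, vi in zip(row, v)) + b for row, b in zip(weights, bias)]

W1, b1 = init_layer(Z_DIM, HIDDEN)
W2, b2 = init_layer(HIDDEN, X_DIM)

def generator(z):
    """G(z): one tanh hidden layer followed by a linear output layer."""
    hidden = [math.tanh(a) for a in affine(W1, b1, z)]
    return affine(W2, b2, hidden)

z = [random.gauss(0.0, 1.0) for _ in range(Z_DIM)]   # z drawn from a simple prior
x = generator(z)                                      # one sample from p_model
print(x)
```

Every operation here is differentiable in the weights, which is the only structural requirement placed on G.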

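## Cost functions

### The discriminator's cost, $J^{(D)}$

The discriminator is trained with the standard cross-entropy cost of a binary classifier:

$$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\dfrac{1}{2}E_{x\sim p_{\text{data}}}\log D(x) - \dfrac{1}{2}E_z\log(1-D(G(z)))$$

Training the discriminator gives an estimate of the following ratio at every point x:

$$\dfrac{p_{\text{data}}(x)}{p_{\text{model}}(x)}$$

GANs obtain this estimate through supervised learning, which is what distinguishes them from variational autoencoders and Boltzmann machines.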
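A short sketch of how the ratio can be read off from a discriminator. It assumes the well-known result that, under the cross-entropy cost, the optimal discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_{\text{model}}(x))$, so that the ratio equals $D^*(x) / (1 - D^*(x))$. The two Gaussian densities are toy assumptions standing in for the real data and model distributions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_data(x):
    return gaussian_pdf(x, mu=0.0, sigma=1.0)   # assumed data density

def p_model(x):
    return gaussian_pdf(x, mu=1.0, sigma=1.5)   # assumed model density

def optimal_discriminator(x):
    """D*(x) = p_data(x) / (p_data(x) + p_model(x))."""
    return p_data(x) / (p_data(x) + p_model(x))

def ratio_from_discriminator(d):
    """Recover p_data / p_model from a discriminator output d = D(x)."""
    return d / (1.0 - d)

x = 0.3
estimated = ratio_from_discriminator(optimal_discriminator(x))
true_ratio = p_data(x) / p_model(x)
print(f"estimated ratio = {estimated:.6f}, true ratio = {true_ratio:.6f}")
```

A trained discriminator only approximates $D^*$, so in practice the recovered ratio is an estimate rather than an exact value.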

### Minimax, zero-sum game

$$J^{(G)} = -J^{(D)}$$

$$V(\theta^{(D)}, \theta^{(G)})=-J^{(D)}(\theta^{(D)}, \theta^{(G)})$$

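$$\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})$$

The outer loop is a minimization over $\theta^{(G)}$ and the inner loop is a maximization over $\theta^{(D)}$.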
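As an illustration, the value function can be estimated from discriminator outputs on a batch. This sketch assumes the cross-entropy discriminator cost, so that $V = \frac{1}{2}E_x\log D(x) + \frac{1}{2}E_z\log(1-D(G(z)))$; the batch values and the name `value_function` are made up for the example.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V = 1/2 E[log D(x)] + 1/2 E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term

# Discriminator outputs on real and generated batches (toy values)
confident = value_function(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])
fooled = value_function(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(f"V with a confident discriminator: {confident:.4f}")
print(f"V when D outputs 0.5 everywhere:  {fooled:.4f}")
```

The discriminator pushes V up toward 0, while the generator pushes it down; when the discriminator outputs 0.5 everywhere, V equals $\log\frac{1}{2}$.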

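The theoretical convergence results for this game assume that updates are made directly in function space. In practice, the players are represented with deep neural nets and updates are made in parameter space, so these results, which depend on convexity, do not apply.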

### Heuristic, non-saturating game

Minimizing the cross-entropy between a target class and a classifier’s predicted distribution is highly effective because the cost never saturates when the classifier has the wrong output.

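Instead of flipping the sign of the discriminator's cost, the generator's cost is built by flipping the target labels in the cross-entropy:

$$J^{(G)}(\theta^{(D)}, \theta^{(G)})=-\dfrac{1}{2}E_z\log D(G(z))-\dfrac{1}{2}E_{x\sim p_{\text{data}}}\log(1-D(x))$$

The second term does not depend on $\theta^{(G)}$, so the generator's cost reduces to:

$$J^{(G)}(\theta^{(D)}, \theta^{(G)})=-\dfrac{1}{2}E_z\log D(G(z))$$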

In the minimax game, the generator minimizes the log-probability of the discriminator being correct. In this game, the generator maximizes the log-probability of the discriminator being mistaken.
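The practical difference between the two games shows up in the gradient the generator receives. The sketch below assumes the discriminator output is $D = \sigma(a)$ for a logit $a$, and compares the per-sample derivatives with respect to $a$ of the minimax term $\frac{1}{2}\log(1-\sigma(a))$ and the heuristic term $-\frac{1}{2}\log\sigma(a)$. The logit value is a toy choice.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def minimax_grad(a):
    """d/da of 1/2 * log(1 - sigmoid(a)), the minimax generator cost per sample."""
    return -0.5 * sigmoid(a)

def non_saturating_grad(a):
    """d/da of -1/2 * log(sigmoid(a)), the heuristic generator cost per sample."""
    return -0.5 * (1.0 - sigmoid(a))

a = -6.0   # logit of a generated sample the discriminator confidently rejects (toy value)
print(f"D(G(z))             = {sigmoid(a):.5f}")
print(f"minimax gradient    = {minimax_grad(a):.5f}")
print(f"non-saturating grad = {non_saturating_grad(a):.5f}")
```

When the discriminator rejects a sample with high confidence, the minimax gradient is close to zero, while the non-saturating gradient stays close to $-\frac{1}{2}$. This is the sense in which the heuristic cost does not saturate.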

### Maximum likelihood game

$$J^{(G)}=-\dfrac{1}{2}E_z\exp\left(\sigma^{-1}(D(G(z)))\right)$$

Here $\sigma$ is the logistic sigmoid function. The equivalence with maximum likelihood holds in expectation. In practice, both stochastic gradient descent on the KL divergence and the GAN training procedure will have some variance around the true expected gradient due to the use of sampling (of x for maximum likelihood and z for GANs) to construct the estimated gradient.

### Is the choice of divergence a distinguishing feature of GANs?

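#### Jensen-Shannon divergence and reverse KL

The KL divergence is not symmetric: $D_{KL}(p_{\text{data}}\|p_{\text{model}})$ and $D_{KL}(p_{\text{model}}\|p_{\text{data}})$ are different. Maximum likelihood estimation minimizes the former, while minimizing the Jensen-Shannon divergence behaves more like the latter.

f-GAN shows that training with the KL divergence can also produce sharp samples and still selects only a small number of modes, which suggests that the Jensen-Shannon divergence is not what distinguishes GANs from other models.

GANs often generate samples from only a few modes, fewer than the model's capacity would allow. The reverse KL, by contrast, prefers to cover as many modes of the data distribution as the model's capacity permits; it does not in general prefer fewer modes. This also indicates that mode collapse is not caused by the choice of divergence.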

Altogether, this suggests that GANs choose to generate a small number of modes due to a defect in the training procedure, rather than due to the divergence they aim to minimize.
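The asymmetry between the two KL directions can be seen in a small discrete example. This is a sketch with made-up numbers: `p_data` has two modes, and the model is assumed to be too limited to match it, so it must choose between spreading its mass (`q_cover`) and concentrating on one mode (`q_seek`).

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.49, 0.01, 0.01, 0.49]    # two modes, in the first and last bins (toy values)
q_cover = [0.25, 0.25, 0.25, 0.25]   # spreads mass over everything
q_seek = [0.97, 0.01, 0.01, 0.01]    # commits to a single mode

forward = {"cover": kl(p_data, q_cover), "seek": kl(p_data, q_seek)}
reverse = {"cover": kl(q_cover, p_data), "seek": kl(q_seek, p_data)}

print("forward KL(p_data || q):", forward)
print("reverse KL(q || p_data):", reverse)
```

With these numbers the forward KL is smaller for `q_cover` and the reverse KL is smaller for `q_seek`. Note that this only describes the preference of each divergence under limited capacity; it does not by itself explain why GANs use fewer modes than their capacity allows.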

### Comparison of cost functions

$D(G(z))$ is the probability that the discriminator assigns to a generated sample being real.

Maximum likelihood also suffers from the problem that nearly all of the gradient comes from the right end of the curve, meaning that a very small number of samples dominate the gradient computation for each minibatch. This suggests that variance reduction techniques could be an important research area for improving the performance of GANs, especially GANs based on maximum likelihood.
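To compare the three generator costs side by side, the sketch below evaluates each per-sample cost as a function of $D(G(z))$: the minimax term $\frac{1}{2}\log(1-D)$, the non-saturating term $-\frac{1}{2}\log D$, and the maximum likelihood term $-\frac{1}{2}\exp(\sigma^{-1}(D))$. The function names and the evaluation points are illustrative choices.

```python
import math

def logit(d):
    """Inverse of the logistic sigmoid."""
    return math.log(d / (1.0 - d))

def minimax_cost(d):
    return 0.5 * math.log(1.0 - d)

def non_saturating_cost(d):
    return -0.5 * math.log(d)

def max_likelihood_cost(d):
    return -0.5 * math.exp(logit(d))   # algebraically equal to -0.5 * d / (1 - d)

for d in (0.01, 0.5, 0.99):
    print(f"D(G(z))={d:.2f}  minimax={minimax_cost(d):8.3f}  "
          f"non-saturating={non_saturating_cost(d):8.3f}  "
          f"max-likelihood={max_likelihood_cost(d):8.3f}")
```

The maximum likelihood cost is nearly flat for small $D(G(z))$ and drops steeply as $D(G(z))$ approaches 1, which is why a handful of samples at the right end of the curve dominate the gradient.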

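The DCGAN architecture.

## Comparison of GANs, NCE, and MLE

• The minimax GAN and NCE use the same cost function.

• The update strategies differ. GANs train both players by gradient descent. MLE trains the discriminator by gradient descent, but after each discriminator update it copies the density model learned inside the discriminator and converts it into a sampler to be used as the generator. NCE never updates the generator; it is just a fixed source of noise.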

# Tips and Tricks

