# GAN from Scratch, Part 3: Text Generation Planning

## Reward Augmented Maximum Likelihood for Neural Structured Prediction

### Motivation

#### Maximum likelihood based method

$$L_{ML}=\sum_{(x,y^* )\in D}-\log p_{\theta}(y^* |x)$$

• Minimizing this objective increases the conditional probability of the target outputs, $\log p_{\theta}(y^* |x)$, while decreasing the conditional probability of alternative incorrect outputs. According to this objective, all negative outputs are equally wrong, and none is preferred over the others.

• From *Generating Sentences from a Continuous Space*: "However, by breaking the model structure down into a series of next-step predictions, the RNNLM does not expose an interpretable representation of global features like topic or of high-level syntactic properties."
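Concretely, $L_{ML}$ is just a sum of per-step negative log-probabilities of the target tokens. A minimal numpy sketch with a made-up toy model (vocabulary of 4 tokens, hand-picked probabilities):

```python
import numpy as np

# Toy setup: vocabulary of 4 tokens, target sequence y* = [2, 0, 3].
# p[t] is the model's (hypothetical) next-token distribution at step t.
p = np.array([
    [0.1, 0.2, 0.6, 0.1],   # step 0: p(y0 = 2 | x)       = 0.6
    [0.7, 0.1, 0.1, 0.1],   # step 1: p(y1 = 0 | x, y0)   = 0.7
    [0.2, 0.1, 0.1, 0.6],   # step 2: p(y2 = 3 | x, y<2)  = 0.6
])
y_star = [2, 0, 3]

# L_ML = -log p_theta(y* | x) = -sum_t log p_theta(y*_t | x, y*_<t)
loss = -np.sum(np.log(p[np.arange(len(y_star)), y_star]))
print(loss)
```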

#### RL based method

$$L_{RL}(\theta;\tau,D)=\sum_{(x,y^* )\in D}\left\{-\tau\mathbb{H}\left(p_{\theta}(y|x)\right)-\sum_{y\in \mathbb{Y}}p_{\theta}(y|x)r(y,y^* )\right\}\quad{(1)}$$

Here $D$ is the training parallel data, $\mathbb{H}(p)$ is the entropy of the distribution $p_{\theta}$, $\mathbb{H}(p(y))=-\sum_{y\in \mathbb{Y}}p(y)\log p(y)$, and $\tau$ is a temperature hyper-parameter. This objective can be related to the SeqGAN formula from the previous blog post:

$$J(\theta)=E[R_T|s_0,\theta]=\sum_{y_1\in V}G_{\theta}(y_1|s_0)\cdot Q_{D_{\phi}}^{G_{\theta}}(s_0,y_1)\quad{(2)}$$

The second term of Eq. (1) is exactly Eq. (2). The first term of Eq. (1) is then the entropy of $p_{\theta}(y|x)$, acting as a regularizer, rather than the maximum-likelihood cross-entropy.

• Optimizing $L_{RL}(\theta;\tau)$ with stochastic gradient descent (SGD) is very difficult, because the gradients of the reward term have large variance.

• It does not make effective use of the supervision (the ground-truth outputs).
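For intuition, Eq. (1) can be evaluated exactly on a toy output space small enough to enumerate; all probabilities and rewards below are made up:

```python
import numpy as np

tau = 0.5
# Toy output space of 3 candidate sequences with model probabilities
# p_theta(y|x) and rewards r(y, y*) -- all values are made up.
p = np.array([0.5, 0.3, 0.2])
r = np.array([1.0, 0.4, 0.0])   # e.g. a rescaled negative edit distance

entropy = -np.sum(p * np.log(p))           # H(p_theta(.|x))
expected_reward = np.sum(p * r)            # sum_y p_theta(y|x) r(y, y*)
L_RL = -tau * entropy - expected_reward    # Eq. (1) for a single (x, y*)
print(L_RL)
```

In practice the output space is far too large to enumerate, so the expectation is estimated from samples of $p_\theta$, which is exactly where the large gradient variance comes from.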

#### RAML

$$q(y|y^* ;\tau)=\dfrac{1}{Z(y^* ,\tau)}\exp\left\{r(y,y^* )/\tau\right\}\quad(3)$$

Note that the temperature parameter, $\tau \ge 0$, serves as a hyper-parameter that controls the smoothness of the optimal distribution around correct targets by taking into account the reward function in the output space.
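Eq. (3) is straightforward to compute over a small candidate set; the sketch below (with made-up rewards, e.g. negative edit distances) shows how $\tau$ controls the smoothness around $y^*$:

```python
import numpy as np

def payoff_distribution(rewards, tau):
    """q(y|y*; tau) proportional to exp(r(y, y*)/tau), Eq. (3)."""
    logits = np.asarray(rewards, dtype=float) / tau
    w = np.exp(logits - logits.max())   # subtract max for numerical stability
    return w / w.sum()                  # normalize by Z(y*, tau)

# Made-up rewards: negative edit distance to y* for 4 candidates,
# the first candidate being y* itself (distance 0).
r = [0.0, -1.0, -1.0, -2.0]
q_sharp = payoff_distribution(r, tau=0.5)   # sharp: mass concentrates on y*
q_smooth = payoff_distribution(r, tau=5.0)  # smooth: close to uniform
print(q_sharp, q_smooth)
```

As $\tau \to 0$, $q$ collapses to a delta at $y^*$ (recovering plain ML on the original target); as $\tau \to \infty$, it approaches the uniform distribution.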

##### Optimization

1. In RL, the samples $y$ are drawn from the generative model itself, and that model keeps evolving during training. This makes training slow, e.g., the roll-out policy in SeqGAN.

2. The reward is very sparse in the high-dimensional output space, which makes optimization difficult.

3. Actor-critic methods.

##### Sampling from the exponentiated payoff distribution

• The given ground truth $y^*$ has length $m$.

• Based on edit distance, sample sentences $y$ whose distance from $y^*$ is at most $e$, where $e\in \{0,\dots,2m\}$.
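A simplified sketch of this stratified sampler, assuming substitutions only so the length stays fixed (the paper's sampler also covers insertions and deletions, which is why $e$ can reach $2m$; `sample_by_edit_distance` is a hypothetical helper):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def sample_by_edit_distance(y_star, vocab, tau):
    """Approximately sample y ~ q(y|y*; tau), substitutions only.

    First draw an edit distance e with probability proportional to
    (#sequences at distance e) * exp(-e/tau), then apply e random
    substitutions to y*. Reward here is r(y, y*) = -e.
    """
    m = len(y_star)
    es = np.arange(m + 1)
    # number of length-m sequences at substitution distance e from y*:
    # C(m, e) * (|V| - 1)^e
    counts = np.array([comb(m, e) * (len(vocab) - 1) ** e for e in es], float)
    w = counts * np.exp(-es / tau)
    e = rng.choice(es, p=w / w.sum())
    y = list(y_star)
    for i in rng.choice(m, size=e, replace=False):
        y[i] = rng.choice([v for v in vocab if v != y[i]])
    return y

y_star = list("abcab")
y = sample_by_edit_distance(y_star, vocab=list("abcd"), tau=1.0)
print(y)
```

The key point is that this sampler never touches the model $p_\theta$: it depends only on $y^*$ and the metric, so the samples can even be drawn offline.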

RL: x –> sample a sentence y' from the decoder –> compute the metric between y' and the ground truth $y^*$ –> use the metric as the reward and compute the policy gradient

RAML: $y^*$ –> sample a sentence y from the distribution induced by the metric –> use the sampled y as the ground truth for ML training

RAML looks almost perfect, with no optimization difficulties at all. But there is no free lunch: the hard part of RAML is converting the metric into the corresponding distribution. RAML only provides a way to turn metrics such as edit distance into a distribution; for metrics like BLEU it offers no solution.
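In contrast to RL, a RAML update is plain ML training on targets drawn from $q$; on a toy output space the expected loss can even be computed exactly instead of sampled. All numbers below are made up:

```python
import numpy as np

# Toy output space of 4 candidates; all values are made up.
q = np.array([0.70, 0.15, 0.10, 0.05])   # q(y|y*; tau) from the metric
log_p = np.log([0.5, 0.2, 0.2, 0.1])     # model log p_theta(y|x)

# RAML objective for one (x, y*): cross-entropy from q to p_theta.
# In practice this expectation is estimated by sampling y ~ q and
# running an ordinary ML step on each sample.
L_RAML = -np.sum(q * log_p)
print(L_RAML)
```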

##### Related work on structured prediction

(a) supervised learning approaches that ignore task reward and use supervision;

(b) reinforcement learning approaches that use only task reward and ignore supervision;

(c) hybrid approaches that attempt to exploit both supervision and task reward.

## Generating Sentences from a Continuous Space (Samuel R. Bowman et al.)

### Motivation

For this task, we introduce a novel evaluation strategy using an adversarial classifier, sidestepping the issue of intractable likelihood computations by drawing inspiration from work on non-parametric two-sample tests and adversarial training.

Xie Pan

2019-06-22

2021-06-29