# Paper Notes: Pointer Networks and the Copy Mechanism

paper: Pointer Networks (Vinyals et al., 2015); Incorporating Copying Mechanism in Sequence-to-Sequence Learning (Gu et al., 2016)

## Pointer Network

### Motivation

We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.

Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence [1] and Neural Turing Machines [2], because the number of target classes in each step of the output depends on the length of the input, which is variable.

Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class.

It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems – finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem – using training examples alone.

Ptr-Net can be used to learn approximate solutions to these three geometric problems.

Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on.

Ptr-Net not only improves over seq2seq with attention, but also generalizes to variable-size output dictionaries:

• First, simple copying is hard to achieve with conventional approaches, whereas Ptr-Net generates the output sequence directly from the input sequence.

• Second, it handles the case where the output dictionary varies. A vanilla Seq2Seq model has a fixed output dictionary size, which is unfriendly when the output contains words from the input (especially OOV and rare words). On one hand, the word embeddings of words rarely seen in training are of low quality, so the decoder can hardly predict them; on the other hand, even with good embeddings, named entities such as person names all have very similar embeddings, making it hard to accurately reproduce the exact word mentioned in the input. The copy mechanism in Pointer Network, and in the follow-up work CopyNet, handles this well: at each time step the decoder learns how to directly copy key words that appear in the input.

### Model Architecture

#### Sequence-to-Sequence Model

$$p(C^P|P;\theta)=\prod_{i=1}^{m(P)}p_{\theta}(C_i|C_1,\dots,C_{i-1},P;\theta)$$

$$\theta^* = \arg\max_{\theta}\sum_{P,C^P}\log p(C^P|P;\theta)$$

In this sequence-to-sequence model, the output dictionary size for all symbols $C_i$ is fixed and equal to n, since the outputs are chosen from the input. Thus, we need to train a separate model for each n. This prevents us from learning solutions to problems that have an output dictionary with a size that depends on the input sequence length.
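
As a minimal worked example of the factorization above (a toy sketch, not the paper's code; `step_probs` is a hypothetical stand-in for the per-step softmax outputs $p_{\theta}(C_i|C_1,\dots,C_{i-1},P)$):

```python
import numpy as np

# Hypothetical per-step conditionals p(C_i | C_1..C_{i-1}, P) for one example,
# i.e. the probability the model assigned to the correct output token at each step.
step_probs = np.array([0.9, 0.7, 0.8, 0.6])

# p(C^P | P; theta) = prod_i p(C_i | C_<i, P; theta)
seq_prob = np.prod(step_probs)

# Training maximizes the log-likelihood, i.e. the sum of per-step log-probabilities.
log_likelihood = np.sum(np.log(step_probs))

print(seq_prob, log_likelihood)  # 0.3024, log(0.3024)
```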

#### Content Based Input Attention

This model performs significantly better than the sequence-to-sequence model on the convex hull problem, but it is not applicable to problems where the output dictionary size depends on the input.

Nevertheless, a very simple extension (or rather reduction) of the model allows us to do this easily.

#### Ptr-Net

A seq2seq model produces each output word via a softmax over a fixed dictionary and picks the most probable word to build the output sequence. Here, however, the output dictionary size depends on the length of the input sequence, so the authors propose a new model, which is in fact very simple.

$$u_j^i = v^T \tanh(W_1 e_j + W_2 d_i), \quad j\in(1,\dots,n)$$

$$p(C_i|C_1,\dots,C_{i-1},P)=\mathrm{softmax}(u^i)$$

Here i indexes the decoder time step and j indexes positions in the input sequence, so $e_j$ is the encoder hidden vector for position j and $d_i$ is the decoder hidden vector at time step i. This is essentially the same as ordinary attention, except that the resulting softmax distribution is used directly as the output distribution over positions of the input sequence $P$, i.e. as a pointer, rather than being used to blend the encoder states into a context vector.
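
The following is a minimal NumPy sketch of this pointer step (my own toy illustration; the dimensions and random parameters are assumptions, not values from the paper):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

# Toy dimensions (assumptions): n input positions, hidden size d.
n, d = 5, 8
rng = np.random.default_rng(0)

e = rng.normal(size=(n, d))   # encoder hidden states e_1..e_n
d_i = rng.normal(size=d)      # decoder hidden state at step i
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

# u_j^i = v^T tanh(W1 e_j + W2 d_i), one score per input position j
u = np.array([v @ np.tanh(W1 @ e[j] + W2 @ d_i) for j in range(n)])

# The softmax over u is used directly as the output distribution over input
# positions (a "pointer"), not as weights for blending e_j into a context vector.
p_pointer = softmax(u)
print(p_pointer.argmax())  # index of the input element selected at this decoder step
```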

## CopyNet

### Motivation

We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation.

The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation.

For example:

• What to copy: which parts of the input should be copied?

• Where to paste: where in the output should this information be pasted?

### Model Architecture

• From a cognitive perspective, the copying mechanism is related to rote memorization, requiring less understanding but ensuring high literal fidelity.

• From a modeling perspective, the copying operations are more rigid and symbolic, which makes them harder than the soft attention mechanism to integrate into a fully differentiable neural model.

Encoder:

An LSTM transforms the source sequence into the hidden states $h_1,\dots,h_{T_S}$, which form the memory M.

Decoder:

• Prediction: COPYNET predicts words based on a mixed probabilistic model of two modes, namely the generate-mode and the copy-mode, where the latter picks words from the source sequence, much as the Ptr-Net above does.
• State Update: the predicted word at time t−1 is used in updating the state at time t, but COPYNET uses not only its word embedding but also the location-specific hidden state in M that corresponds to that word (if any).

#### Prediction with Copying and Generation: $s_t \rightarrow y_t$

$$p(y_t|s_t,y_{t-1},c_t,M)=p(y_t,g|s_t,y_{t-1},c_t,M) + p(y_t,c|s_t,y_{t-1},c_t,M)$$

Reading the memory M mixes two addressing strategies:

• Content-based addressing: the ordinary attentive read of M.

• Location-based addressing: the selective read from location-specific hidden units in M (described under State Update below).

$$p(y_t,g|\cdot)=\begin{cases} \dfrac{1}{Z}e^{\psi_g(y_t)}, & y_t\in V \\ 0, & y_t\in X\cap \overline V \\ \dfrac{1}{Z}e^{\psi_g(\mathrm{UNK})}, & y_t\notin V\cup X \end{cases}$$

$$p(y_t,c|\cdot)=\begin{cases}\dfrac{1}{Z}\sum_{j:x_j=y_t}e^{\psi_c(x_j)}, & y_t\in X \\ 0, & \text{otherwise}\end{cases}$$

Z is the normalization term shared by the two modes: $Z=\sum_{v\in V\cup\{\mathrm{UNK}\}}e^{\psi_g(v)}+\sum_{x\in X}e^{\psi_c(x)}$.

Generate-Mode:

$$\psi_g(y_t=v_i)=\nu_i^T W_o s_t, \quad v_i\in V\cup \{\mathrm{UNK}\}$$

• $W_o\in R^{(N+1)\times d_s}$

• $\nu_i$ is the one-hot indicator vector for $v_i$; the result is the (unnormalized) score of the current word.

The generate-mode score $\psi_g(y_t=v_i)$ is the same as in an ordinary encoder-decoder: a fully-connected layer followed by the softmax.

Copy-Mode:

$$\psi_c(y_t=x_j)=\sigma(h_j^T W_c)s_t, \quad x_j\in X$$

• $h_j$ is the encoder hidden state, where j indexes a position in the input sequence.

• $W_c\in R^{d_h\times d_s}$ projects $h_j$ into the same semantic space as $s_t$.

• The authors find that the tanh non-linearity works better here. Also, since the word $y_t$ may appear multiple times in the input, the copy-mode probability sums the scores of all input positions whose word equals $y_t$.
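
Putting the two modes together, here is a toy NumPy sketch of the prediction step (my own illustration; all sizes and parameters are made up, and tanh plays the role of σ above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): vocabulary of N words plus UNK, source length T_s,
# decoder state size d_s, encoder state size d_h.
N, T_s, d_s, d_h = 6, 4, 8, 8
s_t = rng.normal(size=d_s)            # decoder state at step t
h = rng.normal(size=(T_s, d_h))       # encoder states h_1..h_{T_s} (the memory M)
W_o = rng.normal(size=(N + 1, d_s))   # generate-mode weights (one row per word incl. UNK)
W_c = rng.normal(size=(d_h, d_s))     # copy-mode projection

# Generate-mode scores: psi_g(v_i) = nu_i^T W_o s_t  (just row i of W_o s_t)
psi_g = W_o @ s_t

# Copy-mode scores: psi_c(x_j) = tanh(h_j^T W_c) s_t for each source position j
psi_c = np.tanh(h @ W_c) @ s_t

# Shared normalization Z over both modes
Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()
p_gen = np.exp(psi_g) / Z             # p(y_t, g | .) for each vocabulary word
p_copy_pos = np.exp(psi_c) / Z        # copy probability mass at each source position

# If a word y_t occurs at several source positions, its copy probability is the sum
# of the mass at those positions; p(y_t) = p_gen[y_t] + sum of the matching p_copy_pos.
print(p_gen.sum() + p_copy_pos.sum())  # ~1.0 by construction
```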

#### State Update

$$c_t=\sum_{\tau=1}^{T_S}\alpha_{t\tau}h_{\tau}$$

$$\alpha_{t\tau}=\dfrac{e^{\eta(s_{t-1},h_{\tau})}}{\sum_{\tau'}e^{\eta(s_{t-1},h_{\tau'})}}$$

CopyNet treats $y_{t-1}$ differently here. It uses not only the word embedding but also the hidden state at the corresponding location in M; in other words, the representation of $y_{t-1}$ contains both parts, $[e(y_{t-1});\zeta(y_{t-1})]$, where $e(y_{t-1})$ is the word embedding and the extra term $\zeta(y_{t-1})$ is called the selective read, which makes it possible to copy longer consecutive phrases. Like attention, it is a weighted sum of the hidden states in M.

$$\zeta(y_{t-1})=\sum_{\tau=1}^{T_S}\rho_{t\tau}h_{\tau}$$

$$\rho_{t\tau}=\begin{cases}\dfrac{1}{K}p(x_{\tau},c|s_{t-1},M), & x_{\tau}=y_{t-1} \\ 0, & \text{otherwise} \end{cases}$$

• When $y_{t-1}$ does not appear in the source sentence, $\zeta(y_{t-1})=\mathbf{0}$.

• Here $K=\sum_{\tau':x_{\tau'}=y_{t-1}}p(x_{\tau'},c|s_{t-1},M)$ is the normalizing sum. Again, the current word may occur at multiple positions in the input, and each occurrence has a different encoder hidden state, so the weights of the occurrences also differ.

• The p here is not explained in the paper; my guess is that it is the same as the copy-mode probability computed above.

• Intuitively, $\zeta(y_{t-1})$ can be seen as a selective read of M: first compute the weights of all input positions that match $y_{t-1}$, then take the weighted sum of their hidden states, which gives $\zeta(y_{t-1})$.

$$\zeta(y_{t-1}) \xrightarrow{\text{update}} s_t \xrightarrow{\text{predict}} y_t \xrightarrow{\text{sel. read}} \zeta(y_t)$$
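
A toy sketch of the selective read (my own rendering; the source tokens and the copy probabilities standing in for $p(x_{\tau},c|s_{t-1},M)$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

T_s, d_h = 5, 8
h = rng.normal(size=(T_s, d_h))            # memory M: encoder states h_1..h_{T_s}
src = ["the", "cat", "sat", "the", "mat"]  # source tokens x_1..x_{T_s}
y_prev = "the"                             # the word emitted at step t-1

# Hypothetical copy-mode probabilities p(x_tau, c | s_{t-1}, M) from the previous step
p_copy = np.array([0.30, 0.05, 0.05, 0.20, 0.10])

# rho_{t,tau} = p(x_tau, c | .) / K if x_tau == y_{t-1}, else 0
match = np.array([tok == y_prev for tok in src], dtype=float)
K = (p_copy * match).sum()                 # normalizes over the matching positions only
rho = np.where(match > 0, p_copy / K, 0.0) if K > 0 else np.zeros(T_s)

# Selective read: weighted sum of the matching hidden states
zeta = (rho[:, None] * h).sum(axis=0)      # zero vector if y_{t-1} is not in the source

# The state update then feeds [e(y_{t-1}); zeta(y_{t-1})] to the decoder as the
# representation of the previous token.
print(rho, zeta.shape)
```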

### Learning

$$L=-\dfrac{1}{N}\sum_{k=1}^N\sum_{t=1}^T \log\left[p(y_t^{(k)}|y_{<t}^{(k)}, X^{(k)})\right]$$

N is the batch size and T is the length of the target sentence.
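
A minimal sketch of this loss, assuming the mixed per-step probabilities of the gold tokens have already been gathered into an array of shape [N, T]:

```python
import numpy as np

# Hypothetical per-step probabilities p(y_t^(k) | y_<t^(k), X^(k)), already mixed
# from the generate- and copy-modes, for a batch of N sequences of length T.
p = np.array([[0.9, 0.8, 0.7],
              [0.6, 0.9, 0.5]])   # shape [N=2, T=3]

N, T = p.shape
# L = -(1/N) * sum_k sum_t log p(y_t^(k) | ...)
loss = -np.log(p).sum() / N
print(loss)
```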

http://www.panxiaoxie.cn/2018/08/25/论文笔记-Pointer-Networks/

Xie Pan

2018-08-25

2021-06-29