• Seq2Seq basic model
• seq2seq encoder and decoder
• Attention mechanisms: three kinds of attention are introduced
• Bahdanau et al. NMT model: the key point is how to compute the context vector $c_i$

### Introduction

• Translation: taking a sentence in one language as input and outputting the same sentence in another language

• Conversation: taking a statement or question as input and responding to it.

• Summarization: taking a large body of text as input and outputting a summary of it.

• word-based systems: cannot capture the word order within a sentence
• phrase-based systems: cannot handle long-range dependencies
• Seq2Seq models: can generate arbitrary output sequences after seeing the entire input. They can even focus in on specific parts of the input automatically to help generate a useful translation.

### sequence-to-sequence Basics

Sutskever et al. 2014, "Sequence to Sequence Learning with Neural Networks"

#### Seq2Seq-encoder

The encoder's job is to read the input sentence into the model and produce a fixed-dimensional context vector $C$. Compressing the information of an arbitrarily long sentence into a single fixed-dimensional vector is clearly difficult, so the encoder typically uses stacked LSTMs.

#### Seq2Seq-decoder

The decoder's job is to generate the output sentence: a softmax layer on top of the top-most LSTM produces the output word at the current time step, and that word is then fed back in as the input word at the next time step.
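This feed-the-output-back-in loop can be sketched as follows. `toy_step` is a hypothetical stand-in for one real LSTM-plus-softmax decoder step, using a four-word vocabulary where id 0 acts as `<bos>` and id 3 as `<eos>`:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(step_fn, h0, bos_id, eos_id, max_len=10):
    """Run the decoder one step at a time, feeding the previous output
    word back in as the next input, until <eos> or max_len."""
    h, word, out = h0, bos_id, []
    for _ in range(max_len):
        logits, h = step_fn(word, h)             # one decoder time step
        word = int(np.argmax(softmax(logits)))   # pick the most probable word
        if word == eos_id:
            break
        out.append(word)
    return out

def toy_step(word, h):
    """Deterministic toy 'decoder': always predicts word + 1 (mod 4)."""
    logits = np.full(4, -1e9)
    logits[(word + 1) % 4] = 0.0
    return logits, h

print(greedy_decode(toy_step, None, bos_id=0, eos_id=3))  # [1, 2]
```

A real decoder would carry an actual LSTM hidden state in `h` and learned softmax logits; the loop structure is the same.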

### Attention Mechanism

#### Motivation

• In other words, the words in the input sentence do not all carry the same importance. For example, in "the ball is on the field", the words "ball", "on", and "field" are clearly the most important.

• Moreover, different parts of the output may attend more to certain parts of the input. Typically the first few output words depend mostly on the first few input words, and the last few output words depend mostly on the last few input words.

#### Bahdanau et al. NMT model

##### Decoder: General description

The conditional probability of the next word generated by the decoder:

$P(y_i|y_1,...,y_{i-1},X)=g(y_{i-1},s_i,c_i)$

$s_i=f(s_{i-1},y_{i-1},c_i)$

Let $\alpha_{ij}$ be the probability that the target word $y_i$ is aligned to, or translated from, the source word $x_j$. Then the $i$-th context vector $c_i$ is the expected annotation over all the annotations with probabilities $\alpha_{ij}$: $c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j$.
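A minimal NumPy sketch of one attention step: alignment energies from an additive score $e_{ij}=v^T\tanh(W_s s_{i-1}+W_h h_j)$ in the style of Bahdanau et al., softmax to get $\alpha_{ij}$, then the expected annotation. The weight shapes and random values here are purely illustrative:

```python
import numpy as np

def attention_context(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention.
    s_prev: (d,) previous decoder state; H: (T_x, d) encoder annotations."""
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v   # (T_x,) alignment energies
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                      # softmax: weights sum to 1
    c = alpha @ H                             # expected annotation, shape (d,)
    return c, alpha

rng = np.random.default_rng(0)
d, T = 4, 5
H = rng.standard_normal((T, d))
c, alpha = attention_context(rng.standard_normal(d), H,
                             rng.standard_normal((d, d)),
                             rng.standard_normal((d, d)),
                             rng.standard_normal(d))
print(alpha)        # a probability distribution over the T source positions
print(c.shape)      # (4,)
```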

##### Encoder: bidirectional RNN for annotation sequences

forward RNN $\overrightarrow f$ reads input sentence (from $x_1$ to $x_{T_x}$): $(\overrightarrow h_1,...,\overrightarrow h_{T_x})$

backward RNN $\overleftarrow f$ reads input sentence in the reverse order (from $x_{T_x}$ to $x_1$): $(\overleftarrow h_1,...,\overleftarrow h_{T_x})$

annotation for $x_j$: $h_j=[\overrightarrow{h_j}^T;\overleftarrow{h_j}^T]^T$, the concatenation of the forward and backward hidden states
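A tiny illustration of the concatenation (the hidden-state values are made up): each annotation $h_j$ then summarizes both the left and the right context of word $x_j$.

```python
import numpy as np

# Hypothetical forward/backward hidden states for a 3-word sentence, dim 2.
h_fwd = np.array([[1., 0.], [2., 0.], [3., 0.]])   # reads x_1 .. x_3
h_bwd = np.array([[0., 3.], [0., 2.], [0., 1.]])   # reads x_3 .. x_1
H = np.concatenate([h_fwd, h_bwd], axis=1)         # annotations h_j, shape (3, 4)
print(H[0])  # [1. 0. 0. 3.] -- carries context from both directions for x_1
```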

### Luong et al. NMT model

#### Global attention

encoder hidden-state sequence: $h_1,...,h_n$, where $n$ is the input sequence length

decoder hidden-state sequence: $\overline h_1,...,\overline h_m$

$score(h_i,\overline h_j)=\begin{cases} h_i^T\overline h_j & \text{dot} \\ h_i^TW\overline h_j & \text{general} \\ W[h_i;\overline h_j] & \text{concat} \end{cases}$
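The three score variants can be sketched in NumPy. Note this follows the simplified concat form shown above; the full Luong et al. paper additionally applies $\tanh$ and a vector $v_a$ in the concat case:

```python
import numpy as np

def score(h, h_bar, mode, W=None):
    """Score how well decoder state h_bar matches encoder state h."""
    if mode == "dot":
        return h @ h_bar                           # no parameters
    if mode == "general":
        return h @ W @ h_bar                       # W: (d, d)
    if mode == "concat":
        return W @ np.concatenate([h, h_bar])      # W: (2d,), gives a scalar
    raise ValueError(mode)

d = 3
h, h_bar = np.ones(d), np.arange(d, dtype=float)   # toy hidden states
print(score(h, h_bar, "dot"))                      # 3.0
print(score(h, h_bar, "general", np.eye(d)))       # identity W reduces to dot: 3.0
print(score(h, h_bar, "concat", np.ones(2 * d)))   # 6.0
```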

#### Local Attention

the model predicts an aligned position in the input sequence. Then, it computes a context vector using a window centered on this position. The computational cost of this attention step is constant and does not explode with the length of the sentence.
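A sketch of the windowing step, assuming the alignment scores and the predicted position `p` are already computed (in the paper, $p_t = S\cdot\sigma(v_p^T\tanh(W_p h_t))$, and the Gaussian uses $\sigma = D/2$). Unlike the paper, this sketch renormalises inside the window:

```python
import numpy as np

def local_attention_weights(scores, p, D):
    """Keep only the window [p - D, p + D] around the predicted position p,
    favour positions near p with a Gaussian (sigma = D/2), renormalise."""
    S = len(scores)
    lo, hi = max(0, int(p) - D), min(S, int(p) + D + 1)
    s = np.arange(lo, hi)
    e = scores[lo:hi] - (s - p) ** 2 / (2 * (D / 2) ** 2)  # Gaussian in log-space
    a = np.exp(e - e.max())
    a /= a.sum()
    w = np.zeros(S)                 # positions outside the window get weight 0
    w[lo:hi] = a
    return w

w = local_attention_weights(np.zeros(10), p=4.0, D=2)
print(np.flatnonzero(w))  # only positions 2..6 receive attention
```

The cost per step depends only on the window size `2*D + 1`, not on the sentence length, which is the point of the local variant.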

How should the window be chosen? Christopher seemed to say in lecture that reinforcement learning is used.

### Johnson et al. 2016, "Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation"

The new multilingual model not only improved their translation performance, it also enabled "zero-shot translation," in which we can translate between two languages for which we have no translation training data. For instance, if we only had examples of Japanese-English translations and Korean-English translations, Google’s team found that the multilingual NMT system trained on this data could actually generate reasonable Japanese-Korean translations. The powerful implication of this finding is that part of the decoding process is not language-specific, and the model is in fact maintaining an internal representation of the input/output sentences independent of the actual languages involved.

### More advanced papers using attention

• Show, Attend and Tell: Neural Image Caption Generation with Visual Attention by Kelvin Xu, Jimmy Lei Ba,Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel and Yoshua Bengio. This paper learns words/image alignment.

• Modeling Coverage for Neural Machine Translation by Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu and Hang Li. Their model uses a coverage vector that takes into account the attention history to help future attention.

• Incorporating Structural Alignment Biases into an Attentional Neural Translation Model by Cohn, Hoang, Vymolova, Yao, Dyer, Haffari. This paper improves the attention by incorporating other traditional linguistic ideas.

### Sequence model decoders

• Exhaustive search: intractable — the search space grows exponentially with output length

• Ancestral sampling: $x_t \sim P(x_t|x_1,...,x_{t-1})$

• Greedy search: $x_t=\operatorname{argmax}_{\tilde x_t}P(\tilde x_t|x_1,...,x_{t-1})$. If one step is wrong, it can badly derail the rest of the sequence.

• Beam search: the idea is to maintain the K best candidates at each time step.
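A runnable sketch of beam search over a toy next-word distribution; `toy` is a hypothetical stand-in for a real decoder, and real implementations usually also length-normalise the scores:

```python
import numpy as np

def beam_search(step_logprobs, bos, eos, K=3, max_len=5):
    """Keep the K highest-scoring partial hypotheses at each time step.
    step_logprobs(seq) returns log P(next word | seq) over the vocabulary."""
    beams = [([bos], 0.0)]                    # (sequence, total log-prob)
    done = []
    for _ in range(max_len):
        cand = []
        for seq, lp in beams:                 # expand every live hypothesis
            logp = step_logprobs(seq)
            for w in range(len(logp)):
                cand.append((seq + [w], lp + logp[w]))
        cand.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in cand:                  # finished ones go to done,
            (done if seq[-1] == eos else beams).append((seq, lp))
            if len(beams) == K:               # keep only K live hypotheses
                break
        if not beams:
            break
    done += beams
    return max(done, key=lambda c: c[1])[0]

def toy(seq):
    """Toy distribution over 4 words: strongly prefers (last word + 1) mod 4."""
    p = np.full(4, np.log(0.05))
    p[(seq[-1] + 1) % 4] = np.log(0.85)
    return p

print(beam_search(toy, bos=0, eos=3))  # [0, 1, 2, 3]
```

With K = 1 this degenerates to greedy search; larger K trades computation for a better approximation of the true argmax sequence.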

### Presentation

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation