cs224d-lecture10: Machine Translation and Attention Mechanisms

Main topics:

  • Seq2Seq basic model
  • Seq2Seq encoder and decoder
  • Attention Mechanisms: three kinds of attention are introduced
  • Bahdanau et al. NMT model: the key point is how the context vector \(c_i\) is computed

Introduction

In tasks seen so far, such as NER tagging, the model predicts an output for each position from the previous words. There is another class of NLP tasks that deal with sequential output, or outputs that are sequences of potentially varying length. For example:

  • Translation: taking a sentence in one language as input and outputting the same sentence in another language

  • Conversation: taking a statement or question as input and responding to it.

  • Summarization: taking a large body of text as input and outputting a summary of it.

This section covers a deep-learning framework for such sequence generation problems: the sequence-to-sequence model.

Historical approaches to sequence generation:

  • Word-based systems: cannot capture the word order within a sentence
  • Phrase-based systems: cannot handle long-range dependencies
  • Seq2Seq models: can generate arbitrary output sequences after seeing the entire input. They can even focus in on specific parts of the input automatically to help generate a useful translation.

sequence-to-sequence Basics

Sutskever et al. 2014, "Sequence to Sequence Learning with Neural Networks"

Seq2Seq-encoder

The encoder's job is to read the input sentence into the model and produce a fixed-dimensional context vector C. Compressing the information of an arbitrarily long sentence into a single fixed-dimensional vector is clearly difficult, so the encoder usually uses stacked LSTMs.

The input sentence is usually fed in reversed. In machine translation, for example, the last word fed into the encoder then corresponds to the first word of the output translation.

Intuitively this still feels like it should not work very well, right? That is exactly what later motivated attention.

For example:

input sentence: "what is your name"

The resulting context vector is then a vector-space representation of the notion of asking someone for their name.
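
A minimal PyTorch sketch of such an encoder (the `Encoder` class name, the toy sizes, and doing the reversal inside the model are illustrative assumptions, not the exact lecture setup); it reverses the source sentence, runs a stacked LSTM, and returns the top layer's final hidden state as the context vector C:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the (reversed) input sentence and returns a fixed-size context vector."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # num_layers > 1 gives the "stacked" LSTM encoder described above.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, token_ids):
        # Reverse the source sentence (Sutskever et al. 2014), so the last word
        # fed in is close to the first word that has to be produced.
        reversed_ids = torch.flip(token_ids, dims=[1])
        embedded = self.embed(reversed_ids)         # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)   # h_n: (num_layers, batch, hidden_dim)
        # The top layer's final hidden state serves as the fixed-dimensional context C.
        return h_n[-1]                              # (batch, hidden_dim)

# Toy usage: "what is your name" as made-up token ids.
encoder = Encoder(vocab_size=1000)
context = encoder(torch.tensor([[4, 11, 7, 42]]))
print(context.shape)  # torch.Size([1, 128])
```

In practice the cell states and the states of all layers would also be handed to the decoder; returning only the top hidden state keeps the sketch minimal.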

Seq2Seq-decoder

The decoder's job is to generate the output sentence. A softmax layer on top of the topmost LSTM layer produces the output word at the current time step, and that word is then used as the input word at the next time step.

Once the output sentence has been produced, the parameters of both the encoder and the decoder are trained by minimizing the cross-entropy loss.
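
A matching decoder sketch in the same toy setup (the `Decoder` class, the `bos_id`/`eos_id` tokens, and initializing the hidden state from the context are all illustrative assumptions): a softmax over the vocabulary picks each output word, which is fed back as the next input word.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Generates the output sentence one word at a time from the context vector."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # softmax layer on top of the top LSTM

    def generate(self, context, bos_id=1, eos_id=2, max_len=20):
        # One simple choice: initialize every layer's hidden state from the context.
        h = context.unsqueeze(0).repeat(self.lstm.num_layers, 1, 1).contiguous()
        c = torch.zeros_like(h)
        word = torch.tensor([[bos_id]])                 # start-of-sentence token
        result = []
        for _ in range(max_len):
            emb = self.embed(word)                      # (1, 1, embed_dim)
            output, (h, c) = self.lstm(emb, (h, c))
            logits = self.out(output[:, -1])            # unnormalized scores over the vocabulary
            word = logits.argmax(dim=-1, keepdim=True)  # current output word ...
            if word.item() == eos_id:
                break
            result.append(word.item())                  # ... becomes the next input word
        return result

context = torch.randn(1, 128)     # stand-in for the encoder's context vector C
decoder = Decoder(vocab_size=1000)
print(decoder.generate(context))  # a list of predicted token ids
```

During training one would typically feed the reference words at each step and minimize the cross-entropy between the softmax outputs and the reference sentence, as described above.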

Bidirectional RNNs

A bidirectional RNN reads the input both forwards and backwards, so each position's representation captures both the preceding and the following words; this is used below as the encoder of the Bahdanau et al. model.

Attention Mechanism

Motivation

The seq2seq model uses a single context vector, but different parts of an input have different levels of significance. Moreover, different parts of the output may even consider different parts of the input "important."

  • That is, the words of the input sentence are not all equally important. In "the ball is on the field", for example, "ball", "on", and "field" clearly matter most.

  • Moreover, a given part of the output may care more about a particular part of the input. Typically the first few words of the output depend mostly on the first few words of the input, and the last few words of the output depend mostly on the last few words of the input.

Attention mechanisms address this by providing the decoder network with a look at the entire input sequence at every decoding step; the decoder can then decide what input words are important at any point in time. In other words, at each decoding step the attention mechanism determines a weight for every word of the input sequence.

Bahdanau et al. NMT model

Original paper: Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate

Decoder: General description

The conditional probability of generating the next word in the decoder is:

\[P(y_i|y_1,...,y_{i-1},X)=g(y_{i-1},s_i,c_i)\]

where the hidden state at the current time step, \(s_i\), is given by:

\[s_i=f(s_{i-1},y_{i-1},c_i)\]

That is, generating the word \(y_i\) at step i depends on the previously generated word \(y_{i-1}\) (when generating a sequence, the output of the previous time step is the input to the next one), the hidden state \(s_{i-1}\) at step i-1, and the corresponding context vector \(c_i\).

The key point is how to compute the context vector \(c_i\) for each time step.

In the standard seq2seq model there is only a single context vector, but in the attention model every time step has its own context vector \(c_i\). It depends on all the annotations \((h_1,...,h_{T_x})\) of the input sequence, each given a certain weight. That is: \[c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j\]

Here i indexes the i-th time step of the output sequence, and j indexes the annotation of the j-th word of the input sequence.

The weight \(\alpha_{ij}\) of each input word's annotation \(h_j\) is computed as: \[\alpha_{ij}=\dfrac{exp(e_{ij})}{\sum_{k=1}^{T_x}exp(e_{ik})}\]

where: \[e_{ij}=a(s_{i-1},h_j)\]

is the alignment model. Here \(s_{i-1}\) is a hidden state of the output sequence and \(h_j\) is a hidden state (annotation) of the input sequence, so \(e_{ij}\) scores how well position j of the input sentence matches position i of the output sequence. The score is based on the decoder's hidden state \(s_{i-1}\) at the previous time step and the j-th annotation \(h_j\) of the input sequence. The function a can be any function that returns a real value, for example a single-layer fully connected neural network. Computing the scores \(e_{i,1},...,e_{i,T_x}\) and applying a softmax then gives \(\alpha_i=(\alpha_{i,1},...,\alpha_{i,T_x})\).

Question: I understand what \(e_{ij}\) means, but not quite how it is computed. I haven't read the paper yet; my guess is that since a can be any function, it is simply parameterized by a neural network.
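
To make that concrete, here is a minimal numpy sketch, assuming (as the guess above suggests, and roughly as in the paper) that the alignment model a is a small one-layer network; the parameters `W_a`, `U_a`, `v_a` and all dimensions are made up for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions: T_x source annotations, decoder states of size dec_dim.
T_x, ann_dim, dec_dim, att_dim = 5, 8, 6, 7
rng = np.random.default_rng(0)

H = rng.normal(size=(T_x, ann_dim))   # annotations h_1..h_{T_x} from the BiRNN encoder
s_prev = rng.normal(size=dec_dim)     # decoder hidden state s_{i-1}

# Alignment model a(s_{i-1}, h_j): a one-layer MLP with illustrative parameters.
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, ann_dim))
v_a = rng.normal(size=att_dim)

# e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j), one score per source position j
e_i = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])

# alpha_{ij}: softmax over source positions -> a probability distribution
alpha_i = softmax(e_i)

# c_i: expected annotation under alpha_i
c_i = alpha_i @ H

print(alpha_i.round(3), alpha_i.sum())  # the weights sum to 1
print(c_i.shape)                        # (ann_dim,)
```

The weights \(\alpha_i\) form a probability distribution over the source positions, and \(c_i\) is the corresponding weighted average of the annotations, exactly as in the formulas above.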

To summarize:

Let \(\alpha_{ij}\) be a probability that the target word \(y_i\) is aligned to, or translated from, a source word \(x_j\). Then, the i-th context vector \(c_i\) is the expected annotation over all the annotations with probabilities \(\alpha_{ij}\).

So \(\alpha_{ij}\) is the probability that the word \(y_i\) is translated from (or aligned to) the word \(x_j\). In other words, the i-th output word \(y_i\) may be aligned to any word of the input, and not necessarily as a literal translation of one of them; the weight \(\alpha_{ij}\) measures how much each input word influences the translation of \(y_i\).

This probability \(\alpha_{ij}\), and its associated energy \(e_{ij}\), reflect the importance of \(s_{i-1}\) and \(h_j\) for generating the next word.

Question: during training, backpropagation can learn how \(c_i\) is produced, but these weights only apply to the current sequence, don't they? At test time, can the trained parameters still be used?

Encoder: bidirectional RNN for annotation sequences

The encoder encodes the input sequence into annotations \((h_1,h_2,...,h_{T_x})\). To take both the preceding words and the following words into account, a bidirectional RNN (BiRNN) is used.

forward RNN \(\overrightarrow f\) reads input sentence (from \(x_1\) to \(x_{T_x}\)): \[(\overrightarrow h_1,...,\overrightarrow h_{T_x})\]

backward RNN \(\overleftarrow f\) reads input sentence in the reverse order (from \(x_{T_x}\) to \(x_1\)): \[(\overleftarrow h_1,...,\overleftarrow h_{T_x})\]

annotation for \(x_j\): \[h_j=\left[\overrightarrow{h}_j^{T};\overleftarrow{h}_j^{T}\right]^{T}\]
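
A small PyTorch sketch of these annotations, assuming a GRU-based BiRNN with toy sizes (`bidirectional=True` runs the forward and backward passes, and the two halves of each output vector are exactly the concatenation above):

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, T_x = 16, 32, 6
birnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(1, T_x, embed_dim)   # embedded source sentence x_1..x_{T_x}
annotations, _ = birnn(x)            # (1, T_x, 2 * hidden_dim)

# annotations[:, j] == [forward h_j ; backward h_j], i.e. h_j in the formula above
print(annotations.shape)             # torch.Size([1, 6, 64])
```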

Connection with translation alignment

During decoding, the attention scores \(\alpha_{i,j}\) for a given sentence pair can be laid out as an alignment table: a table mapping words in the source sentence to the corresponding words in the target sentence, with each cell filled by the score \(\alpha_{i,j}\).

This also resolves the earlier question: what is trained are the parameters of the alignment model a (together with the rest of the network), not the weights \(\alpha_i\) themselves. At test time the weights \(\alpha_i\) are recomputed from those parameters for every new input sentence, and the alignment table is simply a way of visualizing them.

Luong et al. NMT model

Original paper: Effective Approaches to Attention-based Neural Machine Translation by Minh-Thang Luong, Hieu Pham and Christopher D. Manning

Global attention

Encoder hidden state sequence: \(h_1,...,h_n\), where n is the length of the sequence

Decoder hidden state sequence: \(\overline h_1,...,\overline h_n\)

For each decoder hidden state \(\overline h_i\), an attention (context) vector \(c_i\) is computed over all the encoder hidden states, using the score function below.

\[ score(h_j,\overline h_i)=\begin{cases} h_j^T\overline h_i \\ h_j^TW\overline h_i \\ W[h_j;\overline h_i] \end{cases} \in \mathbb{R} \]

This plays the same role as \(e_{ij}\) in the Bahdanau et al. NMT model. The weights \(\alpha_{i,j}\) again need to be probabilities, namely the probability that the encoder state \(h_j\) matches the decoder state \(\overline h_i\), so they are obtained with a softmax: \[\alpha_{i,j}=\dfrac{exp(score(h_j,\overline h_i))}{\sum_{k=1}^n exp(score(h_k,\overline h_i))}\]

The context vector is then: \[c_i=\sum_{j=1}^n \alpha_{i,j}h_j\]

The context vector and the hidden state \(\overline h_i\) are then combined to produce a new vector for the i-th decoding time step: \[\tilde h_i=f([\overline h_i,c_i])\]
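
A numpy sketch of one global-attention step, assuming the "general" score \(h_j^TW\overline h_i\) and taking f to be a tanh layer over the concatenation, as in the Luong paper (the parameter names `W`, `W_c` and the toy sizes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, enc_dim, dec_dim = 5, 8, 8
rng = np.random.default_rng(1)

H = rng.normal(size=(n, enc_dim))        # encoder hidden states h_1..h_n
h_bar_i = rng.normal(size=dec_dim)       # decoder hidden state \bar h_i

W = rng.normal(size=(enc_dim, dec_dim))  # "general" score: h_j^T W \bar h_i
W_c = rng.normal(size=(dec_dim, enc_dim + dec_dim))

scores = H @ (W @ h_bar_i)               # one score per encoder position j
alpha_i = softmax(scores)                # attention weights over the source
c_i = alpha_i @ H                        # context vector for step i

# \tilde h_i = f([\bar h_i, c_i]); here f is tanh(W_c [c_i; \bar h_i])
h_tilde_i = np.tanh(W_c @ np.concatenate([c_i, h_bar_i]))
print(alpha_i.round(3), h_tilde_i.shape)
```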

Local Attention

The model first predicts an aligned position in the input sequence, then computes a context vector using a window centered on this position. The computational cost of this attention step is constant and does not explode with the length of the sentence.

How is the window chosen? Christopher seemed to say it uses reinforcement learning.
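
A rough numpy sketch of the idea, assuming the aligned position `p` has already been predicted (the Luong paper predicts it with a small network and also adds a Gaussian weighting around `p`; both are omitted here) and reusing the "general" score from above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_context(H, h_bar, W, p, D=2):
    """Attend only to the window [p-D, p+D] around the predicted position p."""
    lo, hi = max(0, p - D), min(len(H), p + D + 1)
    window = H[lo:hi]                      # constant-size slice, independent of sentence length
    alpha = softmax(window @ (W @ h_bar))  # "general" score restricted to the window
    return alpha @ window                  # local context vector

rng = np.random.default_rng(2)
H = rng.normal(size=(50, 8))               # a long source sentence
h_bar = rng.normal(size=8)
W = rng.normal(size=(8, 8))
print(local_context(H, h_bar, W, p=20).shape)  # (8,)
```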

Google’s new NMT

Johnson et al. 2016, "Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation"

The new multilingual model not only improved their translation performance, it also enabled "zero-shot translation," in which we can translate between two languages for which we have no translation training data. For instance, if we only had examples of Japanese-English translations and Korean-English translations, Google’s team found that the multilingual NMT system trained on this data could actually generate reasonable Japanese-Korean translations. The powerful implication of this finding is that part of the decoding process is not language-specific, and the model is in fact maintaining an internal representation of the input/output sentences independent of the actual languages involved.

More advanced papers using attention

  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention by Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel and Yoshua Bengio. This paper learns words/image alignment.

  • Modeling Coverage for Neural Machine Translation by Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu and Hang Li. Their model uses a coverage vector that takes into account the attention history to help future attention.

  • Incorporating Structural Alignment Biases into an Attentional Neural Translation Model by Cohn, Hoang, Vymolova, Yao, Dyer, Haffari. This paper improves the attention by incorporating other traditional linguistic ideas.

Sequence model decoders

Using a statistical approach, we want to find the most likely output sequence \(\hat s^*\): \[\hat s^* = argmax_{\hat s}P(\hat s|s)\]

  • Exhaustive search: computationally intractable, since the number of candidate sequences grows exponentially

  • Ancestral sampling: sample each word from the model's conditional distribution given the words generated so far \[x_t \sim P(x_t|x_1,...,x_{t-1})\]

  • Greedy search: pick the most likely word at each step \[x_t=argmax_{\tilde x_t}P(\tilde x_t|x_1,...,x_{t-1})\] If one step goes wrong, it can badly affect the rest of the sequence.

  • Beam search: the idea is to maintain K candidates at each time step, extend each of them by one word, and keep only the K best extensions; see the sketch after this list.
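
A minimal beam-search sketch over log-probabilities, assuming a generic `step_fn(prefix)` that returns a log-probability distribution over the next token (here a toy random model, just to exercise the search); at every time step each candidate is extended and only the K best are kept:

```python
import numpy as np

def beam_search(step_fn, vocab_size, K=3, max_len=5, bos=0):
    # Each candidate is (token sequence, cumulative log-probability).
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            log_probs = step_fn(seq)        # log P(x_t | x_1..x_{t-1})
            for tok in range(vocab_size):
                candidates.append((seq + [tok], logp + log_probs[tok]))
        # Keep only the K highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:K]
    return beams

# Toy "model": a random conditional distribution that ignores the prefix,
# just to make the search runnable.
rng = np.random.default_rng(3)
def toy_step_fn(prefix):
    return np.log(rng.dirichlet(np.ones(10)))

for seq, logp in beam_search(toy_step_fn, vocab_size=10):
    print(seq, round(logp, 3))
```

Greedy search is the special case K = 1.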

Presentation

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation