Attention Is All You Need

### 1. paper reading

#### 1.1 Introduction

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t.

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

The recurrence $y_t=f(y_{t-1},x_t)$ gives RNN models two fatal drawbacks: computation within a sequence cannot be parallelized, and long-range dependencies are hard to capture.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

The attention mechanism effectively addresses the RNN's difficulty with long-range dependencies, but the inability to parallelize computation remains as long as attention is paired with a recurrent network.

#### 2. Background

##### 2.2 Self-attention

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

A structured self-attentive sentence embedding

##### 2.3 End-to-end memory networks

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

End-to-end memory networks

##### 2.4 Transformer

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

### Model Architecture

The encoder maps an input sequence of symbol representations $(x_1,...,x_n)$ to a sequence of continuous representations $z = (z_1,...,z_n)$. Given z, the decoder then generates an output sequence $(y_1,...,y_m)$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

The Transformer likewise consists of an encoder and a decoder.

#### Decoder

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

#### Attention

Really love this short description of attention:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

##### Scaled Dot-Product Attention

• queries: $Q\in R^{n\times d_k}$
• keys: $K\in R^{n\times d_k}$
• values: $V\in R^{n\times d_v}$

$q\cdot k=\sum_{i=1}^{d_k}q_ik_i$

When $d_k$ is large, the variance of $q\cdot k$ is also large: if the components of $q$ and $k$ are independent with mean 0 and variance 1, then $q\cdot k$ has mean 0 and variance $d_k$. This is why the scores are scaled by $1/\sqrt{d_k}$.
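A minimal NumPy sketch (illustrative, not the paper's code) confirming the variance argument above: with unit-variance components, the dot product's variance grows linearly with $d_k$, and dividing by $\sqrt{d_k}$ restores unit variance, keeping the softmax out of its saturated, small-gradient region.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_variance(d_k, trials=20000):
    # Draw many independent (q, k) pairs with i.i.d. N(0, 1) components
    # and estimate the empirical variance of the dot product q . k.
    q = rng.standard_normal((trials, d_k))
    k = rng.standard_normal((trials, d_k))
    return np.var((q * k).sum(axis=1))

var_64 = dot_variance(64)            # close to d_k = 64
var_scaled = dot_variance(64) / 64   # variance of (q . k) / sqrt(d_k): close to 1
```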

### Encoder

#### Stage1

##### Training data and batching

The WMT 2014 English-German dataset consists of about 4.5 million sentence pairs.

• Here sentence_len refers to the maximum length of the source-language sentences and of the target-language sentences; shorter sentences need zero padding.

• In the decoder, the query, keys, and values of self-attention are all the same; the initial values are randomly initialized, and their shape only needs to match self.input_y.

##### Embedding

We use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. In our model, we share the same weight matrix between the two embedding layers.

• Usually an embedding takes vocab_size and num_units (i.e., embed_size) as parameters, but machine translation involves two languages, so defining a single function and passing the input to it keeps the program cleaner.

• Here the vocabulary entry at index 0 is set to a constant zero vector, serving as the embedding of the zero-padding token in the input.

• Scaling: the embedding output is scaled by np.sqrt(num_units) (the paper multiplies the embedding weights by $\sqrt{d_{model}}$). I don't understand why this is done; has any paper studied it?
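The three bullets above can be sketched in NumPy (the names vocab_size and num_units follow the notes; the table values are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_units = 10, 4

# Lookup table: row 0 is held at zero so that zero-padded token ids map to
# the zero vector, as described in the notes above.
table = rng.standard_normal((vocab_size, num_units))
table[0] = 0.0

def embed(token_ids):
    # Look up and scale by sqrt(num_units), mirroring the scaling bullet.
    return table[token_ids] * np.sqrt(num_units)

batch = np.array([[3, 7, 0, 0]])   # one sentence, zero-padded to length 4
vecs = embed(batch)                # shape (1, 4, num_units)
```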

##### position encoding

pos is the position of the word in the sentence, and i indexes the i-th dimension of the $d_{model}$-dimensional word vector:

$PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}}),\quad PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$

That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π.

In an RNN (LSTM), the notion of time step is encoded implicitly, as inputs and outputs flow through the network one step at a time. In a feed-forward network, position must be represented in some other way. For the Transformer, the authors propose to encode time as sine waves added as an extra input: this signal is added to the input and output embeddings to represent the passage of time.

In general, adding positional encodings to the input embeddings is quite an interesting topic. One way is to embed the absolute position of input elements (as in ConvS2S). The authors instead use "sine and cosine functions of different frequencies". The sinusoidal version is more complicated while giving performance similar to the learned absolute-position version; the crux, however, is that it may allow the model to produce better translations of longer sentences at test time (at least sentences longer than those in the training data). In this way the sinusoidal method allows the model to extrapolate to longer sequence lengths.
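The sinusoidal encoding above can be computed in a few lines of NumPy (a sketch following the paper's formula; the small max_len and d_model are arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe

pe = positional_encoding(50, 16)
# pe is simply added to the (scaled) input embeddings, position by position.
```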

#### Stage2

##### scaled dot-product attention

$Attention(Q,K,V)=softmax\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

The Transformer reduces the number of operations required to relate (especially distant) positions in the input and output sequences to O(1). However, this comes at the cost of reduced effective resolution because of averaging attention-weighted positions, an effect counteracted by multi-head attention.
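The attention formula above maps directly to NumPy (a single-head, unbatched sketch; shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)        # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
out, w = attention(rng.standard_normal((n, d_k)),
                   rng.standard_normal((n, d_k)),
                   rng.standard_normal((n, d_v)))
```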

- h = 8 attention heads, each applying a learned linear projection (for dimensionality reduction) of the query Q and key K into $d_k$ dimensions and of the value V into $d_v$ dimensions:

$head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i) , i=1,\dots,h$

$W^Q_i, W^K_i\in\mathbb{R}^{d_{model}\times d_k},\ W^V_i\in\mathbb{R}^{d_{model}\times d_v},\quad\text{for } d_k=d_v=d_{model}/h=64$

• Scaled dot-product attention is applied in parallel on each head (each with its own linear projections of Q, K, V), yielding a $d_v$-dimensional output per head.

• The head outputs are concatenated: $Concat(head_1,\dots,head_h)$.

• The concatenation from the previous step is then linearly projected: $MultiHeadAttention(Q,K,V) = Concat(head_1,\dots,head_h)W^O$ where $W^O\in\mathbb{R}^{hd_v\times d_{model}}$.
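The three steps above can be sketched as follows (unbatched NumPy, with the paper's dimensions $d_{model}=512$, $h=8$; the weight matrices are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h          # 64, as in the paper

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Per-head projection matrices W^Q_i, W^K_i, W^V_i, and the output matrix W^O.
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((h, d_model, d_v)) / np.sqrt(d_model)
W_O = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)

def multi_head_attention(Q, K, V):
    # 1) attention per head, 2) concatenate, 3) final linear projection.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O   # (n, h*d_v) @ (h*d_v, d_model)

x = rng.standard_normal((6, d_model))   # self-attention: Q = K = V = x
y = multi_head_attention(x, x, x)
```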

##### Attention is used in the model in three ways

• 1. In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

• 2. The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

• 3. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections.
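The decoder mask in case 3 can be demonstrated in a few lines (a sketch with dummy uniform scores; real scores come from $QK^T/\sqrt{d_k}$):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))                         # dummy scores, all equal
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
scores[mask] = -np.inf                            # illegal (future) connections
weights = softmax(scores)
# Row i now spreads its weight only over positions 0..i: row 0 is
# [1, 0, 0, 0] and row 3 is [0.25, 0.25, 0.25, 0.25].
```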

The attention mechanism appears in three places in the Transformer:

• 1. Self-attention in the encoder: the queries, keys, and values all come from input_x, i.e., the source-language word representations. After several layers of multi-head attention and FFN, we obtain the final vector representation of the input sentence; without using any RNN or CNN, each word incorporates information from all the other words, and the resulting representation is better than those produced by RNNs or CNNs.

• 2. Encoder-decoder attention: the queries come from the previous sub-layer, i.e., the output of the decoder's masked multi-head attention, while the keys and values come from the encoder output.

• 3. Self-attention in the decoder: the queries, keys, and values all come from the output of the previous decoder layer.

#### Stage3: Position-wise Feed-Forward Networks

$FFN(x) = \max(0,\ xW_1+b_1)W_2+b_2$
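A sketch of this position-wise feed-forward network: two linear layers with a ReLU in between, applied identically and independently at every position (the paper's dimensions are $d_{model}=512$, $d_{ff}=2048$; the weights below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied row-wise (per position)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))   # 10 positions
y = ffn(x)                               # same shape: one output per position
```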

### Decoder

The initial input to self-attention in the decoder module is the output (target) embedding, shifted right by one position.

#### Regularization

##### label smoothing

During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
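A minimal sketch of label smoothing in the common uniform-mixture form (the one-hot target is mixed with the uniform distribution over classes; the 4-class example is arbitrary):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    # Mix the one-hot target with the uniform distribution:
    # (1 - eps) * one_hot + eps / num_classes, so each row still sums to 1.
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - eps) + eps / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=4)
# True class gets 0.9 + 0.025 = 0.925; every other class gets 0.025.
```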

Reference: