# Paper Notes: Deep Transition Architecture

## paper 1

paper: Self-Attention: A Better Building Block for Sentiment Analysis Neural Network Classifiers, 2018 WASSA@EMNLP

• Sinusoidal Position Encoding

• Learned Position Encoding

• Relative Position Representations

Sinusoidal position encoding is the one used in the Transformer. Its advantage is that even when a test sentence is longer than every sentence in the training set, its position encoding can still be computed.

Relative position representations perform best; the authors overlap with the Transformer authors, so the paper is worth reading: Self-Attention with Relative Position Representations.

For this method, the self-attention mechanism is modified to explicitly learn the relative positional information between every two sequence positions. As a result, the input sequence is modeled as a labeled, directed, fully-connected graph, where the labels represent positional information. A tunable parameter k is also introduced that limits the maximum distance considered between two sequence positions. [Shaw et al., 2018] hypothesized that this will allow the model to generalize to longer sequences at test time.
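A small NumPy sketch (with hypothetical helper names) of the relative-position labeling described above: every pair of positions $(i, j)$ gets the label $\mathrm{clip}(j-i, -k, k)$, so only $2k+1$ distinct relative-position embeddings are learned regardless of sequence length.

```python
import numpy as np

def relative_position_ids(seq_len, k):
    """Labels for every pair (i, j): clip(j - i, -k, k), shifted to [0, 2k]."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]   # rel[i, j] = j - i
    rel = np.clip(rel, -k, k)           # distances beyond k share one label
    return rel + k                      # shift into [0, 2k] for embedding lookup

ids = relative_position_ids(5, 2)
# ids[0] == [2, 3, 4, 4, 4]: positions further than k=2 to the right share one label
```

The clipping is exactly the mechanism hypothesized to help generalization: a sequence longer than anything seen in training produces no out-of-range labels.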

## paper 2

• encoder transition

• query transition

• decoder transition

### GRU

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = \tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))$$

reset gate:

$$r_t = \sigma(W_{xr}x_t+W_{hr}h_{t-1})$$

update gate:

$$z_t=\sigma(W_{xz}x_t+W_{hz}h_{t-1})$$
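The four GRU equations above can be transcribed directly into one step function. This is a minimal NumPy sketch with hypothetical weight names mirroring the subscripts; biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W):
    """One GRU step. W maps subscript names ("xz", "hz", ...) to matrices."""
    z_t = sigmoid(W["xz"] @ x_t + W["hz"] @ h_prev)              # update gate
    r_t = sigmoid(W["xr"] @ x_t + W["hr"] @ h_prev)              # reset gate
    h_cand = np.tanh(W["xh"] @ x_t + r_t * (W["hh"] @ h_prev))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_cand                     # interpolation
```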

### T-GRU (transition GRU)

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = \tanh(r_t\odot (W_{hh}h_{t-1}))$$

reset gate:

$$r_t = \sigma(W_{hr}h_{t-1})$$

update gate:

$$z_t=\sigma(W_{hz}h_{t-1})$$

### L-GRU (Linear Transformation Enhanced GRU)

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = \tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))+ l_t\odot H(x_t)$$

$$H(x_t)=W_xx_t$$

$$l_t=\sigma(W_{xl}x_t+W_{hl}h_{t-1})$$

Substituting $H(x_t)=W_xx_t$: $\tilde h_t = \tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1})) + l_t\odot W_xx_t$

### DTMT

#### Encoder

$L_s$ denotes the depth of the encoder transition; $j$ denotes the current time step.

$$\overrightarrow h_{j,0}=\text{L-GRU}(x_j, \overrightarrow h_{j-1,L_s})$$

$$\overrightarrow h_{j,k}=\text{T-GRU}(\overrightarrow h_{j, k-1}),\text{ for } 1\le k\le L_s$$
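A rough NumPy sketch of one deep-transition encoder time step: an L-GRU consumes the token embedding, then each T-GRU refines the state with no new input, matching the two equations above. Weight names are hypothetical and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l_gru(x, h, W):
    """L-GRU: GRU plus a gated linear transformation of the input."""
    z = sigmoid(W["xz"] @ x + W["hz"] @ h)
    r = sigmoid(W["xr"] @ x + W["hr"] @ h)
    l = sigmoid(W["xl"] @ x + W["hl"] @ h)                        # linear gate
    cand = np.tanh(W["xh"] @ x + r * (W["hh"] @ h)) + l * (W["x"] @ x)
    return (1 - z) * h + z * cand

def t_gru(h, W):
    """T-GRU: a GRU whose gates and candidate depend only on the state."""
    z = sigmoid(W["hz"] @ h)
    r = sigmoid(W["hr"] @ h)
    cand = np.tanh(r * (W["hh"] @ h))
    return (1 - z) * h + z * cand

def encoder_step(x_j, h_prev_top, Wl, Wt_list):
    """One time step: h_{j,0} from the L-GRU, then L_s T-GRU transitions."""
    h = l_gru(x_j, h_prev_top, Wl)    # h_{j,0}
    for Wt in Wt_list:                # h_{j,k}, k = 1..L_s
        h = t_gru(h, Wt)
    return h                          # h_{j,L_s}, fed to the next time step
```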

#### Decoder

• query transition: depth $L_q$

• decoder transition: depth $L_d$

### Tricks

# Paper Notes: Attention Is All You Need

Attention Is All You Need

#### 1.1 Introduction

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

RNN models have two fatal drawbacks:

$$y_t=f(y_{t-1},x_t)$$

• First, the difficulty of learning long-term dependencies (see [Gradient flow in recurrent nets: the difficulty of learning long-term dependencies](http://www.bioinf.jku.at/publications/older/ch7.pdf));

• Second, computation cannot be parallelized (each time step depends on the previous one), which makes training slow.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

The attention mechanism effectively addresses the RNN's long-range dependency problem, but the lack of parallelism remains.

#### 2. Background

##### 2.2 Self-attention

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

A structured self-attentive sentence embedding

##### 2.3 End-to-end memory networks

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

End-to-end memory networks

##### 2.4 Transformer

Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

### Model Architecture

The encoder maps an input sequence of symbol representations $(x_1,…,x_n)$ to a sequence of continuous representations $z = (z_1,…,z_n)$. Given z, the decoder then generates an output sequence $(y_1,…,y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

The Transformer likewise consists of an encoder and a decoder.

#### Decoder

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

#### Attention

Really love this short description of attention:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

##### Scaled Dot-Product Attention

• queries: $Q\in R^{n\times d_k}$

• keys: $K\in R^{n\times d_k}$

• values: $V\in R^{n\times d_v}$

$$q\cdot k=\sum_{i=1}^{d_k}q_ik_i$$

### Components and Training

### Encoder

#### Stage1

##### Training data and batching

WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.

• In the decoder, the query, keys, and values of self-attention are identical; the initial values are randomly initialized, and their shape only needs to match self.input_y.
##### Embedding

we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. In our model, we share the same weight matrix between the two embedding layers.

• Usually an embedding layer takes vocab_size and num_units (i.e., embed_size) as parameters, but machine translation involves two languages, so defining a single function and passing the input to it keeps the code cleaner.
• Here the vocabulary entry at index=0 is set to a constant 0 vector, serving as the word vector for the zero padding in the input.
• Scaling by np.sqrt(num_units). Why is this done? Has any paper studied it? (Note: the Transformer paper states that the embedding weights are multiplied by $\sqrt{d_{model}}$.)
##### position encoding

pos is the position of the word in the sentence; i is the i-th dimension of the $d_{model}$-dimensional word vector.

That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π.
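A common NumPy sketch of this sinusoidal scheme: even dimensions use sine, odd dimensions use cosine, and each sin/cos pair shares a frequency drawn from the geometric progression above.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd dims use cos."""
    pos = np.arange(max_len)[:, None]                        # (max_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)  # shared per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe
```

Because the function is defined for any pos, encodings for positions longer than any training sentence can still be computed.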

The Transformer – Attention is all you need.

In an RNN (LSTM), the notion of time step is encoded in the sequence itself, as inputs/outputs flow through one at a time. In an FNN, positional information must be represented some other way so that word order is preserved. In the case of the Transformer, the authors propose to encode time as a sine wave, added as an extra input: this signal is added to the inputs and outputs to represent time passing.

In general, adding positional encodings to the input embeddings is a quite interesting topic. One way is to embed the absolute position of input elements (as in ConvS2S). However, the authors use “sine and cosine functions of different frequencies”. The “sinusoidal” version is more complicated, while giving performance similar to the absolute-position version. The crux, however, is that it may allow the model to produce better translations on longer sentences at test time (at least longer than the sentences in the training data). In this way the sinusoidal method allows the model to extrapolate to longer sequence lengths.

#### Stage2

##### scaled dot-product attention

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$$
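A minimal NumPy sketch of the scaled dot-product attention formula, with the softmax applied row-wise over the scaled scores:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (n, d_k), V: (n, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n) compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of values
```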

The Transformer reduces the number of operations required to relate (especially distant) positions in the input and output sequences to O(1). However, this comes at the cost of reduced effective resolution because of averaging attention-weighted positions.

• h = 8 attention layers (aka “heads”) that linearly project (for the purpose of dimensionality reduction) the key K and query Q into $d_k$ dimensions and the value V into $d_v$ dimensions:

$$head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i) , i=1,\dots,h$$

$$W^Q_i, W^K_i\in\mathbb{R}^{d_{model}\times d_k},\quad W^V_i\in\mathbb{R}^{d_{model}\times d_v},\quad \text{for } d_k=d_v=d_{model}/h = 64$$

• scaled-dot attention applied in parallel on each layer (different linear projections of k,q,v) results in $d_v$-dimensional output.
• concatenate outputs of each layer (different linear projection; also referred as ”head”): Concat$(head_1,…,head_h)$
• linearly project the concatenation result from the previous step:

$$MultiHeadAttention(Q,K,V) = Concat(head_1,\dots,head_h) W^O$$

where $W^O\in\mathbb{R}^{hd_v\times d_{model}}$
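The project-attend-concatenate-project pipeline above can be sketched as follows; the parameter lists are hypothetical, with one $(W^Q_i, W^K_i, W^V_i)$ triple per head.

```python
import numpy as np

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """WQ/WK/WV: lists of h per-head projections; WO: (h*d_v, d_model)."""
    heads = []
    for W_q, W_k, W_v in zip(WQ, WK, WV):        # one projection triple per head
        q, k, v = Q @ W_q, K @ W_k, V @ W_v      # project into d_k / d_v dims
        s = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product scores
        w = np.exp(s - s.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)            # row-wise softmax
        heads.append(w @ v)                      # (n, d_v) per head
    return np.concatenate(heads, axis=-1) @ WO   # (n, h*d_v) @ WO -> (n, d_model)
```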

##### Attention is applied in three ways in the model

• 1.In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
• 2.The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
• 3.Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

The attention mechanism appears in three places in the Transformer:

• 1.Self-attention in the encoder module, where queries, keys, and values all come from input_x, i.e., the source-language word representations. Through stacked multi-head attention and FFN layers, the final vector representation of the input sentence is obtained; without any RNN or CNN, every word incorporates information from all other words, and the resulting representation outperforms those produced by RNNs and CNNs.
• 2.Encoder-decoder attention, where the queries come from the previous sub-layer, i.e., the output of the decoder's masked multi-head attention, while the keys and values come from the encoder's output.
• 3.Self-attention in the decoder module, where queries, keys, and values all come from the output of the previous decoder layer.
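The decoder-side masking can be illustrated with a small sketch: entries above the diagonal of the score matrix are set to −∞ before the softmax, so attention weights on future positions become exactly zero.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions (j > i) with -inf, then row-wise softmax."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))  # exp(-inf) = 0
    return w / w.sum(-1, keepdims=True)

w = causal_attention_weights(np.zeros((3, 3)))
# w is lower-triangular: position 0 attends only to itself,
# position 1 attends to positions 0 and 1, and so on.
```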

#### Stage3: Position-wise Feed-Forward Networks

$$\mathrm{FFN}(x) = \max(0,\ xW_1+b_1)W_2+b_2$$
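A direct NumPy transcription of the FFN formula: the same two-layer ReLU MLP applied independently at every position (every row of x).

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """x: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU, then linear
```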

### Decoder

The initial input to self-attention in the decoder module:

#### Optimizer

#### Regularization

##### label smoothing

During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
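A sketch of uniform label smoothing, assuming the standard form $q' = (1-\epsilon_{ls})\,\delta + \epsilon_{ls}/K$ over $K$ classes:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Blend a one-hot target with the uniform distribution over K classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

y = smooth_labels(np.array([0.0, 0.0, 1.0, 0.0]), eps=0.1)
# y == [0.025, 0.025, 0.925, 0.025]: mass moved off the true class
```

The model is trained against this softened target, which is exactly why perplexity worsens while accuracy and BLEU improve.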
