Paper Notes: Attention Is All You Need

Attention Is All You Need

1. paper reading

1.1 Introduction

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t.

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

RNN models have two fatal drawbacks, both rooted in the recurrence \[h_t=f(h_{t-1},x_t)\] computation cannot be parallelized within a sequence, and long-range dependencies are hard to capture.
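As a minimal sketch (with a hypothetical cell function f), the recurrence forces a step-by-step loop over time, so the hidden states of one sequence cannot be computed in parallel:

import numpy as np

def rnn_unroll(x, h0, f):
    """Unroll a generic recurrent cell over a sequence.

    x  : array of shape [T, input_dim], one row per time step
    h0 : initial hidden state of shape [hidden_dim]
    f  : hypothetical cell function mapping (h_prev, x_t) -> h_t
    """
    h = h0
    states = []
    for t in range(x.shape[0]):   # each step must wait for the previous one
        h = f(h, x[t])            # h_t depends on h_{t-1}: no parallelism over t
        states.append(h)
    return np.stack(states)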

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

The attention mechanism effectively solves the problem that RNNs cannot model long-range dependencies, but the inability to parallelize the computation remains.

2. Background

2.1 Extended Neural GPU [16], ByteNet [18] and ConvS2S [9]
2.2 Self-attention

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

A structured self-attentive sentence embedding

2.3 End-to-end memory networks

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

End-to-end memory networks

2.4 Transformer

Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution.

The paper mainly compares against three prior works: Extended Neural GPU, ByteNet and ConvS2S (listed in Section 2.1 above).

Model Architecture

In earlier encoder-decoder models:

the encoder maps an input sequence of symbol representations \((x_1,...,x_n)\) to a sequence of continuous representations \(z = (z_1,...,z_n)\). Given z, the decoder then generates an output sequence \((y_1,...,y_m)\) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

Earlier attention mechanisms could also handle the long-dependency problem, but because they sit on top of an RNN, the computation at each step can only start after the previous time step has finished, so it cannot be parallelized.

The Transformer also consists of an encoder and a decoder.

Encoder

The encoder is a stack of 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by a layer-normalization layer.
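Schematically, every sub-layer (attention or feed-forward) is wrapped the same way; a minimal sketch, assuming hypothetical layer_norm and sublayer callables:

def residual_sublayer(x, sublayer, layer_norm):
    """One encoder sub-layer: the wrapped operation, a residual connection, then layer normalization.

    x          : [batch, length, d_model]
    sublayer   : hypothetical callable (multi-head self-attention or the feed-forward network)
    layer_norm : hypothetical layer-normalization callable
    """
    return layer_norm(x + sublayer(x))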

Decoder

The decoder is very similar: it is also a stack of 6 identical layers, but each layer has 3 sub-layers, adding a multi-head attention over the encoder output. Residual connections and layer normalization are used in the same way.

The self-attention sub-layer is modified with masking:

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

The point is that the word generated at time step t may depend only on the words before t.
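A toy numpy sketch of such a causal mask (illustrative only; the notes' TF implementation appears later):

import numpy as np

T = 5                                     # toy target length
causal_mask = np.tril(np.ones((T, T)))    # row i has ones only at columns <= i
scores = np.random.randn(T, T)            # raw attention scores
# scores of "future" positions get a large negative value, so after softmax
# their attention weights are effectively zero
masked_scores = np.where(causal_mask == 1, scores, -1e9)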

Attention

Really love this short description of attention:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
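A minimal numpy sketch of that description for a single query, using the dot product as the compatibility function (an illustrative choice, not the only one):

import numpy as np

def attention(query, keys, values):
    """query: [d_k], keys: [m, d_k], values: [m, d_v] -> output: [d_v]."""
    scores = keys @ query                        # compatibility of the query with each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the m key-value pairs
    return weights @ values                      # weighted sum of the values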

Scaled Dot-Product Attention
  • queries: \(Q\in R^{n\times d_k}\)
  • keys: \(K\in R^{n\times d_k}\)
  • values: \(V\in R^{n\times d_v}\)

The dot product between vectors is used as the similarity score, softmax turns the scores into weights, and the values are summed with those weights. This kind of attention is actually quite common; here Google essentially gives the structure an official name.

The paper also compares against another commonly used attention mechanism, additive attention (proposed in the classic paper Neural Machine Translation by Jointly Learning to Align and Translate), which computes the compatibility with a feed-forward network (see cs224d-lecture10 on machine translation and attention for the formulas). The two are similar in theoretical computational complexity, but in practice dot-product attention is faster and more space-efficient because it can be implemented with highly optimized matrix multiplication. For a broader comparison of attention mechanisms, see Massive Exploration of Neural Machine Translation Architectures.

One thing to note is the scaling ("Scaled"). For large values of \(d_k\), additive attention outperforms unscaled dot-product attention. The authors suspect that when \(d_k\) is large the dot products grow large in magnitude, so the weights produced by the softmax are pushed very close to 0 or 1 and the gradients become extremely small.

\(q\cdot k=\sum_{i=1}^{d_k}q_ik_i\)

When \(d_k\) is large, the variance of \(q\cdot k\) is also large.
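A quick, purely illustrative numpy check of this claim, assuming the components of q and k are independent with zero mean and unit variance (as the paper does):

import numpy as np

np.random.seed(0)
for d_k in (4, 64, 1024):
    q = np.random.randn(10000, d_k)
    k = np.random.randn(10000, d_k)
    dots = np.sum(q * k, axis=1)                        # q · k for 10000 random pairs
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
# The variance of q·k grows roughly linearly with d_k, while the scaled version
# stays close to 1, keeping the softmax away from its saturated (near one-hot)
# region where gradients vanish.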

Multi-Head Attention

The rest of this note walks through each module of the model in the order the data flows through it.

Components and Training

The three parts above (Encoder, Decoder, Attention) form the basic architecture of the Transformer. The concrete details and the training procedure are worked out in code below.

Encoder

Stage1

Training data and batching

The authors used the WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs.

Here I will instead use the Chinese-English translation data from the challenger.ai competition hosted by Sinovation Ventures: https://challenger.ai/datasets/translation

For now the data preprocessing is skipped; the model just uses placeholders.

# add placeholders
self.input_x = tf.placeholder(dtype=tf.int32, shape=[None, self.sentence_len])
self.input_y = tf.placeholder(dtype=tf.int32, shape=[None, self.sentence_len])

# define decoder inputs
self.decoder_inputs = tf.concat([tf.ones_like(self.input_y[:,:1])*2, self.input_y[:,:-1]],axis=-1) # 2:<S>
  • sentence_len is the maximum sentence length on the source side and on the target side respectively; shorter sentences are zero padded.

  • In the decoder self-attention the query, keys and values are all the same tensor. Here it is self.decoder_inputs, i.e. self.input_y shifted right by one position with the <S> token (id 2) prepended, so its shape matches self.input_y.

Embedding

we use learned embeddings to convert the input tokens and output tokens to vectors of dimension \(d_{model}\). In our model, we share the same weight matrix between the two embedding layers.
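A minimal TF 1.x sketch of how such sharing could look, assuming a joint source/target vocabulary (the function and variable names here are hypothetical, not from the notes' code):

import tensorflow as tf

def shared_embedding(src_ids, tgt_ids, vocab_size, d_model):
    """Look up source and target token ids in the same learned weight matrix."""
    with tf.variable_scope("shared_embedding", reuse=tf.AUTO_REUSE):
        table = tf.get_variable("weights", [vocab_size, d_model])
    scale = d_model ** 0.5                         # the paper scales embeddings by sqrt(d_model)
    src_emb = tf.nn.embedding_lookup(table, src_ids) * scale
    tgt_emb = tf.nn.embedding_lookup(table, tgt_ids) * scale
    return src_emb, tgt_emb

The implementation below keeps two separate embedding calls (the Chinese and English vocabularies differ), so the weights are not actually shared there.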

# TF 1.x imports used by the code snippets in this note
import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers import xavier_initializer


def embedding(inputs,
              vocab_size,
              num_units,
              zero_pad=True,
              scale=True,
              reuse=None):
    """Embed token ids into dense vectors.

    :param inputs: A `Tensor` with type `int32` or `int64` containing the ids
        to be looked up in the lookup table. Shape is [batch, sentence_len].
    :param vocab_size: vocabulary size
    :param num_units: number of embedding units; in the paper this is d_model
    :param zero_pad: If True, all the values of the first row (id 0)
        are constant zeros (used for the padding token).
    :param scale: If True, the outputs are multiplied by sqrt(num_units).
    :param reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    :return:
        A `Tensor` with one more rank than `inputs`, whose last dimension
        is `num_units`.
    """
    with tf.variable_scope("embedding-layer", reuse=reuse):
        embedding = tf.get_variable("embedding", [vocab_size, num_units],
                                    initializer=xavier_initializer())
        if zero_pad:
            embedding = tf.concat([tf.zeros([1, num_units]),
                                   embedding[1:, :]], axis=0)  # index 0 is the padding token
        output = tf.nn.embedding_lookup(embedding, inputs)  # [batch, sentence_len, num_units]
        if scale:
            output = output * np.sqrt(num_units)

    return output
  • An embedding function usually just takes vocab_size and num_units (i.e. embed_size) as arguments. Machine translation involves two languages, so wrapping the lookup in a single function and passing the inputs keeps the code cleaner.

  • The row at index 0 of the vocabulary is fixed to a constant zero vector; it serves as the embedding of the zero-padding token.

  • Scaling: the output is multiplied by np.sqrt(num_units). It is not obvious why this helps; is there a paper that studies it? (The paper itself only states that the embedding weights are multiplied by sqrt(d_model).)

Position encoding

pos is the position of the word in the sentence and i indexes the dimension of the \(d_{model}\)-dimensional embedding:

\[PE_{(pos,2i)}=\sin\left(pos/10000^{2i/d_{model}}\right),\quad PE_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d_{model}}\right)\]

That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π.


def position_encoding_mine(n_position, d_model):
    """Build the sinusoid position encoding table.

    :param n_position: the length of the sentence
    :param d_model: same dimensionality as the embedding
    :return: a [n_position, d_model] float32 array
    """
    # keep row 0 (pos = 0) as a zero vector for the padding token
    encoding = np.zeros([n_position, d_model], np.float32)
    for pos in range(1, n_position):
        for i in range(0, d_model):
            encoding[pos, i] = pos / np.power(10000, 2. * i / d_model)

    encoding[1:, 0::2] = np.sin(encoding[1:, 0::2])  # dim 2i
    encoding[1:, 1::2] = np.cos(encoding[1:, 1::2])  # dim 2i + 1
    return encoding


def positional_encoding(inputs,
                        num_units,
                        zero_pad=True,
                        scale=True,
                        scope="positional_encoding",
                        reuse=None):
    """Sinusoidal positional encoding.

    Args:
        inputs: A 2d Tensor with shape of (N, T).
        num_units: Output dimensionality.
        zero_pad: Boolean. If True, all the values of the first row (id = 0) are constant zero.
        scale: Boolean. If True, the output is multiplied by sqrt(num_units) (see the paper).
        scope: Optional scope for `variable_scope`.
        reuse: Boolean, whether to reuse the weights of a previous layer by the same name.

    Returns:
        A Tensor with one more rank than `inputs`, whose last dimension is `num_units`.
    """
    N = tf.shape(inputs)[0]                  # dynamic batch size
    T = inputs.get_shape().as_list()[1]      # static sentence length
    with tf.variable_scope(scope, reuse=reuse):
        position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])  # [N, T]

        # First part of the PE function: the sin/cos argument pos / 10000^(2i/d_model)
        position_enc = np.array([
            [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
            for pos in range(T)])  # [T, num_units]

        # Second part: apply sin to even dimensions and cos to odd dimensions
        position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
        position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i + 1

        # Convert to a tensor
        lookup_table = tf.convert_to_tensor(position_enc, dtype=tf.float32)

        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), axis=0)
        outputs = tf.nn.embedding_lookup(lookup_table, position_ind)  # [N, T, num_units]

        if scale:
            outputs = outputs * num_units ** 0.5

        return outputs

For an explanation of the position encoding, see this blog post: The Transformer – Attention is all you need.

In RNN (LSTM), the notion of time step is encoded in the sequence as inputs/outputs flow one at a time. In a feed-forward network, the positional information must be represented in some other way. In the case of the Transformer, the authors propose to encode time as a sine wave, added as an extra input. Such a signal is added to the inputs and outputs to represent the passing of time.

In general, adding positional encodings to the input embeddings is a quite interesting topic. One way is to embed the absolute position of input elements (as in ConvS2S). However, authors use "sine and cosine functions of different frequencies". The "sinusoidal" version is quite complicated, while giving similar performance to the absolute position version. The crux is however, that it may allow the model to produce better translation on longer sentences at test time (at least longer than the sentences in the training data). This way sinusoidal method allows the model to extrapolate to longer sequence lengths.

Honestly, I still don't fully get it...

Visualizing the encoding matrix:

import numpy as np
import matplotlib.pyplot as plt

num_units = 100
sentence_len = 10

i = np.tile(np.expand_dims(np.arange(num_units), 0), [sentence_len, 1])    # (100,) -> (1, 100) -> (10, 100)
pos = np.tile(np.expand_dims(np.arange(sentence_len), 1), [1, num_units])  # (10,) -> (10, 1) -> (10, 100)

# angle = pos / 10000^(2i / d_model), matching the functions above
matrix = pos / np.power(10000.0, 2.0 * i / num_units)

matrix[:, 0::2] = np.sin(matrix[:, 0::2])  # even dims: sin
matrix[:, 1::2] = np.cos(matrix[:, 1::2])  # odd dims: cos

im = plt.imshow(matrix, aspect='auto')
plt.show()

Stage2

scaled dot-product attention

\[\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V\]
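A compact numpy sketch of this formula for a batch of queries (illustrative; the TF implementation with masking and multiple heads follows below):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: [n, d_k], K: [m, d_k], V: [m, d_v] -> [n, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # [n, m]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V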

Multi-head attention

The Transformer reduces the number of operations required to relate (especially distant) positions in the input and output sequences to O(1). However, this comes at the cost of reduced effective resolution, because attention-weighted positions are averaged.

  • h = 8 parallel attention layers ("heads"). Each head linearly projects the keys K and queries Q into \(d_k\) dimensions and the values V into \(d_v\) dimensions (a dimensionality reduction):

\[head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i) , i=1,\dots,h\]

where:

\[W^Q_i, W^K_i\in\mathbb{R}^{d_{model}\times d_k},\quad W^V_i\in\mathbb{R}^{d_{model}\times d_v},\quad\text{with } d_k=d_v=d_{model}/h=64\]

  • Scaled dot-product attention is applied in parallel in each head (each with its own linear projections of Q, K, V) and yields a \(d_v\)-dimensional output.

  • The outputs of all heads are concatenated: Concat\((head_1,\dots,head_h)\)

  • The concatenation from the previous step is projected linearly (see the shape check below): \[MultiHead(Q,K,V) = Concat(head_1,\dots,head_h) W^O\] where \(W^O\in\mathbb{R}^{hd_v\times d_{model}}\)
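A quick shape check with the paper's numbers (d_model = 512, h = 8, d_k = d_v = 64), using numpy reshapes purely as an illustration; the TF code below instead splits the heads along the batch axis with tf.split/tf.concat:

import numpy as np

batch, length, d_model, h = 2, 10, 512, 8
d_k = d_model // h                                                  # 64
x = np.random.randn(batch, length, d_model)

heads = x.reshape(batch, length, h, d_k).transpose(0, 2, 1, 3)      # [batch, h, length, d_k]
# ... each head runs scaled dot-product attention independently ...
merged = heads.transpose(0, 2, 1, 3).reshape(batch, length, h * d_k)
assert merged.shape == (batch, length, d_model)                     # concat restores d_model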

Attention is used in the model in three different ways:

  • 1. In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

  • 2. The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

  • 3. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

To summarize, the Transformer uses attention in three ways:

  • 1. Self-attention in the encoder: the queries, keys and values all come from input_x, i.e. the source-language word representations. After several layers of multi-head attention and FFN we obtain the final vector representation of the input sentence; every word's representation incorporates information from all the other words, without using an RNN or CNN, and the resulting representations work better than those produced by RNNs or CNNs.

  • 2. Encoder-decoder attention: the queries come from the previous sub-layer, i.e. the output of the masked multi-head attention in the decoder, while the keys and values come from the encoder output.

  • 3. Self-attention in the decoder: the queries, keys and values all come from the output of the previous decoder layer.

def multiheadattention(q,
                       k,
                       v,
                       d_model,
                       heads,
                       keys_mask=None,
                       causality=None,
                       dropout_keep_prob=0.1,
                       is_training=True):
    """Multi-head scaled dot-product attention.

    :param q: A 3d tensor with shape of [batch, length_q, d_model].
    :param k: A 3d tensor with shape of [batch, length_kv, d_model].
    :param v: A 3d tensor with shape of [batch, length_kv, d_model].
    :param heads: An int. Number of heads.
    :param dropout_keep_prob: dropout rate applied to the attention weights.
    :param causality: If True, positions that reference the future are masked.
    :return: A 3d tensor with shape of [batch, length_q, d_model].
    """
    # 1. Linear projections
    with tf.variable_scope('linear-projection-multiheads'):
        q_proj = tf.layers.dense(q, d_model)  # [batch, length_q, d_model]
        k_proj = tf.layers.dense(k, d_model)  # [batch, length_kv, d_model]
        v_proj = tf.layers.dense(v, d_model)  # [batch, length_kv, d_model]

    with tf.variable_scope("multihead-attention"):
        # d_k = d_v = d_model / heads
        if d_model % heads != 0:
            raise ValueError("Key/value/query depth (%d) must be divisible by "
                             "the number of attention heads (%d)" % (d_model, heads))

        # 2. split into heads and stack the heads along the batch dimension
        q_ = tf.concat(tf.split(q_proj, heads, axis=2), axis=0)  # [batch*heads, length_q, d_k]
        k_ = tf.concat(tf.split(k_proj, heads, axis=2), axis=0)  # [batch*heads, length_kv, d_k]
        v_ = tf.concat(tf.split(v_proj, heads, axis=2), axis=0)  # [batch*heads, length_kv, d_v]

        # 3. attention scores
        # For each of the length_q query words we take the dot product with every one of the
        # length_kv key words, so the score matrix per example is [length_q, length_kv].
        scalar = (d_model // heads) ** -0.5  # 1/sqrt(d_k)
        outputs = tf.matmul(q_ * scalar, k_, transpose_b=True)  # [batch*heads, length_q, length_kv]

        # 4. key masking: ignore scores that correspond to zero-padded key positions
        if keys_mask is not None:
            # sign(x) = -1 if x < 0; 0 if x == 0; 1 if x > 0
            key_masks = tf.sign(tf.abs(tf.reduce_sum(k, axis=-1)))  # [batch, length_kv]
            key_masks = tf.tile(key_masks, [heads, 1])  # [batch*heads, length_kv]
            key_masks = tf.tile(tf.expand_dims(key_masks, 1),
                                [1, q.get_shape().as_list()[1], 1])  # [batch*heads, length_q, length_kv]

            # tf.where(condition, x, y): take elements from x where condition is True, else from y
            paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # [batch*heads, length_q, length_kv]

        # Causality = future blinding: in the decoder self-attention a position must not
        # see the positions after it, so the scores above the diagonal are masked out.
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :])  # [length_q, length_kv]
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # lower-triangular matrix of ones
            masks = tf.tile(tf.expand_dims(tril, 0),
                            [tf.shape(outputs)[0], 1, 1])  # [batch*heads, length_q, length_kv]

            paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # [batch*heads, length_q, length_kv]

        # turn the scores into probabilities
        outputs = tf.nn.softmax(outputs)

        # query masking: zero out the attention weights of zero-padded query positions
        query_masks = tf.sign(tf.abs(tf.reduce_sum(q, axis=-1)))  # [batch, length_q]
        query_masks = tf.tile(query_masks, [heads, 1])  # [batch*heads, length_q]
        query_masks = tf.tile(tf.expand_dims(query_masks, axis=-1),
                              [1, 1, k.get_shape().as_list()[1]])  # [batch*heads, length_q, length_kv]
        outputs = outputs * query_masks  # [batch*heads, length_q, length_kv]

        # dropout on the attention weights
        outputs = tf.layers.dropout(outputs, rate=dropout_keep_prob, training=is_training)

        # weighted sum of the values
        outputs = tf.matmul(outputs, v_)  # [batch*heads, length_q, d_v]

        # restore shape
        outputs = tf.concat(tf.split(outputs, heads, axis=0), axis=-1)  # [batch, length_q, d_model]

        # residual connection
        outputs += q  # [batch, length_q, d_model]

        # layer normalization (Normalize is a helper defined elsewhere in the notes)
        outputs = Normalize(outputs)

    return outputs  # [batch, length_q, d_model]

For a detailed walkthrough of the code, see this blog post: 机器翻译模型Transformer代码详细解析.

Stage3: Position-wise Feed-Forward Networks

\[FFN(x) = \max(0,\ xW_1+b_1)W_2+b_2\]

def position_wise_feed_forward(inputs,
                               num_units1=2048,
                               num_units2=512,
                               reuse=None):
    """Position-wise feed-forward net.

    :param inputs: A 3D tensor with shape of [batch, length_q, d_model]
    :param num_units1: An integer, the inner-layer dimensionality.
    :param num_units2: An integer, the output dimensionality (d_model).
    :param reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    :return: A 3d tensor with the same shape and dtype as inputs
    """
    with tf.variable_scope("feed-forward-networks", reuse=reuse):
        # inner layer with ReLU
        params1 = {"inputs": inputs, "filters": num_units1, "kernel_size": 1,
                   "activation": tf.nn.relu, "use_bias": True, "strides": 1}
        outputs = tf.layers.conv1d(**params1)

        # readout layer
        params2 = {"inputs": outputs, "filters": num_units2, "kernel_size": 1,
                   "activation": None, "use_bias": True, "strides": 1}
        outputs = tf.layers.conv1d(**params2)

        # residual connection
        outputs += inputs

        # layer normalization
        outputs = Normalize(outputs)

    return outputs


def position_wise_feed_forward_mine(inputs,
                                    num_units1=2048,
                                    num_units2=512,
                                    reuse=None):
    """The same computation written with explicit weight matrices."""
    with tf.variable_scope("feed-forward-networks", reuse=reuse):
        W1 = tf.get_variable("weight1", [inputs.get_shape().as_list()[-1], num_units1],
                             initializer=xavier_initializer())
        b1 = tf.get_variable("bias1", [num_units1], initializer=tf.constant_initializer(0.1))
        outputs = tf.nn.relu(tf.einsum('aij,jk->aik', inputs, W1) + b1)  # [batch, length_q, num_units1]

        W2 = tf.get_variable("weight2", [outputs.get_shape().as_list()[-1], num_units2],
                             initializer=xavier_initializer())
        b2 = tf.get_variable("bias2", [num_units2], initializer=tf.constant_initializer(0.1))
        outputs = tf.einsum('aij,jk->aik', outputs, W2) + b2  # [batch, length_q, num_units2]

        # residual connection
        outputs += inputs

        # layer normalization
        outputs = Normalize(outputs)

    return outputs

Putting the encoder modules together:

def _encoder(self):
    with tf.variable_scope("encoder"):
        # 1. embedding
        with tf.variable_scope("embedding-layer"):
            self.enc = embedding(inputs=self.input_x,
                                 vocab_size=self.vocab_size_cn,
                                 num_units=self.d_model,
                                 scale=True)  # [batch, sentence_len, d_model]

        # 2. position encoding (added to the embeddings, as in the paper)
        with tf.variable_scope("position_encoding"):
            encoding = position_encoding_mine(self.enc.get_shape().as_list()[1], self.d_model)
            self.enc += encoding

        # 3. dropout
        self.enc = tf.layers.dropout(self.enc,
                                     rate=self.dropout_keep_prob,
                                     training=self.is_training)

        # 4. blocks
        for i in range(self.num_layers):
            with tf.variable_scope("num_layer_{}".format(i)):
                # multi-head self-attention over the source sentence
                self.enc = multiheadattention(q=self.enc,
                                              k=self.enc,
                                              v=self.enc,
                                              d_model=self.d_model,
                                              heads=self.heads,
                                              causality=False,
                                              dropout_keep_prob=self.dropout_keep_prob,
                                              is_training=True)
                # position-wise feed-forward network
                self.enc = position_wise_feed_forward(self.enc,
                                                      num_units1=4 * self.d_model,
                                                      num_units2=self.d_model,
                                                      reuse=False)
    return self.enc

Decoder

The initial input to the self-attention in the decoder:

# define decoder inputs
self.decoder_inputs = tf.concat([tf.ones_like(self.input_y[:,:1])*2, self.input_y[:,:-1]],axis=-1) # 2:<S>

Unlike the encoder, the decoder uses both encoder-decoder attention and (masked) self-attention.

def _decoder(self):
    with tf.variable_scope("decoder"):
        # embedding
        self.dec = embedding(self.decoder_inputs,
                             vocab_size=self.vocab_size_en,
                             num_units=self.d_model)  # [batch, sentence_len, d_model]

        # position encoding (added to the embeddings, as in the paper)
        encoding = position_encoding_mine(self.dec.get_shape().as_list()[1], self.d_model)
        self.dec += encoding

        # blocks
        for i in range(self.num_layers):
            with tf.variable_scope("num_layers_{}".format(i)):
                # masked self-attention over the already generated target words
                with tf.variable_scope("self-attention"):
                    self.dec = multiheadattention(q=self.dec,
                                                  k=self.dec,
                                                  v=self.dec,
                                                  d_model=self.d_model,
                                                  heads=self.heads,
                                                  keys_mask=True,
                                                  causality=True)

                # encoder-decoder attention: queries from the decoder, keys/values from the encoder
                with tf.variable_scope("encoder-decoder-attention"):
                    self.dec = multiheadattention(q=self.dec,
                                                  k=self.enc,
                                                  v=self.enc,
                                                  d_model=self.d_model,
                                                  heads=self.heads,
                                                  keys_mask=True,
                                                  causality=False)  # no future masking here

                # position-wise feed-forward network
                self.dec = position_wise_feed_forward(self.dec,
                                                      num_units1=4 * self.d_model,
                                                      num_units2=self.d_model)  # [batch, sentence_len, d_model]

    return self.dec

Optimizer

def add_train_op(self):
    self.optimizer = tf.train.AdamOptimizer(self.lr, beta1=0.9, beta2=0.98, epsilon=1e-9)
    self.train_op = self.optimizer.minimize(self.loss, global_step=self.global_step)
    return self.train_op

Regularization

Residual Dropout
label smoothing

During training, we employed label smoothing of value \(\epsilon_{ls} = 0.1\) [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

def label_smoothing(inputs, epsilon=0.1):
    """Applies label smoothing. See https://arxiv.org/abs/1512.00567

    :param inputs: A 3d tensor with shape of [N, T, V], where V is the vocabulary size.
    :param epsilon: Smoothing rate.
    """
    K = inputs.get_shape().as_list()[-1]  # number of classes (vocabulary size)
    return ((1 - epsilon) * inputs) + (epsilon / K)
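A hedged sketch of how the smoothed targets could feed into the loss, assuming the model exposes decoder logits self.logits of shape [N, T, V] (that name is illustrative, not from the code above):

# smooth the one-hot targets before the cross-entropy loss
y_onehot = tf.one_hot(self.input_y, depth=self.vocab_size_en)              # [N, T, V]
y_smoothed = label_smoothing(y_onehot, epsilon=0.1)
xent = tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_smoothed,
                                                  logits=self.logits)      # [N, T]
self.loss = tf.reduce_mean(xent)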
