TensorFlow Attention API Source Code Reading (1)

This post takes a detailed look at TensorFlow's attention APIs. They are wrapped so heavily that when something breaks it can be hard to track down the bug, so here I use eager execution to inspect the details step by step.

We follow the official Seq2seq Library (contrib) tutorial and work through it step by step.

This library is composed of two primary components:

  • New attention wrappers for tf.contrib.rnn.RNNCell objects.

  • A new object-oriented dynamic decoding framework.

It consists of two main parts: new attention wrappers for tf.contrib.rnn.RNNCell objects, and an object-oriented dynamic decoding framework.

Attention

Attention wrappers are RNNCell objects that wrap other RNNCell objects and implement attention. The form of attention is determined by a subclass of tf.contrib.seq2seq.AttentionMechanism. These subclasses describe the form of attention (e.g. additive vs. multiplicative) to use when creating the wrapper. An instance of an AttentionMechanism is constructed with a memory tensor, from which lookup keys and values tensors are created.

An attention wrapper is itself an RNNCell object that wraps another RNNCell and implements attention. The form of attention (e.g. additive vs. multiplicative) is determined by a subclass of tf.contrib.seq2seq.AttentionMechanism. An AttentionMechanism instance is constructed from a memory tensor, from which the keys and values used during attention are derived.

Attention Mechanisms

Attention was proposed in:
- paper: Neural Machine Translation by Jointly Learning to Align and Translate
- paper: Effective Approaches to Attention-based Neural Machine Translation

  1. The encoder, a single- or multi-layer, uni- or bidirectional RNN, produces hidden states for the source sentence: \(H=[h_1,...,h_T]\).

  2. The decoder is also an RNN; its hidden state at step \(i\) is \(s_i\), updated as \(s_i=f(s_{i-1},y_{i-1},c_i)\), where \(s_{i-1}\) is the previous hidden state, \(y_{i-1}\) is the output at the previous step, and \(c_i\) is the attention (context) vector at the current step. The question is how to compute \(c_i\).

  3. The alignment model scores how well the previous decoder state \(s_{i-1}\) matches each encoder hidden state \(h_j\): \(e_{ij}=a(s_{i-1}, h_j)\). There are many ways to compute this score; the two most common, and the two provided by the TF API, are BahdanauAttention and LuongAttention (a minimal sketch with plain ops follows this list):

\[\text{BahdanauAttention:}\quad e_{ij}=v_a^\top \tanh(W_a s_{i-1}+U_a h_j)\] \[\text{LuongAttention:}\quad e_{ij}=h_j^\top W_a s_i\]

  4. The scores are turned into probabilities with a softmax: \[a_{ij}=\dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}\] In practice the softmax differs slightly from this formula: it uses \(\exp(e_{ij}-\max_k e_{ik})\) to avoid numerical overflow.

  5. The attention (context) vector for the current step is the weighted sum of the encoder states: \(c_i=\sum_{j} a_{ij}h_j\).

  6. Then \(c_i\), \(s_{i-1}\) and \(y_{i-1}\) are used to update the decoder hidden state, and the loop continues.

  7. The output at the current step is computed from the current hidden state \(s_i\): \[y_i=Ws_i+b\]
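
To make the two score functions concrete, here is a minimal eager-execution sketch built from plain ops. Everything in it (the weight names W_a, U_a, v_a, the shapes, the random initialization) is my own choice for illustration, not part of the TF API:

import tensorflow as tf

if not tf.executing_eagerly():
    tf.enable_eager_execution()

batch, max_time, enc_dim, dec_dim, num_units = 2, 7, 16, 24, 32
h = tf.random_normal([batch, max_time, enc_dim])   # encoder states h_j
s = tf.random_normal([batch, dec_dim])             # previous decoder state s_{i-1}

# Bahdanau (additive): e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
W_a = tf.random_normal([dec_dim, num_units])
U_a = tf.random_normal([enc_dim, num_units])
v_a = tf.random_normal([num_units])
proj_s = tf.expand_dims(tf.matmul(s, W_a), 1)                        # [batch, 1, num_units]
proj_h = tf.tensordot(h, U_a, axes=[[2], [0]])                       # [batch, max_time, num_units]
e_additive = tf.reduce_sum(v_a * tf.tanh(proj_s + proj_h), axis=2)   # [batch, max_time]

# Luong (multiplicative): e_ij = h_j^T W_a s_i (here W_a maps the decoder state to enc_dim)
W_m = tf.random_normal([dec_dim, enc_dim])
e_multiplicative = tf.squeeze(
    tf.matmul(h, tf.expand_dims(tf.matmul(s, W_m), 2)), axis=2)      # [batch, max_time]

# softmax -> alignment probabilities a_ij, then the context vector c_i
a = tf.nn.softmax(e_additive)
c = tf.reduce_sum(tf.expand_dims(a, 2) * h, axis=1)                  # [batch, enc_dim]
print(e_additive.shape, e_multiplicative.shape, c.shape)             # (2, 7) (2, 7) (2, 16)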

First, the base class tf.contrib.seq2seq.AttentionMechanism.

Source:

class AttentionMechanism(object):

  @property
  def alignments_size(self):
    raise NotImplementedError

  @property
  def state_size(self):
    raise NotImplementedError

Two properties: alignments_size and state_size, both tied to the sequence length. Is alignments_size the length after masking? As the source below shows, it is actually memory's max_time, i.e. the padded length. state_size is the size of the mechanism's state, which the base class sets equal to alignments_size. Note that the attention here is computed within a single time step.

class _BaseAttentionMechanism(AttentionMechanism):
  """A base AttentionMechanism class providing common functionality.
  Common functionality includes:
    1. Storing the query and memory layers.
    2. Preprocessing and storing the memory.
  """

  def __init__(self,
               query_layer,
               memory,
               probability_fn,
               memory_sequence_length=None,
               memory_layer=None,
               check_inner_dims_defined=True,
               score_mask_value=None,
               name=None):
    """Construct base AttentionMechanism class.
    Args:
      query_layer: Callable. Instance of `tf.layers.Layer`. The layer's depth
        must match the depth of `memory_layer`. If `query_layer` is not
        provided, the shape of `query` must match that of `memory_layer`.
      memory: The memory to query; usually the output of an RNN encoder. This
        tensor should be shaped `[batch_size, max_time, ...]`.
      probability_fn: A `callable`. Converts the score and previous alignments
        to probabilities. Its signature should be:
        `probabilities = probability_fn(score, state)`.
      memory_sequence_length (optional): Sequence lengths for the batch entries
        in memory. If provided, the memory tensor rows are masked with zeros
        for values past the respective sequence lengths.
      memory_layer: Instance of `tf.layers.Layer` (may be None). The layer's
        depth must match the depth of `query_layer`.
        If `memory_layer` is not provided, the shape of `memory` must match
        that of `query_layer`.
      check_inner_dims_defined: Python boolean. If `True`, the `memory`
        argument's shape is checked to ensure all but the two outermost
        dimensions are fully defined.
      score_mask_value: (optional): The mask value for score before passing into
        `probability_fn`. The default is -inf. Only used if
        `memory_sequence_length` is not None.
      name: Name to use when creating ops.
    """
    if (query_layer is not None
        and not isinstance(query_layer, layers_base.Layer)):
      raise TypeError(
          "query_layer is not a Layer: %s" % type(query_layer).__name__)
    if (memory_layer is not None
        and not isinstance(memory_layer, layers_base.Layer)):
      raise TypeError(
          "memory_layer is not a Layer: %s" % type(memory_layer).__name__)
    self._query_layer = query_layer
    self._memory_layer = memory_layer
    self.dtype = memory_layer.dtype
    if not callable(probability_fn):
      raise TypeError("probability_fn must be callable, saw type: %s" %
                      type(probability_fn).__name__)
    if score_mask_value is None:
      score_mask_value = dtypes.as_dtype(
          self._memory_layer.dtype).as_numpy_dtype(-np.inf)
    self._probability_fn = lambda score, prev: (  # pylint:disable=g-long-lambda
        probability_fn(
            _maybe_mask_score(score, memory_sequence_length, score_mask_value),
            prev))
    with ops.name_scope(
        name, "BaseAttentionMechanismInit", nest.flatten(memory)):
      self._values = _prepare_memory(
          memory, memory_sequence_length,
          check_inner_dims_defined=check_inner_dims_defined)
      self._keys = (
          self.memory_layer(self._values) if self.memory_layer  # pylint: disable=not-callable
          else self._values)
      self._batch_size = (
          self._keys.shape[0].value or array_ops.shape(self._keys)[0])
      self._alignments_size = (self._keys.shape[1].value or
                               array_ops.shape(self._keys)[1])

  @property
  def memory_layer(self):
    return self._memory_layer

  @property
  def query_layer(self):
    return self._query_layer

  @property
  def values(self):
    return self._values

  @property
  def keys(self):
    return self._keys

  @property
  def batch_size(self):
    return self._batch_size

  @property
  def alignments_size(self):
    return self._alignments_size

  @property
  def state_size(self):
    return self._alignments_size

  def initial_alignments(self, batch_size, dtype):
    """Creates the initial alignment values for the `AttentionWrapper` class.
    This is important for AttentionMechanisms that use the previous alignment
    to calculate the alignment at the next time step (e.g. monotonic attention).
    The default behavior is to return a tensor of all zeros.
    Args:
      batch_size: `int32` scalar, the batch_size.
      dtype: The `dtype`.
    Returns:
      A `dtype` tensor shaped `[batch_size, alignments_size]`
      (`alignments_size` is the values' `max_time`).
    """
    max_time = self._alignments_size
    return _zero_state_tensors(max_time, batch_size, dtype)

  def initial_state(self, batch_size, dtype):
    """Creates the initial state values for the `AttentionWrapper` class.
    This is important for AttentionMechanisms that use the previous alignment
    to calculate the alignment at the next time step (e.g. monotonic attention).
    The default behavior is to return the same output as initial_alignments.
    Args:
      batch_size: `int32` scalar, the batch_size.
      dtype: The `dtype`.
    Returns:
      A structure of all-zero tensors with shapes as described by `state_size`.
    """
    return self.initial_alignments(batch_size, dtype)

The class _BaseAttentionMechanism is the basic attention class. Note that both self._values and self._keys are computed with the memory_sequence_length argument taken into account.

It has these attributes:
- values: computed with the _prepare_memory function, which zeroes out the parts of the input memory beyond each sample's actual length
- keys: self._keys = self.memory_layer(self._values), i.e. the values passed through a fully connected layer, with shape [batch, max_time, num_units]
- state_size and alignments_size are the same, both equal to max_time
- self._probability_fn(score, prev) first masks the score with _maybe_mask_score and then converts it to probabilities; whether prev (the previous state) is actually used depends on the subclass (as we will see, Bahdanau and Luong ignore it)

_maybe_mask_score

Source:

def _maybe_mask_score(score, memory_sequence_length, score_mask_value):
  if memory_sequence_length is None:
    return score
  message = ("All values in memory_sequence_length must greater than zero.")
  with ops.control_dependencies(
      [check_ops.assert_positive(memory_sequence_length, message=message)]):
    score_mask = array_ops.sequence_mask(
        memory_sequence_length, maxlen=array_ops.shape(score)[1])
    score_mask_values = score_mask_value * array_ops.ones_like(score)
    return array_ops.where(score_mask, score, score_mask_values)

score = tf.random_uniform(shape=[2,10])
tf.shape(score).numpy()
array([ 2, 10], dtype=int32)
score = tf.random_uniform(shape=[2,10])
memeory_sequence_len = [5,8]
score_mask_value = -100000000
score_mask = tf.sequence_mask(lengths=memeory_sequence_len, maxlen=tf.shape(score)[1])
print("true or false: %s\n" %score_mask)
score_mask_values = score_mask_value * tf.ones_like(score)
print("-inf: %s\n"%score_mask_values)
ans = tf.where(score_mask, score, score_mask_values)
print(ans)
true or false: tf.Tensor(
[[ True  True  True  True  True False False False False False]
 [ True  True  True  True  True  True  True  True False False]], shape=(2, 10), dtype=bool)

-inf: tf.Tensor(
[[-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08
  -1.e+08]
 [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08
  -1.e+08]], shape=(2, 10), dtype=float32)

tf.Tensor(
[[ 2.3987615e-01  4.9896538e-01  7.2822869e-01  4.7516704e-02
   1.6099060e-01 -1.0000000e+08 -1.0000000e+08 -1.0000000e+08
  -1.0000000e+08 -1.0000000e+08]
 [ 3.5503960e-01  2.5502288e-01  8.1264114e-01  4.3110681e-01
   1.1858845e-01  2.5748730e-02  4.8437893e-01  2.8339624e-02
  -1.0000000e+08 -1.0000000e+08]], shape=(2, 10), dtype=float32)

_prepare_memory
self._values = _prepare_memory(memory, memory_sequence_length,
                               check_inner_dims_defined=check_inner_dims_defined)

The _prepare_memory function is where the mask is computed. Its implementation:

def _prepare_memory(memory, memory_sequence_length, check_inner_dims_defined):
  """Convert to tensor and possibly mask `memory`.
  Args:
    memory: `Tensor`, shaped `[batch_size, max_time, ...]`.
    memory_sequence_length: `int32` `Tensor`, shaped `[batch_size]`.
    check_inner_dims_defined: Python boolean. If `True`, the `memory`
      argument's shape is checked to ensure all but the two outermost
      dimensions are fully defined.
  Returns:
    A (possibly masked), checked, new `memory`.
  Raises:
    ValueError: If `check_inner_dims_defined` is `True` and not
      `memory.shape[2:].is_fully_defined()`.
  """
  memory = nest.map_structure(
      lambda m: ops.convert_to_tensor(m, name="memory"), memory)
  if memory_sequence_length is not None:
    memory_sequence_length = ops.convert_to_tensor(
        memory_sequence_length, name="memory_sequence_length")
  if check_inner_dims_defined:
    def _check_dims(m):
      if not m.get_shape()[2:].is_fully_defined():
        raise ValueError("Expected memory %s to have fully defined inner dims, "
                         "but saw shape: %s" % (m.name, m.get_shape()))
    nest.map_structure(_check_dims, memory)
  if memory_sequence_length is None:
    seq_len_mask = None
  else:
    seq_len_mask = array_ops.sequence_mask(
        memory_sequence_length,
        maxlen=array_ops.shape(nest.flatten(memory)[0])[1],
        dtype=nest.flatten(memory)[0].dtype)
    seq_len_batch_size = (
        memory_sequence_length.shape[0].value
        or array_ops.shape(memory_sequence_length)[0])
  def _maybe_mask(m, seq_len_mask):
    rank = m.get_shape().ndims
    rank = rank if rank is not None else array_ops.rank(m)
    extra_ones = array_ops.ones(rank - 2, dtype=dtypes.int32)
    m_batch_size = m.shape[0].value or array_ops.shape(m)[0]
    if memory_sequence_length is not None:
      message = ("memory_sequence_length and memory tensor batch sizes do not "
                 "match.")
      with ops.control_dependencies([
          check_ops.assert_equal(
              seq_len_batch_size, m_batch_size, message=message)]):
        seq_len_mask = array_ops.reshape(
            seq_len_mask,
            array_ops.concat((array_ops.shape(seq_len_mask), extra_ones), 0))
        return m * seq_len_mask
    else:
      return m
  return nest.map_structure(lambda m: _maybe_mask(m, seq_len_mask), memory)

_prepare_memory is actually quite simple: based on each sample's real length in the batch, it sets everything beyond that length to 0. A quick sketch follows.
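
A quick eager sketch of the same masking with plain ops (this mimics what _prepare_memory does for an ordinary 3-D memory tensor; the names and shapes are mine, not the TF implementation):

import tensorflow as tf

if not tf.executing_eagerly():
    tf.enable_eager_execution()

memory = tf.random_uniform(shape=[2, 6, 4])         # [batch, max_time, depth]
memory_sequence_length = [3, 5]

mask = tf.sequence_mask(memory_sequence_length,
                        maxlen=tf.shape(memory)[1],
                        dtype=memory.dtype)          # [batch, max_time]
masked_memory = memory * tf.expand_dims(mask, -1)    # zero out the padded positions
print(masked_memory.numpy()[0, :, 0])                # last 3 entries of sample 0 are 0.0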

tf.contrib.seq2seq.BahdanauAttention

Two papers are involved here:
- Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
- Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.

class BahdanauAttention(_BaseAttentionMechanism):
  """Implements Bahdanau-style (additive) attention.
  This attention has two forms. The first is Bahdanau attention,
  as described in:
  Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
  "Neural Machine Translation by Jointly Learning to Align and Translate."
  ICLR 2015. https://arxiv.org/abs/1409.0473
  The second is the normalized form. This form is inspired by the
  weight normalization article:
  Tim Salimans, Diederik P. Kingma.
  "Weight Normalization: A Simple Reparameterization to Accelerate
  Training of Deep Neural Networks."
  https://arxiv.org/abs/1602.07868
  To enable the second form, construct the object with parameter
  `normalize=True`.
  """

  def __init__(self,
               num_units,
               memory,
               memory_sequence_length=None,
               normalize=False,
               probability_fn=None,
               score_mask_value=None,
               dtype=None,
               name="BahdanauAttention"):
    """Construct the Attention mechanism.
    Args:
      num_units: The depth of the query mechanism.
      memory: The memory to query; usually the output of an RNN encoder. This
        tensor should be shaped `[batch_size, max_time, ...]`.
      memory_sequence_length (optional): Sequence lengths for the batch entries
        in memory. If provided, the memory tensor rows are masked with zeros
        for values past the respective sequence lengths.
      normalize: Python boolean. Whether to normalize the energy term.
      probability_fn: (optional) A `callable`. Converts the score to
        probabilities. The default is @{tf.nn.softmax}. Other options include
        @{tf.contrib.seq2seq.hardmax} and @{tf.contrib.sparsemax.sparsemax}.
        Its signature should be: `probabilities = probability_fn(score)`.
      score_mask_value: (optional): The mask value for score before passing into
        `probability_fn`. The default is -inf. Only used if
        `memory_sequence_length` is not None.
      dtype: The data type for the query and memory layers of the attention
        mechanism.
      name: Name to use when creating ops.
    """
  • num_units is the depth of the query mechanism. It does not have to equal either the query's depth or the memory's depth.
  • Does the query's depth have to match that of the memory (i.e. keys/values)? No. In BahdanauAttention this is easy to see: both are projected by dense layers to the same final depth and can then be added. In LuongAttention, however, the matrix multiplication constrains the dimensions (a small eager check follows these notes).
  • memory_sequence_length: this argument matters; the mask removes the effect of padding.
  • score_mask_value: only used when the previous argument is given; defaults to -inf.
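
As a quick eager check of the query-depth point above (my own sketch, assuming the contrib classes behave as the source just shown): BahdanauAttention happily projects a query of any depth to num_units, while LuongAttention rejects a query whose depth differs from num_units.

import tensorflow as tf

if not tf.executing_eagerly():
    tf.enable_eager_execution()

memory = tf.random_normal([2, 7, 5])     # [batch, max_time, depth]
query = tf.random_normal([2, 13])        # query depth 13 != num_units 32
state = tf.zeros([2, 7])

bahdanau = tf.contrib.seq2seq.BahdanauAttention(num_units=32, memory=memory)
alignments, _ = bahdanau(query, state)   # works: query_layer maps depth 13 -> 32
print(alignments.shape)                  # (2, 7)

luong = tf.contrib.seq2seq.LuongAttention(num_units=32, memory=memory)
try:
    luong(query, state)
except ValueError as e:                  # _luong_score: query/keys depth mismatch
    print("LuongAttention rejected the query:", e)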

Continuing with the implementation:

    if probability_fn is None:
      probability_fn = nn_ops.softmax
    if dtype is None:
      dtype = dtypes.float32
    wrapped_probability_fn = lambda score, _: probability_fn(score)
    super(BahdanauAttention, self).__init__(
        query_layer=layers_core.Dense(
            num_units, name="query_layer", use_bias=False, dtype=dtype),
        memory_layer=layers_core.Dense(
            num_units, name="memory_layer", use_bias=False, dtype=dtype),
        memory=memory,
        probability_fn=wrapped_probability_fn,
        memory_sequence_length=memory_sequence_length,
        score_mask_value=score_mask_value,
        name=name)
    self._num_units = num_units
    self._normalize = normalize
    self._name = name

  • Now the meaning of query_layer and memory_layer in _BaseAttentionMechanism is clear: both are dense layers that project to num_units.
  • score_mask_value follows the handling in the base class.

Next, the __call__ function, i.e. how the attention is actually computed:

  def __call__(self, query, state):
    """Score the query based on the keys and values.
    Args:
      query: Tensor of dtype matching `self.values` and shape
        `[batch_size, query_depth]`.
      state: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]`
        (`alignments_size` is memory's `max_time`).
    Returns:
      alignments: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]` (`alignments_size` is memory's
        `max_time`).
    """
    with variable_scope.variable_scope(None, "bahdanau_attention", [query]):
      processed_query = self.query_layer(query) if self.query_layer else query
      score = _bahdanau_score(processed_query, self._keys, self._normalize)
    alignments = self._probability_fn(score, state)
    next_state = alignments
    return alignments, next_state

Now, how the score is computed: score = _bahdanau_score(processed_query, self._keys, self._normalize), where processed_query and self._keys have both gone through dense layers; processed_query has shape [batch, num_units] and self._keys has shape [batch, max_time, num_units].

def _bahdanau_score(processed_query, keys, normalize):
  """Implements Bahdanau-style (additive) scoring function.
  """
  dtype = processed_query.dtype
  # Get the number of hidden units from the trailing dimension of keys
  num_units = keys.shape[2].value or array_ops.shape(keys)[2]
  # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
  processed_query = array_ops.expand_dims(processed_query, 1)
  v = variable_scope.get_variable(
      "attention_v", [num_units], dtype=dtype)
  if normalize:
    # Scalar used in weight normalization
    g = variable_scope.get_variable(
        "attention_g", dtype=dtype,
        initializer=init_ops.constant_initializer(math.sqrt((1. / num_units))),
        shape=())
    # Bias added prior to the nonlinearity
    b = variable_scope.get_variable(
        "attention_b", [num_units], dtype=dtype,
        initializer=init_ops.zeros_initializer())
    # normed_v = g * v / ||v||
    normed_v = g * v * math_ops.rsqrt(
        math_ops.reduce_sum(math_ops.square(v)))
    return math_ops.reduce_sum(
        normed_v * math_ops.tanh(keys + processed_query + b), [2])
  else:
    return math_ops.reduce_sum(v * math_ops.tanh(keys + processed_query), [2])

The last step of the score computation in the source is not a dense layer; instead it is:

v = tf.get_variable("attention_v", [num_units])
score = tf.reduce_sum(v * tanh(keys + processed_query), [2])
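
The same final step can be reproduced by hand in eager mode; a small sketch (v here is just a random stand-in for the attention_v variable, and the shapes are my own):

import tensorflow as tf

if not tf.executing_eagerly():
    tf.enable_eager_execution()

batch, max_time, num_units = 2, 10, 32
processed_query = tf.random_normal([batch, num_units])        # query after query_layer
keys = tf.random_normal([batch, max_time, num_units])         # memory after memory_layer
v = tf.random_normal([num_units])                             # stand-in for attention_v

q = tf.expand_dims(processed_query, 1)                        # [batch, 1, num_units], broadcasts over max_time
score = tf.reduce_sum(v * tf.tanh(keys + q), axis=2)          # [batch, max_time]
print(score.shape)                                            # (2, 10)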

import tensorflow as tf
import numpy as np
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()
print(tfe.executing_eagerly())

memory = tf.ones(shape=[1, 10, 5]) # batch=1, max_sequence_len=10, embed_size=5
memory_sequence_len = [5] # the valid length is 5
attention_mechnism = tf.contrib.seq2seq.BahdanauAttention(num_units=32, memory=memory,
memory_sequence_length=memory_sequence_len)
True
print(attention_mechnism.state_size, attention_mechnism.alignments_size)
10 10
memory
<tf.Tensor: id=3, shape=(1, 10, 5), dtype=float32, numpy=
array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]]], dtype=float32)>
attention_mechnism.values   # values is just memory with the positions beyond memory_sequence_length set to 0
<tf.Tensor: id=30, shape=(1, 10, 5), dtype=float32, numpy=
array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]], dtype=float32)>
print(attention_mechnism.keys.shape)  # after the memory_layer dense projection
attention_mechnism.keys.numpy()[0,1,:]
(1, 10, 32)

array([ 0.09100786,  0.18448338, -0.7751561 ,  0.00775184,  0.467805  ,
        0.9172474 ,  0.57645243, -0.3915946 , -0.22213435,  0.76866853,
        0.3591721 ,  0.8922573 ,  0.15866229,  0.6033571 ,  0.51816225,
        0.3820553 , -0.39130217,  0.04532939, -0.02089322,  0.6878175 ,
       -0.28697258,  0.59283376, -0.37825382, -0.5865691 ,  0.17466056,
       -0.5915747 ,  0.6070496 , -0.18531135, -0.821724  ,  1.2838829 ,
        0.15700272, -0.2608306 ], dtype=float32)
print(attention_mechnism.query_layer, attention_mechnism.memory_layer)
<tensorflow.python.layers.core.Dense object at 0x7fa0464da908> <tensorflow.python.layers.core.Dense object at 0x7fa0464dab38>
# use __call__ to compute the alignments and the next state
query = tf.ones(shape=[1, 8]) # query_depth = 8 (it need not equal num_units for Bahdanau)
state_h0 = attention_mechnism.initial_alignments(batch_size=1, dtype=tf.float32)
attention_vector = attention_mechnism(query=query, state=state_h0)
print(attention_vector)
(<tf.Tensor: id=125, shape=(1, 10), dtype=float32, numpy=array([[0.2, 0.2, 0.2, 0.2, 0.2, 0. , 0. , 0. , 0. , 0. ]], dtype=float32)>, <tf.Tensor: id=125, shape=(1, 10), dtype=float32, numpy=array([[0.2, 0.2, 0.2, 0.2, 0.2, 0. , 0. , 0. , 0. , 0. ]], dtype=float32)>)

tf.contrib.seq2seq.LuongAttention

paper: Effective Approaches to Attention-based Neural Machine Translation, EMNLP 2015.

class LuongAttention(_BaseAttentionMechanism):
  """Implements Luong-style (multiplicative) attention scoring.
  """

  def __init__(self,
               num_units,
               memory,
               memory_sequence_length=None,
               scale=False,
               probability_fn=None,
               score_mask_value=None,
               dtype=None,
               name="LuongAttention"):
    if probability_fn is None:
      probability_fn = nn_ops.softmax
    if dtype is None:
      dtype = dtypes.float32
    wrapped_probability_fn = lambda score, _: probability_fn(score)
    super(LuongAttention, self).__init__(
        query_layer=None,
        memory_layer=layers_core.Dense(
            num_units, name="memory_layer", use_bias=False, dtype=dtype),
        memory=memory,
        probability_fn=wrapped_probability_fn,
        memory_sequence_length=memory_sequence_length,
        score_mask_value=score_mask_value,
        name=name)
    self._num_units = num_units
    self._scale = scale
    self._name = name

Notice that the query is not processed by a query_layer, i.e. there is no dense projection on the query. The memory is still projected by a dense layer, giving keys of shape [batch, max_time, num_units].

Next, the __call__ function, which computes the alignment probabilities and next_state.

  def __call__(self, query, state):
    """Score the query based on the keys and values.
    Args:
      query: Tensor of dtype matching `self.values` and shape
        `[batch_size, query_depth]`.
      state: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]`
        (`alignments_size` is memory's `max_time`).
    Returns:
      alignments: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]` (`alignments_size` is memory's
        `max_time`).
    """
    with variable_scope.variable_scope(None, "luong_attention", [query]):
      score = _luong_score(query, self._keys, self._scale)
    alignments = self._probability_fn(score, state)
    next_state = alignments
    return alignments, next_state

Next, how the score is computed:

def _luong_score(query, keys, scale):
  """Implements Luong-style (multiplicative) scoring function.
  Args:
    query: Tensor, shape `[batch_size, num_units]` to compare to keys.
    keys: Processed memory, shape `[batch_size, max_time, num_units]`.
    scale: Whether to apply a scale to the score function.
  Returns:
    A `[batch_size, max_time]` tensor of unnormalized score values.
  Raises:
    ValueError: If `key` and `query` depths do not match.
  """
  depth = query.get_shape()[-1]
  key_units = keys.get_shape()[-1]
  if depth != key_units:
    raise ValueError(
        "Incompatible or unknown inner dimensions between query and keys. "
        "Query (%s) has units: %s. Keys (%s) have units: %s. "
        "Perhaps you need to set num_units to the keys' dimension (%s)?"
        % (query, depth, keys, key_units, key_units))
  dtype = query.dtype

  # Reshape from [batch_size, depth] to [batch_size, 1, depth]
  # for matmul.
  query = array_ops.expand_dims(query, 1)

  # Inner product along the query units dimension.
  # matmul shapes: query is [batch_size, 1, depth] and
  #                keys is [batch_size, max_time, depth].
  # the inner product is asked to **transpose keys' inner shape** to get a
  # batched matmul on:
  #   [batch_size, 1, depth] . [batch_size, depth, max_time]
  # resulting in an output shape of:
  #   [batch_size, 1, max_time].
  # we then squeeze out the center singleton dimension.
  score = math_ops.matmul(query, keys, transpose_b=True)
  score = array_ops.squeeze(score, [1])

  if scale:
    # Scalar used in weight scaling
    g = variable_scope.get_variable(
        "attention_g", dtype=dtype,
        initializer=init_ops.ones_initializer, shape=())
    score = g * score
  return score

From the source we can see that when LuongAttention's __call__ is invoked, the query depth must equal num_units, while BahdanauAttention has no such requirement.

The score computation can be reproduced as follows:

batch_size = 2
query_depth = num_units = 32
memory_depth = 15
max_times = 10
embed_size = 5
scale = True

query = tf.random_normal(shape=[batch_size, num_units])
# memory = tf.random_normal(shape=[batch_size, max_times, memory_depth])
# values = self._prepare_memory(memory)
# keys = memory_layer(values)
values = tf.random_normal(shape=[batch_size, max_times, memory_depth])
keys = tf.layers.dense(inputs=values, units=num_units) # [batch, max_times, num_units]

query = tf.expand_dims(query, axis=1) # [batch, 1, num_units]
score = tf.matmul(query, keys, transpose_b=True) # [batch, 1, max_times]

score = tf.squeeze(score, axis=1) # [batch, max_times]
print(score.shape)
(2, 10)
### a complete end-to-end pass
memory = tf.random_normal(shape=[batch_size, max_times, memory_depth])
memory_sequence_len = [5,8]
query_len = 5

query = tf.random_normal(shape=[batch_size, num_units])
state = tf.zeros(shape=[batch_size, max_times])
attention_mechnism = tf.contrib.seq2seq.LuongAttention(num_units=num_units,
memory=memory,
memory_sequence_length=memory_sequence_len)
attention_vector = attention_mechnism(query, state)
attention_vector[0], attention_vector[1] # alignments and next_state (identical here)
(<tf.Tensor: id=1024, shape=(2, 10), dtype=float32, numpy=
 array([[3.6951914e-01, 5.4255807e-01, 2.4851409e-03, 1.8003594e-02,
         6.7433923e-02, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [3.5050314e-05, 1.2835214e-02, 1.6479974e-03, 1.4405438e-06,
         3.3324495e-01, 6.4109474e-01, 1.0316775e-02, 8.2380348e-04,
         0.0000000e+00, 0.0000000e+00]], dtype=float32)>,
 <tf.Tensor: id=1024, shape=(2, 10), dtype=float32, numpy=
 array([[3.6951914e-01, 5.4255807e-01, 2.4851409e-03, 1.8003594e-02,
         6.7433923e-02, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [3.5050314e-05, 1.2835214e-02, 1.6479974e-03, 1.4405438e-06,
         3.3324495e-01, 6.4109474e-01, 1.0316775e-02, 8.2380348e-04,
         0.0000000e+00, 0.0000000e+00]], dtype=float32)>)
tf.reduce_sum(attention_vector[0][1]).numpy()
1.0

This only covers a single query, i.e. one decoder step. In practice the queries look like [batch, query_len, num_units]; what then? The mechanism is simply called once per decoder time step, which is what the attention wrapper / dynamic decoding framework does; a rough sketch of that loop follows.
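
A rough sketch of that per-step loop (my own code, not the AttentionWrapper implementation; the shapes and the context-vector computation are assumptions based on the formulas above):

import tensorflow as tf

if not tf.executing_eagerly():
    tf.enable_eager_execution()

batch_size, max_times, memory_depth, num_units, query_len = 2, 10, 15, 32, 4

memory = tf.random_normal([batch_size, max_times, memory_depth])
mechanism = tf.contrib.seq2seq.LuongAttention(num_units=num_units, memory=memory,
                                              memory_sequence_length=[5, 8])
queries = tf.random_normal([batch_size, query_len, num_units])   # one query per decoder step

state = mechanism.initial_state(batch_size, tf.float32)
contexts = []
for t in range(query_len):
    alignments, state = mechanism(queries[:, t, :], state)        # one attention step
    # context vector c_t: weighted sum of the (masked) values
    context = tf.reduce_sum(tf.expand_dims(alignments, 2) * mechanism.values, axis=1)
    contexts.append(context)

print(tf.stack(contexts, axis=1).shape)                           # (2, 4, 15)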

Summary

Finally, a summary.

Look once more at how the two attention mechanisms differ in their initialization:

super(BahdanauAttention, self).__init__(
    query_layer=layers_core.Dense(
        num_units, name="query_layer", use_bias=False, dtype=dtype),
    memory_layer=layers_core.Dense(
        num_units, name="memory_layer", use_bias=False, dtype=dtype),
    memory=memory,
    probability_fn=wrapped_probability_fn,
    memory_sequence_length=memory_sequence_length,
    score_mask_value=score_mask_value,
    name=name)

super(LuongAttention, self).__init__(
    query_layer=None,
    memory_layer=layers_core.Dense(
        num_units, name="memory_layer", use_bias=False, dtype=dtype),
    memory=memory,
    probability_fn=wrapped_probability_fn,
    memory_sequence_length=memory_sequence_length,
    score_mask_value=score_mask_value,
    name=name)

As class instances, AttentionMechanism / BahdanauAttention / LuongAttention have the following properties:

  • query_layer: in BahdanauAttention this is a tf.layers.Dense instance projecting to num_units, so the query depth in BahdanauAttention can be anything. In LuongAttention query_layer is None, so the query depth must be num_units.
  • memory_layer: the same in both, a tf.layers.Dense projecting to num_units.
  • alignments_size: the alignment size, i.e. memory's max_time.
  • batch_size: the batch size.
  • values: the memory after masking, shape [batch, max_time, embed_size].
  • keys: the values after the memory_layer dense projection, shape [batch, max_time, num_units].
  • state_size: equal to alignments_size.

Then the corresponding methods.
__init__: constructs the instance, with the arguments:
- num_units: in BahdanauAttention this is really an intermediate size; query and keys are both projected to it, added, and then reduce-summed over this dimension. In LuongAttention it must equal the query depth, since the query is matrix-multiplied with the projected memory.
- memory: [batch, max_time, embed_size]
- normalize: whether to use the normalized (weight-normalization) form
- probability_fn: tf.nn.softmax (default), tf.contrib.seq2seq.hardmax, or tf.contrib.sparsemax.sparsemax
- memory_sequence_length: the unpadded lengths of the memory entries, shape [batch_size]

__call__(query, state): invokes the instance:
- query: [batch_size, query_depth]; in LuongAttention query_depth must equal num_units.
- state: [batch_size, alignments_size].

What is state actually for? In the source it is passed to the function that computes the alignments:

alignments = self._probability_fn(score, state)

self._probability_fn = lambda score, prev: (  # pylint:disable=g-long-lambda
    probability_fn(
        _maybe_mask_score(score, memory_sequence_length, score_mask_value),
        prev))

# score is masked if needed, and probability_fn is tf.nn.softmax.
# So... is prev actually used? Tracing it back up, it turns out it is not:

probability_fn=wrapped_probability_fn

wrapped_probability_fn = lambda score, _: probability_fn(score)

# Both Bahdanau and Luong simply drop the previous state. The base class still
# threads it through because some mechanisms (e.g. monotonic attention) do use
# the previous alignments when computing the next ones.

initial_alignments(batch_size, dtype): creates the initial alignment values for the AttentionWrapper class.
Args:
- batch_size: int32 scalar, the batch_size.
- dtype: The dtype.

Returns:
- A dtype tensor shaped [batch_size, alignments_size]

initial_state(batch_size, dtype):
Creates the initial state values for the AttentionWrapper class.
Args:
- batch_size: int32 scalar, the batch_size.
- dtype: The dtype.