Tensorflow Attention API 源码阅读1

这节内容是详细了解一下 tensorflow 关于 attention 的 api。由于封装得太好,之前使用时一旦报错就很难找出 bug 在哪儿,所以这里用 eager execution 来看具体细节。

按照官方教程 Seq2seq Library (contrib) 这里的流程逐步深入。

This library is composed of two primary components:

  • New attention wrappers for tf.contrib.rnn.RNNCell objects.
  • A new object-oriented dynamic decoding framework.

主要包括两个部分:一个是针对 tf.contrib.rnn.RNNCell 对象的新的 attention wrapper,另一个是面向对象的动态解码框架。

Attention

Attention wrappers are RNNCell objects that wrap other RNNCell objects and implement attention. The form of attention is determined by a subclass of tf.contrib.seq2seq.AttentionMechanism. These subclasses describe the form of attention (e.g. additive vs. multiplicative) to use when creating the wrapper. An instance of an AttentionMechanism is constructed with a memory tensor, from which lookup keys and values tensors are created.

attention wrapper 本身也是 RNNCell 对象,它包装其他 RNNCell 并实现 attention;具体的 attention 形式(additive vs. multiplicative)由 tf.contrib.seq2seq.AttentionMechanism 的子类决定。AttentionMechanism 的实例在 memory 的基础上构造,attention 过程中的 keys 和 values 都是从 memory 得到的。

Attention Mechanisms

attention 的计算流程如下:

  1. encoder 采用单层或多层、单向或双向的 rnn 得到 source sentence 的隐藏状态表示 $H=[h_1,…,h_T]$。
  2. decoder 在 t 时间步的隐藏状态为 $s_t$。decoder 阶段也是 rnn,其隐藏状态的更新为 $s_i=f(s_{i-1},y_{i-1},c_i)$,其中 $s_{i-1}$ 是上一个隐藏状态,$y_{i-1}$ 是上一时间步的输出,$c_i$ 是当前时间步的 attention vector。那么问题就是怎么计算当前时间步的 $c_i$。
  3. 先计算对齐得分 $e_{ij}=a(s_{i-1}, h_j)$,也就是衡量上一个隐藏状态 $s_{i-1}$ 与 encoder 中每一个 hidden state 的 match 程度。计算这个 score 有很多种方式,其中最常见的、也是 tf api 中实现的两种是 BahdanauAttention 和 LuongAttention:

$$\text{BahdanauAttention:}\quad e_{ij}=v_a^T\tanh(W_a s_{i-1}+U_a h_j)$$

$$\text{LuongAttention:}\quad e_{ij}=h_j^T W_a s_i$$

  4. 然后对得到的对齐 score 使用 softmax 得到相应的概率:

$$a_{ij}=\dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$$

实际实现中的 softmax 相比上面的公式有一点区别,即先减去最大值再取指数:$\exp(e_{ij}-\max_k e_{ik})$,用来防止数值溢出。

  5. 将上一步得到的概率 $a_{ij}$ 作为权重,对 encoder 的各个 $h_j$ 做加权和,得到当前时间步的 attention vector $c_i=\sum_j a_{ij}h_j$。
  6. 再用 $c_i, s_{i-1}, y_{i-1}$ 更新 decoder 的隐藏状态 $s_i$,如此循环下去。
  7. 根据当前的隐藏状态 $s_i$ 计算得到当前时间步的输出 $y_i$:

$$y_i=Ws_{i}+b$$
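为了把上面的流程串起来,先用一个极简的 numpy 草图过一遍单个 decoder 时间步的计算(只是示意,维度和初始化都是随手假设的,并不是 tf 的实现):

import numpy as np

def softmax(x):
    x = x - x.max()          # 减最大值防止溢出,对应上面的说明
    e = np.exp(x)
    return e / e.sum()

T, enc_dim, dec_dim, num_units = 6, 4, 5, 8   # 假设的维度
H = np.random.randn(T, enc_dim)               # encoder 隐藏状态 [h_1, ..., h_T]
s_prev = np.random.randn(dec_dim)             # 上一个 decoder 隐藏状态 s_{i-1}

# Bahdanau (additive) score: e_ij = v^T tanh(W s_{i-1} + U h_j)
W = np.random.randn(num_units, dec_dim)
U = np.random.randn(num_units, enc_dim)
v = np.random.randn(num_units)
e = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in H])   # [T]

a = softmax(e)                 # 对齐概率 a_ij
c = (a[:, None] * H).sum(0)    # 加权和得到 attention vector c_i, shape [enc_dim]
print(a.shape, c.shape)        # (6,) (4,)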

先看父类 tf.contrib.seq2seq.AttentionMechanism

源码:


class AttentionMechanism(object):

  @property
  def alignments_size(self):
    raise NotImplementedError

  @property
  def state_size(self):
    raise NotImplementedError

两个属性: alignments_size 和 state_size。从接下来的源码可以看到,二者在基类中是一样的,都等于 memory 的 max_time(也就是 padding 之后的长度,而不是 mask 之后的实际长度);state 在这里保存的其实是上一步的 alignments,而不是 RNN 的隐藏状态。显然这里的 attention 也是一个时间步内的计算。


class _BaseAttentionMechanism(AttentionMechanism):
  """A base AttentionMechanism class providing common functionality.

  Common functionality includes:
    1. Storing the query and memory layers.
    2. Preprocessing and storing the memory.
  """

  def __init__(self,
               query_layer,
               memory,
               probability_fn,
               memory_sequence_length=None,
               memory_layer=None,
               check_inner_dims_defined=True,
               score_mask_value=None,
               name=None):
    """Construct base AttentionMechanism class.

    Args:
      query_layer: Callable.  Instance of `tf.layers.Layer`.  The layer's depth
        must match the depth of `memory_layer`.  If `query_layer` is not
        provided, the shape of `query` must match that of `memory_layer`.
      memory: The memory to query; usually the output of an RNN encoder.  This
        tensor should be shaped `[batch_size, max_time, ...]`.
      probability_fn: A `callable`.  Converts the score and previous alignments
        to probabilities. Its signature should be:
        `probabilities = probability_fn(score, state)`.
      memory_sequence_length (optional): Sequence lengths for the batch entries
        in memory.  If provided, the memory tensor rows are masked with zeros
        for values past the respective sequence lengths.
      memory_layer: Instance of `tf.layers.Layer` (may be None).  The layer's
        depth must match the depth of `query_layer`.
        If `memory_layer` is not provided, the shape of `memory` must match
        that of `query_layer`.
      check_inner_dims_defined: Python boolean.  If `True`, the `memory`
        argument's shape is checked to ensure all but the two outermost
        dimensions are fully defined.
      score_mask_value: (optional): The mask value for score before passing into
        `probability_fn`. The default is -inf. Only used if
        `memory_sequence_length` is not None.
      name: Name to use when creating ops.
    """
    if (query_layer is not None
        and not isinstance(query_layer, layers_base.Layer)):
      raise TypeError(
          "query_layer is not a Layer: %s" % type(query_layer).__name__)
    if (memory_layer is not None
        and not isinstance(memory_layer, layers_base.Layer)):
      raise TypeError(
          "memory_layer is not a Layer: %s" % type(memory_layer).__name__)
    self._query_layer = query_layer
    self._memory_layer = memory_layer
    self.dtype = memory_layer.dtype
    if not callable(probability_fn):
      raise TypeError("probability_fn must be callable, saw type: %s" %
                      type(probability_fn).__name__)
    if score_mask_value is None:
      score_mask_value = dtypes.as_dtype(
          self._memory_layer.dtype).as_numpy_dtype(-np.inf)
    self._probability_fn = lambda score, prev: (  # pylint:disable=g-long-lambda
        probability_fn(
            _maybe_mask_score(score, memory_sequence_length, score_mask_value),
            prev))
    with ops.name_scope(
        name, "BaseAttentionMechanismInit", nest.flatten(memory)):
      self._values = _prepare_memory(
          memory, memory_sequence_length,
          check_inner_dims_defined=check_inner_dims_defined)
      self._keys = (
          self.memory_layer(self._values) if self.memory_layer  # pylint: disable=not-callable
          else self._values)
      self._batch_size = (
          self._keys.shape[0].value or array_ops.shape(self._keys)[0])
      self._alignments_size = (self._keys.shape[1].value or
                               array_ops.shape(self._keys)[1])

  @property
  def memory_layer(self):
    return self._memory_layer

  @property
  def query_layer(self):
    return self._query_layer

  @property
  def values(self):
    return self._values

  @property
  def keys(self):
    return self._keys

  @property
  def batch_size(self):
    return self._batch_size

  @property
  def alignments_size(self):
    return self._alignments_size

  @property
  def state_size(self):
    return self._alignments_size

  def initial_alignments(self, batch_size, dtype):
    """Creates the initial alignment values for the `AttentionWrapper` class.

    This is important for AttentionMechanisms that use the previous alignment
    to calculate the alignment at the next time step (e.g. monotonic attention).

    The default behavior is to return a tensor of all zeros.

    Args:
      batch_size: `int32` scalar, the batch_size.
      dtype: The `dtype`.

    Returns:
      A `dtype` tensor shaped `[batch_size, alignments_size]`
      (`alignments_size` is the values' `max_time`).
    """
    max_time = self._alignments_size
    return _zero_state_tensors(max_time, batch_size, dtype)

  def initial_state(self, batch_size, dtype):
    """Creates the initial state values for the `AttentionWrapper` class.

    This is important for AttentionMechanisms that use the previous alignment
    to calculate the alignment at the next time step (e.g. monotonic attention).

    The default behavior is to return the same output as initial_alignments.

    Args:
      batch_size: `int32` scalar, the batch_size.
      dtype: The `dtype`.

    Returns:
      A structure of all-zero tensors with shapes as described by `state_size`.
    """
    return self.initial_alignments(batch_size, dtype)

这个类 _BaseAttentionMechanism 是最基本的 attention 类了。可以看到 self._keys 和 self._values 的计算方式都是需要考虑 memory_sequence_length 这个参数的。

有这几个属性:

  • values: 通过 _prepare_memory 函数得到,就是把输入序列 memory 中超过实际长度的部分置为 0

  • keys: self._keys = self.memory_layer(self._values),是在得到 values 之后再做一层全连接的结果,其 shape=[batch, max_times, num_units]

  • state_size 和 alignments_size 是一样的,都是 max_times

  • self._probability_fn(score, prev): 先用 _maybe_mask_score 对 score 做 mask,再交给 probability_fn 计算概率;签名里还带着 prev state,后面会看到默认实现其实没有用到它

_maybe_mask_score

源码:


def _maybe_mask_score(score, memory_sequence_length, score_mask_value):
  if memory_sequence_length is None:
    return score
  message = ("All values in memory_sequence_length must greater than zero.")
  with ops.control_dependencies(
      [check_ops.assert_positive(memory_sequence_length, message=message)]):
    score_mask = array_ops.sequence_mask(
        memory_sequence_length, maxlen=array_ops.shape(score)[1])
    score_mask_values = score_mask_value * array_ops.ones_like(score)
    return array_ops.where(score_mask, score, score_mask_values)


score = tf.random_uniform(shape=[2,10])
tf.shape(score).numpy()

array([ 2, 10], dtype=int32)

score = tf.random_uniform(shape=[2,10])
memory_sequence_len = [5,8]
score_mask_value = -100000000
score_mask = tf.sequence_mask(lengths=memory_sequence_len, maxlen=tf.shape(score)[1])
print("true or false: %s\n" % score_mask)
score_mask_values = score_mask_value * tf.ones_like(score)
print("-inf: %s\n" % score_mask_values)
ans = tf.where(score_mask, score, score_mask_values)
print(ans)

true or false: tf.Tensor(
[[ True  True  True  True  True False False False False False]
 [ True  True  True  True  True  True  True  True False False]], shape=(2, 10), dtype=bool)

-inf: tf.Tensor(
[[-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08
  -1.e+08]
 [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08
  -1.e+08]], shape=(2, 10), dtype=float32)

tf.Tensor(
[[ 2.3987615e-01  4.9896538e-01  7.2822869e-01  4.7516704e-02
   1.6099060e-01 -1.0000000e+08 -1.0000000e+08 -1.0000000e+08
  -1.0000000e+08 -1.0000000e+08]
 [ 3.5503960e-01  2.5502288e-01  8.1264114e-01  4.3110681e-01
   1.1858845e-01  2.5748730e-02  4.8437893e-01  2.8339624e-02
  -1.0000000e+08 -1.0000000e+08]], shape=(2, 10), dtype=float32)

_prepare_memory


self._values = _prepare_memory(
    memory, memory_sequence_length,
    check_inner_dims_defined=check_inner_dims_defined)

其中 _prepare_memory 这个函数就是负责做 mask 的,其实现如下


def _prepare_memory(memory, memory_sequence_length, check_inner_dims_defined):
  """Convert to tensor and possibly mask `memory`.

  Args:
    memory: `Tensor`, shaped `[batch_size, max_time, ...]`.
    memory_sequence_length: `int32` `Tensor`, shaped `[batch_size]`.
    check_inner_dims_defined: Python boolean.  If `True`, the `memory`
      argument's shape is checked to ensure all but the two outermost
      dimensions are fully defined.

  Returns:
    A (possibly masked), checked, new `memory`.

  Raises:
    ValueError: If `check_inner_dims_defined` is `True` and not
    `memory.shape[2:].is_fully_defined()`.
  """
  memory = nest.map_structure(
      lambda m: ops.convert_to_tensor(m, name="memory"), memory)
  if memory_sequence_length is not None:
    memory_sequence_length = ops.convert_to_tensor(
        memory_sequence_length, name="memory_sequence_length")
  if check_inner_dims_defined:
    def _check_dims(m):
      if not m.get_shape()[2:].is_fully_defined():
        raise ValueError("Expected memory %s to have fully defined inner dims, "
                         "but saw shape: %s" % (m.name, m.get_shape()))
    nest.map_structure(_check_dims, memory)
  if memory_sequence_length is None:
    seq_len_mask = None
  else:
    seq_len_mask = array_ops.sequence_mask(
        memory_sequence_length,
        maxlen=array_ops.shape(nest.flatten(memory)[0])[1],
        dtype=nest.flatten(memory)[0].dtype)
    seq_len_batch_size = (
        memory_sequence_length.shape[0].value
        or array_ops.shape(memory_sequence_length)[0])
  def _maybe_mask(m, seq_len_mask):
    rank = m.get_shape().ndims
    rank = rank if rank is not None else array_ops.rank(m)
    extra_ones = array_ops.ones(rank - 2, dtype=dtypes.int32)
    m_batch_size = m.shape[0].value or array_ops.shape(m)[0]
    if memory_sequence_length is not None:
      message = ("memory_sequence_length and memory tensor batch sizes do not "
                 "match.")
      with ops.control_dependencies([
          check_ops.assert_equal(
              seq_len_batch_size, m_batch_size, message=message)]):
        seq_len_mask = array_ops.reshape(
            seq_len_mask,
            array_ops.concat((array_ops.shape(seq_len_mask), extra_ones), 0))
        return m * seq_len_mask
    else:
      return m
  return nest.map_structure(lambda m: _maybe_mask(m, seq_len_mask), memory)

_prepare_memory 其实很简单,就是根据 batch 中每个样本的实际长度,将超出部分设置为 0
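可以用 eager 简单验证一下这个行为(假设已经 enable_eager_execution(),下面的 memory 和长度都是随手构造的,只是模拟 _prepare_memory 的效果,不是调用它本身):

memory = tf.ones(shape=[2, 4, 3])            # [batch, max_time, depth]
memory_sequence_len = [2, 3]                 # 每个样本的实际长度

# 和 _prepare_memory 等价的做法:sequence_mask 得到 [batch, max_time] 的 0/1 矩阵,
# 再扩展一个维度与 memory 相乘
mask = tf.sequence_mask(memory_sequence_len, maxlen=4, dtype=memory.dtype)
masked_memory = memory * tf.expand_dims(mask, -1)
print(masked_memory.numpy()[0])              # 后两个时间步全为 0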

tf.contrib.seq2seq.BahdanauAttention

这里涉及到了两篇 paper:


class BahdanauAttention(_BaseAttentionMechanism):
  """Implements Bahdanau-style (additive) attention.

  This attention has two forms.  The first is Bahdanau attention,
  as described in:

    Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
    "Neural Machine Translation by Jointly Learning to Align and Translate."
    ICLR 2015. https://arxiv.org/abs/1409.0473

  The second is the normalized form.  This form is inspired by the
  weight normalization article:

    Tim Salimans, Diederik P. Kingma.
    "Weight Normalization: A Simple Reparameterization to Accelerate
     Training of Deep Neural Networks."
    https://arxiv.org/abs/1602.07868

  To enable the second form, construct the object with parameter
  `normalize=True`.
  """

  def __init__(self,
               num_units,
               memory,
               memory_sequence_length=None,
               normalize=False,
               probability_fn=None,
               score_mask_value=None,
               dtype=None,
               name="BahdanauAttention"):
    """Construct the Attention mechanism.

    Args:
      num_units: The depth of the query mechanism.
      memory: The memory to query; usually the output of an RNN encoder.  This
        tensor should be shaped `[batch_size, max_time, ...]`.
      memory_sequence_length (optional): Sequence lengths for the batch entries
        in memory.  If provided, the memory tensor rows are masked with zeros
        for values past the respective sequence lengths.
      normalize: Python boolean.  Whether to normalize the energy term.
      probability_fn: (optional) A `callable`.  Converts the score to
        probabilities.  The default is @{tf.nn.softmax}. Other options include
        @{tf.contrib.seq2seq.hardmax} and @{tf.contrib.sparsemax.sparsemax}.
        Its signature should be: `probabilities = probability_fn(score)`.
      score_mask_value: (optional): The mask value for score before passing into
        `probability_fn`. The default is -inf. Only used if
        `memory_sequence_length` is not None.
      dtype: The data type for the query and memory layers of the attention
        mechanism.
      name: Name to use when creating ops.
    """

  • num_units 是 query mechanism 的维度,它既可以不等于 query 的维度,也可以不等于 memory 的维度。

  • query 的维度要和 memory(也就是 keys/values)的维度一致吗?是不需要的。在 BahdanauAttention 的实现中比较好理解:两个全连接把它们都投影到 num_units,维度一致即可相加;但是在 LuongAttention 中做矩阵相乘时就需要注意维度的变化。

  • memory_sequence_length: 这个参数很重要,通过 mask 消除 padding 的影响。

  • score_mask_value: 只有上一个参数存在时,这个参数才会被用到,默认为 -inf。

继续看源码的实现:


if probability_fn is None:
  probability_fn = nn_ops.softmax
if dtype is None:
  dtype = dtypes.float32
wrapped_probability_fn = lambda score, _: probability_fn(score)
super(BahdanauAttention, self).__init__(
    query_layer=layers_core.Dense(
        num_units, name="query_layer", use_bias=False, dtype=dtype),
    memory_layer=layers_core.Dense(
        num_units, name="memory_layer", use_bias=False, dtype=dtype),
    memory=memory,
    probability_fn=wrapped_probability_fn,
    memory_sequence_length=memory_sequence_length,
    score_mask_value=score_mask_value,
    name=name)
self._num_units = num_units
self._normalize = normalize
self._name = name

  • 现在理解了 _BaseAttentionMechanism 这个类中 query_layer 和 memory_layer 的意义了.

  • score_mask_value 沿用父类中的计算方式.

继续看 call 函数,也就是 attention 的计算方式


def __call__(self, query, state):
  """Score the query based on the keys and values.

  Args:
    query: Tensor of dtype matching `self.values` and shape
      `[batch_size, query_depth]`.
    state: Tensor of dtype matching `self.values` and shape
      `[batch_size, alignments_size]`
      (`alignments_size` is memory's `max_time`).

  Returns:
    alignments: Tensor of dtype matching `self.values` and shape
      `[batch_size, alignments_size]` (`alignments_size` is memory's
      `max_time`).
  """
  with variable_scope.variable_scope(None, "bahdanau_attention", [query]):
    processed_query = self.query_layer(query) if self.query_layer else query
    score = _bahdanau_score(processed_query, self._keys, self._normalize)
  alignments = self._probability_fn(score, state)
  next_state = alignments
  return alignments, next_state

然后看怎么计算的 score.

score = _bahdanau_score(processed_query, self._keys, self._normalize),其中 processed_query 是 query 经过 query_layer 全连接后的结果,shape 为 [batch, num_units];self._keys 是 memory 经过 memory_layer 全连接后的结果,shape 为 [batch, alignments_size, num_units]。


def _bahdanau_score(processed_query, keys, normalize):
  """Implements Bahdanau-style (additive) scoring function."""
  dtype = processed_query.dtype
  # Get the number of hidden units from the trailing dimension of keys
  num_units = keys.shape[2].value or array_ops.shape(keys)[2]
  # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
  processed_query = array_ops.expand_dims(processed_query, 1)
  v = variable_scope.get_variable(
      "attention_v", [num_units], dtype=dtype)
  if normalize:
    # Scalar used in weight normalization
    g = variable_scope.get_variable(
        "attention_g", dtype=dtype,
        initializer=init_ops.constant_initializer(math.sqrt((1. / num_units))),
        shape=())
    # Bias added prior to the nonlinearity
    b = variable_scope.get_variable(
        "attention_b", [num_units], dtype=dtype,
        initializer=init_ops.zeros_initializer())
    # normed_v = g * v / ||v||
    normed_v = g * v * math_ops.rsqrt(
        math_ops.reduce_sum(math_ops.square(v)))
    return math_ops.reduce_sum(
        normed_v * math_ops.tanh(keys + processed_query + b), [2])
  else:
    return math_ops.reduce_sum(v * math_ops.tanh(keys + processed_query), [2])

源码中计算 score 的最后一步不是全连接,而是这样的:


v = tf.get_variable("attention_v", [num_units])

score = tf.reduce_sum(v * tanh(keys + processed_query), [2])
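仿照后面对 _luong_score 的手动复现,这里也可以用一小段代码把 _bahdanau_score 的 shape 变化走一遍(维度是假设的,attention_v 直接用随机张量代替,并不是真实的变量):

import tensorflow as tf

batch_size, max_times, num_units = 2, 10, 32

processed_query = tf.random_normal(shape=[batch_size, num_units])    # 相当于 query_layer(query)
keys = tf.random_normal(shape=[batch_size, max_times, num_units])    # 相当于 memory_layer(values)
v = tf.random_normal(shape=[num_units])                              # 代替变量 attention_v

processed_query = tf.expand_dims(processed_query, 1)                 # [batch, 1, num_units],靠广播与 keys 相加
score = tf.reduce_sum(v * tf.tanh(keys + processed_query), [2])      # [batch, max_times]
print(score.shape)                                                   # (2, 10)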


import tensorflow as tf
import numpy as np
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()
print(tfe.executing_eagerly())

memory = tf.ones(shape=[1, 10, 5]) # batch=1, max_sequence_len=10, embed_size=5
memory_sequence_len = [5] # 有效长度为 5
attention_mechnism = tf.contrib.seq2seq.BahdanauAttention(num_units=32, memory=memory,
                                                          memory_sequence_length=memory_sequence_len)

True

print(attention_mechnism.state_size, attention_mechnism.alignments_size)

10 10

memory

<tf.Tensor: id=3, shape=(1, 10, 5), dtype=float32, numpy=
array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]]], dtype=float32)>

attention_mechnism.values # 可以发现 values 就是把 memory 中超过memory_sequence_length 的部分变为 0

<tf.Tensor: id=30, shape=(1, 10, 5), dtype=float32, numpy=
array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]], dtype=float32)>

print(attention_mechnism.keys.shape) # 经过了全连接之后的
attention_mechnism.keys.numpy()[0,1,:]

(1, 10, 32)

array([ 0.09100786,  0.18448338, -0.7751561 ,  0.00775184,  0.467805  ,
        0.9172474 ,  0.57645243, -0.3915946 , -0.22213435,  0.76866853,
        0.3591721 ,  0.8922573 ,  0.15866229,  0.6033571 ,  0.51816225,
        0.3820553 , -0.39130217,  0.04532939, -0.02089322,  0.6878175 ,
       -0.28697258,  0.59283376, -0.37825382, -0.5865691 ,  0.17466056,
       -0.5915747 ,  0.6070496 , -0.18531135, -0.821724  ,  1.2838829 ,
        0.15700272, -0.2608306 ], dtype=float32)

print(attention_mechnism.query_layer, attention_mechnism.memory_layer)

<tensorflow.python.layers.core.Dense object at 0x7fa0464da908> <tensorflow.python.layers.core.Dense object at 0x7fa0464dab38>

# 利用 call 函数来计算 alignments 和下一个 state
query = tf.ones(shape=[1, 8]) # query_depth = 8,不需要等于 num_units
state_h0 = attention_mechnism.initial_alignments(batch_size=1, dtype=tf.float32)
attention_vector = attention_mechnism(query=query, state=state_h0) # 返回 (alignments, next_state)


print(attention_vector)

(<tf.Tensor: id=125, shape=(1, 10), dtype=float32, numpy=array([[0.2, 0.2, 0.2, 0.2, 0.2, 0. , 0. , 0. , 0. , 0. ]], dtype=float32)>, <tf.Tensor: id=125, shape=(1, 10), dtype=float32, numpy=array([[0.2, 0.2, 0.2, 0.2, 0.2, 0. , 0. , 0. , 0. , 0. ]], dtype=float32)>)

tf.contrib.seq2seq.LuongAttention

paper: Effective Approaches to Attention-based Neural Machine Translation, EMNLP 2015.


class LuongAttention(_BaseAttentionMechanism):
  """Implements Luong-style (multiplicative) attention scoring."""

  def __init__(self,
               num_units,
               memory,
               memory_sequence_length=None,
               scale=False,
               probability_fn=None,
               score_mask_value=None,
               dtype=None,
               name="LuongAttention"):
    if probability_fn is None:
      probability_fn = nn_ops.softmax
    if dtype is None:
      dtype = dtypes.float32
    wrapped_probability_fn = lambda score, _: probability_fn(score)
    super(LuongAttention, self).__init__(
        query_layer=None,
        memory_layer=layers_core.Dense(
            num_units, name="memory_layer", use_bias=False, dtype=dtype),
        memory=memory,
        probability_fn=wrapped_probability_fn,
        memory_sequence_length=memory_sequence_length,
        score_mask_value=score_mask_value,
        name=name)
    self._num_units = num_units
    self._scale = scale
    self._name = name



可以发现 query 没有经过 query_layer 的处理,也就是没有做全连接;但是 memory 还是要用 memory_layer 做全连接,得到 [batch, max_times, num_units]。

再看使用 call 函数计算对齐概率 alignments 和 next_state。


def __call__(self, query, state):
  """Score the query based on the keys and values.

  Args:
    query: Tensor of dtype matching `self.values` and shape
      `[batch_size, query_depth]`.
    state: Tensor of dtype matching `self.values` and shape
      `[batch_size, alignments_size]`
      (`alignments_size` is memory's `max_time`).

  Returns:
    alignments: Tensor of dtype matching `self.values` and shape
      `[batch_size, alignments_size]` (`alignments_size` is memory's
      `max_time`).
  """
  with variable_scope.variable_scope(None, "luong_attention", [query]):
    score = _luong_score(query, self._keys, self._scale)
  alignments = self._probability_fn(score, state)
  next_state = alignments
  return alignments, next_state

接下来看怎么计算的 score


def _luong_score(query, keys, scale):
  """Implements Luong-style (multiplicative) scoring function.

  Args:
    query: Tensor, shape `[batch_size, num_units]` to compare to keys.
    keys: Processed memory, shape `[batch_size, max_time, num_units]`.
    scale: Whether to apply a scale to the score function.

  Returns:
    A `[batch_size, max_time]` tensor of unnormalized score values.

  Raises:
    ValueError: If `key` and `query` depths do not match.
  """
  depth = query.get_shape()[-1]
  key_units = keys.get_shape()[-1]
  if depth != key_units:
    raise ValueError(
        "Incompatible or unknown inner dimensions between query and keys. "
        "Query (%s) has units: %s. Keys (%s) have units: %s. "
        "Perhaps you need to set num_units to the keys' dimension (%s)?"
        % (query, depth, keys, key_units, key_units))
  dtype = query.dtype

  # Reshape from [batch_size, depth] to [batch_size, 1, depth]
  # for matmul.
  query = array_ops.expand_dims(query, 1)

  # Inner product along the query units dimension.
  # matmul shapes: query is [batch_size, 1, depth] and
  #                keys is [batch_size, max_time, depth].
  # the inner product is asked to **transpose keys' inner shape** to get a
  # batched matmul on:
  #   [batch_size, 1, depth] . [batch_size, depth, max_time]
  # resulting in an output shape of:
  #   [batch_size, 1, max_time].
  # we then squeeze out the center singleton dimension.
  score = math_ops.matmul(query, keys, transpose_b=True)
  score = array_ops.squeeze(score, [1])

  if scale:
    # Scalar used in weight scaling
    g = variable_scope.get_variable(
        "attention_g", dtype=dtype,
        initializer=init_ops.ones_initializer, shape=())
    score = g * score
  return score



通过源码可以发现 LuongAttention 调用 call 函数时,其 query 的维度必须是 num_units,而 BahdanauAttention 并不需要。

其计算 score 的方式如下:


batch_size = 2
query_depth = num_units = 32
memory_depth = 15
max_times = 10
embed_size = 5
scale = True

query = tf.random_normal(shape=[batch_size, num_units])
# memory = tf.random_normal(shape=[batch_size, max_times, memory_depth])
# values = _prepare_memory(memory)
# keys = memory_layer(values)
values = tf.random_normal(shape=[batch_size, max_times, memory_depth])
keys = tf.layers.dense(inputs=values, units=num_units) # [batch, max_times, num_units]

query = tf.expand_dims(query, axis=1) # [batch, 1, num_units]
score = tf.matmul(query, keys, transpose_b=True) # [batch, 1, max_times]

score = tf.squeeze(score, axis=1) # [batch, max_times]
print(score.shape)

(2, 10)

### 完整的过一遍
memory = tf.random_normal(shape=[batch_size, max_times, memory_depth])
memory_sequence_len = [5,8]
query_len = 5

query = tf.random_normal(shape=[batch_size, num_units])
state = tf.zeros(shape=[batch_size, max_times])
attention_mechnism = tf.contrib.seq2seq.LuongAttention(num_units=num_units,
                                                       memory=memory,
                                                       memory_sequence_length=memory_sequence_len)
attention_vector = attention_mechnism(query, state)
attention_vector[0], attention_vector[1] # alignments 和 next_state

(<tf.Tensor: id=1024, shape=(2, 10), dtype=float32, numpy=
 array([[3.6951914e-01, 5.4255807e-01, 2.4851409e-03, 1.8003594e-02,
         6.7433923e-02, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [3.5050314e-05, 1.2835214e-02, 1.6479974e-03, 1.4405438e-06,
         3.3324495e-01, 6.4109474e-01, 1.0316775e-02, 8.2380348e-04,
         0.0000000e+00, 0.0000000e+00]], dtype=float32)>,
 <tf.Tensor: id=1024, shape=(2, 10), dtype=float32, numpy=
 array([[3.6951914e-01, 5.4255807e-01, 2.4851409e-03, 1.8003594e-02,
         6.7433923e-02, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [3.5050314e-05, 1.2835214e-02, 1.6479974e-03, 1.4405438e-06,
         3.3324495e-01, 6.4109474e-01, 1.0316775e-02, 8.2380348e-04,
         0.0000000e+00, 0.0000000e+00]], dtype=float32)>)

tf.reduce_sum(attention_vector[0][1]).numpy()

1.0

这只是针对单个 query 的情况,但实际上 query 一般是这样的 [batch, query_len, num_units],那怎么办呢?
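我的理解是:attention 本来就是按 decoder 的时间步逐步调用的(训练时一般交给 AttentionWrapper + dynamic_decode 处理),[batch, query_len, num_units] 形式的 query 需要自己在时间维度上循环。下面是一个手动循环的草图,变量沿用上面 LuongAttention 的例子,query_len 是假设的:

query_len = 4
queries = tf.random_normal(shape=[batch_size, query_len, num_units])

state = attention_mechnism.initial_state(batch_size=batch_size, dtype=tf.float32)
all_alignments = []
for t in range(query_len):                          # 对每个 decoder 时间步单独做一次 attention
    alignments, state = attention_mechnism(queries[:, t, :], state)
    all_alignments.append(alignments)
all_alignments = tf.stack(all_alignments, axis=1)   # [batch, query_len, max_times]
print(all_alignments.shape)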

总结

最后总结一下

再看一遍两个 attention 初始化的差异


super(BahdanauAttention, self).__init__(
    query_layer=layers_core.Dense(
        num_units, name="query_layer", use_bias=False, dtype=dtype),
    memory_layer=layers_core.Dense(
        num_units, name="memory_layer", use_bias=False, dtype=dtype),
    memory=memory,
    probability_fn=wrapped_probability_fn,
    memory_sequence_length=memory_sequence_length,
    score_mask_value=score_mask_value,
    name=name)


super(LuongAttention, self).__init__(
    query_layer=None,
    memory_layer=layers_core.Dense(
        num_units, name="memory_layer", use_bias=False, dtype=dtype),
    memory=memory,
    probability_fn=wrapped_probability_fn,
    memory_sequence_length=memory_sequence_length,
    score_mask_value=score_mask_value,
    name=name)

作为类对象时,AttentionMechanism、BahdanauAttention、LuongAttention 具有如下属性:

  • query_layer: 在 BahdanauAttention 中是 tf.layers.Dense 的实例,输出维度是 num_units,所以 BahdanauAttention 中 query 的维度可以是任意值;而 LuongAttention 中 query_layer 为 None,所以 query 的维度只能是 num_units。

  • memory_layer: 在两个 attention 中都一样,是 tf.layers.Dense,输出维度为 num_units。

  • alignments_size: 对齐的 size,是 memory 的 max_times。

  • batch_size: 批量大小

  • values: 是经过 mask 处理后的 memory,[batch, max_times, embed_size]

  • keys: 是 values 经过 memory_layer 全连接处理后的结果,[batch, max_times, num_units]

  • state_size: 等于 alignments_size

然后是对应的方法:

init: 初始化类实例,里面的参数:

  • num_units: 在 Bahdanau 中这个参数其实是个中间值,将 query 和 keys 转化为这个维度,叠加,但最后还是要在这个维度上 reduce_sum. 但是在 LuongAttention 中它必须和 query 的维度一致,然后和 memory_layer 处理后的 memory 做矩阵相乘。

  • memory: [batch, max_times, embed_size]

  • normalize: 是否做归一化(weight normalization)

  • probability_fn: tf.nn.softmax、tf.contrib.seq2seq.hardmax 或 tf.contrib.sparsemax.sparsemax

  • memory_sequence_length: memory 在 padding 之前的实际长度,其 shape 是 [batch_size]

call(query, state) 调用该实例

  • query: [batch_size, query_depth]。在 LuongAttention 中 query_depth 必须等于 num_units。

  • state: [batch_size, alignments_size].

一直不太理解 state 有啥用?在源码中是用来计算 alignments 的:


alignments = self._probability_fn(score, state)

self._probability_fn = lambda score, prev: (  # pylint:disable=g-long-lambda
    probability_fn(
        _maybe_mask_score(score, memory_sequence_length, score_mask_value),
        prev))

# 其中 score 是可能需要 mask 的. probability_fn 是 tf.nn.softmax. 所以呢????不需要 prev 啊?

# 然后发现确实不需要啊。。。一步步往上找

probability_fn=wrapped_probability_fn

wrapped_probability_fn = lambda score, _: probability_fn(score)





initial_alignments(batch_size, dtype) 初始化对齐

Args:

  • batch_size: int32 scalar, the batch_size.

  • dtype: The dtype.

Returns:

  • A dtype tensor shaped [batch_size, alignments_size]

initial_state(batch_size, dtype):

Creates the initial state values for the AttentionWrapper class.

  • batch_size: int32.

Tensorflow RNN API 源码阅读

在三星研究院实习一段时间,发现在公司写代码和在学校还是有差别的。一是在公司要追求效率,会使用很多官方封装好的 api;而在学校的时候因为要理解内部原理,更多的是在造轮子,导致对很多 api 不是很熟悉。但实际上官方 api 不管在速度还是全面性上都比自己写的好很多。二是,公司对代码的复用率要求比较高:模型跑到哪一个版本、对应的参数都要留下来,随时可以跑起来而不用重新训练,所以模型和参数的保存很重要;在测试集上的性能指标也要在代码里完整地算出来,而不是只看看 loss 和 accuracy 就行。

这节内容主要是详细过一遍 tensorflow 里面的 rnn api,根据RNN and Cells (contrib)这里的顺序逐步深入研究

先回顾一下 RNN/LSTM/GRU:

参考之前 cs224d 的笔记

basic rnn:

$$h_t = \sigma(W_{hh}h_{t-1}+W_{hx}x_t)$$

先看 tf.contrib.rnn.RNNCell

https://github.com/tensorflow/tensorflow/blob/4dcfddc5d12018a5a0fdca652b9221ed95e9eb23/tensorflow/python/ops/rnn_cell_impl.py#L170


@tf_export("nn.rnn_cell.RNNCell")

class RNNCell(base_layer.Layer):

"""Abstract object representing an RNN cell.

Every `RNNCell` must have the properties below and implement `call` with

the signature `(output, next_state) = call(input, state)`.



RNNCell 是一个抽象的父类,之后更复杂的 RNN/LSTM/GRU 都是重新实现 call 函数,也就是更新隐藏状态

的方式改变了。



The optional third input argument, `scope`, is allowed for backwards compatibility

purposes; but should be left off for new subclasses.



scope 这个参数管理变量,在反向传播中变量是否可训练。



This definition of cell differs from the definition used in the literature.

In the literature, 'cell' refers to an object with a single scalar output.

This definition refers to a horizontal array of such units.



这里的 cell 的概念和一些论文中是不一样的。在论文中,cell 表示一个神经元,也就是单个值。而这里表示的是

一组神经元,比如隐藏状态[batch, num_units].



An RNN cell, in the most abstract setting, is anything that has

a state and performs some operation that takes a matrix of inputs.

This operation results in an output matrix with `self.output_size` columns.

If `self.state_size` is an integer, this operation also results in a new

state matrix with `self.state_size` columns. If `self.state_size` is a

(possibly nested tuple of) TensorShape object(s), then it should return a

matching structure of Tensors having shape `[batch_size].concatenate(s)`

for each `s` in `self.batch_size`.



rnn cell 的输入是一个状态 state 和 input 矩阵,参数有 self.output_size 和 self.state_size.

分别表示输出层和隐藏层的维度。其中 state_size 可能是 tuple,这个之后在看。

"""



def __call__(self, inputs, state, scope=None):

"""Run this RNN cell on inputs, starting from the given state.

Args:

inputs: `2-D` tensor with shape `[batch_size, input_size]`.

state: if `self.state_size` is an integer, this should be a `2-D Tensor`

with shape `[batch_size, self.state_size]`. Otherwise, if

`self.state_size` is a tuple of integers, this should be a tuple

with shapes `[batch_size, s] for s in self.state_size`.

scope: VariableScope for the created subgraph; defaults to class name.

Returns:

A pair containing:

- Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.

- New state: Either a single `2-D` tensor, or a tuple of tensors matching

the arity and shapes of `state`.

"""

if scope is not None:

with vs.variable_scope(scope,

custom_getter=self._rnn_get_variable) as scope:

return super(RNNCell, self).__call__(inputs, state, scope=scope)

else:

scope_attrname = "rnncell_scope"

scope = getattr(self, scope_attrname, None)

if scope is None:

scope = vs.variable_scope(vs.get_variable_scope(),

custom_getter=self._rnn_get_variable)

setattr(self, scope_attrname, scope)

with scope:

return super(RNNCell, self).__call__(inputs, state)



def _rnn_get_variable(self, getter, *args, **kwargs):

variable = getter(*args, **kwargs)

if context.executing_eagerly():

trainable = variable._trainable # pylint: disable=protected-access

else:

trainable = (

variable in tf_variables.trainable_variables() or

(isinstance(variable, tf_variables.PartitionedVariable) and

list(variable)[0] in tf_variables.trainable_variables()))

if trainable and variable not in self._trainable_weights:

self._trainable_weights.append(variable)

elif not trainable and variable not in self._non_trainable_weights:

self._non_trainable_weights.append(variable)

return variable



@property

def state_size(self):

"""size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers

or TensorShapes.

"""

raise NotImplementedError("Abstract method")



@property

def output_size(self):

"""Integer or TensorShape: size of outputs produced by this cell."""

raise NotImplementedError("Abstract method")



def build(self, _):

# This tells the parent Layer object that it's OK to call

# self.add_variable() inside the call() method.

pass



def zero_state(self, batch_size, dtype):

"""Return zero-filled state tensor(s).

Args:

batch_size: int, float, or unit Tensor representing the batch size.

dtype: the data type to use for the state.

Returns:

If `state_size` is an int or TensorShape, then the return value is a

`N-D` tensor of shape `[batch_size, state_size]` filled with zeros.

If `state_size` is a nested list or tuple, then the return value is

a nested list or tuple (of the same structure) of `2-D` tensors with

the shapes `[batch_size, s]` for each s in `state_size`.

"""

# Try to use the last cached zero_state. This is done to avoid recreating

# zeros, especially when eager execution is enabled.

state_size = self.state_size

is_eager = context.executing_eagerly()

if is_eager and hasattr(self, "_last_zero_state"):

(last_state_size, last_batch_size, last_dtype,

last_output) = getattr(self, "_last_zero_state")

if (last_batch_size == batch_size and

last_dtype == dtype and

last_state_size == state_size):

return last_output

with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):

output = _zero_state_tensors(state_size, batch_size, dtype)

if is_eager:

self._last_zero_state = (state_size, batch_size, dtype, output)

return output



两个属性 output_size、state_size 分别表示输出层的维度和隐藏层的维度。call 函数用来计算下一个时间步的隐藏状态和输出,zero_state 函数用来把初始状态全部初始化为 0。这里 state_size 有两种情况:一种是 int 或 TensorShape,初始状态就是 [batch, state_size];如果是多层嵌套 rnn(state_size 是 tuple),那么初始状态是 [batch, s] for s in state_size。
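为了体会这个抽象,下面写一个最小的自定义 cell 草图:只要实现 state_size、output_size、build 和 call,就能像内置 cell 一样使用。这不是 tf 源码,类名和实现都是仿照 BasicRNNCell 自己编的例子:

class MyEchoCell(tf.nn.rnn_cell.RNNCell):
    """一个最小的自定义 cell: output = new_state = tanh([x, h] @ kernel + bias)."""

    def __init__(self, num_units):
        super(MyEchoCell, self).__init__()
        self._num_units = num_units

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def build(self, inputs_shape):
        input_depth = inputs_shape[1].value
        # 仿照 BasicRNNCell,把 W 和 U 拼成一个 kernel
        self._kernel = self.add_variable(
            "kernel", shape=[input_depth + self._num_units, self._num_units])
        self._bias = self.add_variable(
            "bias", shape=[self._num_units],
            initializer=tf.zeros_initializer())
        self.built = True

    def call(self, inputs, state):
        gate = tf.matmul(tf.concat([inputs, state], 1), self._kernel) + self._bias
        output = tf.tanh(gate)
        return output, output


# 用法与内置 cell 相同
cell = MyEchoCell(num_units=16)
x = tf.random_normal([8, 10])
h0 = cell.zero_state(batch_size=8, dtype=tf.float32)
out, new_h = cell(x, h0)
print(out.shape, new_h.shape)   # (8, 16) (8, 16)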


class LayerRNNCell(RNNCell):
  """Subclass of RNNCells that act like proper `tf.Layer` objects."""

  def __call__(self, inputs, state, scope=None, *args, **kwargs):
    """Run this RNN cell on inputs, starting from the given state.

    Args:
      inputs: `2-D` tensor with shape `[batch_size, input_size]`.
      state: if `self.state_size` is an integer, this should be a `2-D Tensor`
        with shape `[batch_size, self.state_size]`.  Otherwise, if
        `self.state_size` is a tuple of integers, this should be a tuple
        with shapes `[batch_size, s] for s in self.state_size`.
      scope: optional cell scope.
      *args: Additional positional arguments.
      **kwargs: Additional keyword arguments.

    Returns:
      A pair containing:
      - Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.
      - New state: Either a single `2-D` tensor, or a tuple of tensors matching
        the arity and shapes of `state`.
    """
    # Bypass RNNCell's variable capturing semantics for LayerRNNCell.
    # Instead, it is up to subclasses to provide a proper build
    # method.  See the class docstring for more details.
    return base_layer.Layer.__call__(self, inputs, state, scope=scope,
                                     *args, **kwargs)

再看 Core RNN Cells

  • tf.contrib.rnn.BasicRNNCell

  • tf.contrib.rnn.BasicLSTMCell

  • tf.contrib.rnn.GRUCell

  • tf.contrib.rnn.LSTMCell

  • tf.contrib.rnn.LayerNormBasicLSTMCell

tf.contrib.rnn.BasicRNNCell

直接扒源码:


@tf_export("nn.rnn_cell.BasicRNNCell")

class BasicRNNCell(LayerRNNCell):

"""The most basic RNN cell.

Args:

num_units: int, The number of units in the RNN cell.

activation: Nonlinearity to use. Default: `tanh`.

reuse: (optional) Python boolean describing whether to reuse variables

in an existing scope. If not `True`, and the existing scope already has

the given variables, an error is raised.

name: String, the name of the layer. Layers with the same name will

share weights, but to avoid mistakes we require reuse=True in such

cases.

dtype: Default dtype of the layer (default of `None` means use the type

of the first input). Required when `build` is called before `call`.

"""



def __init__(self,

num_units,

activation=None,

reuse=None,

name=None,

dtype=None):

super(BasicRNNCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)



# Inputs must be 2-dimensional.

self.input_spec = base_layer.InputSpec(ndim=2)



self._num_units = num_units

self._activation = activation or math_ops.tanh



@property

def state_size(self):

return self._num_units



@property

def output_size(self):

return self._num_units



def build(self, inputs_shape):

if inputs_shape[1].value is None:

raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"

% inputs_shape)



input_depth = inputs_shape[1].value

self._kernel = self.add_variable(

_WEIGHTS_VARIABLE_NAME,

shape=[input_depth + self._num_units, self._num_units])

self._bias = self.add_variable(

_BIAS_VARIABLE_NAME,

shape=[self._num_units],

initializer=init_ops.zeros_initializer(dtype=self.dtype))



self.built = True



def call(self, inputs, state):

"""Most basic RNN: output = new_state = act(W * input + U * state + B)."""



gate_inputs = math_ops.matmul(

array_ops.concat([inputs, state], 1), self._kernel)

gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)

output = self._activation(gate_inputs)

return output, output





可以发现 state_size = output_size = num_units,并且输出就是下一个隐藏状态:output = new_state = act(W * input + U * state + B) = act([input, state] 拼接后乘以 self._kernel 再加偏置),其中 self._kernel 相当于把 W 和 U 拼在一起,维度是 [input_depth + num_units, num_units]。


import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()
print(tfe.executing_eagerly())

True

cell = tf.contrib.rnn.BasicRNNCell(num_units=128, activation=None)
print(cell.state_size, cell.output_size)

128 128

inputs = tf.random_normal(shape=[32, 100], dtype=tf.float32)
h0 = cell.zero_state(batch_size=32, dtype=tf.float32)
output, state = cell(inputs=inputs, state=h0)
print(output.shape, state.shape)

(32, 128) (32, 128)
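可以顺手验证一下上面 call 的公式(这里假设 cell.variables 返回的顺序是 kernel、bias,以实际打印为准):

import numpy as np

kernel, bias = cell.variables          # 变量在第一次调用之后才会被创建
manual = tf.tanh(tf.matmul(tf.concat([inputs, h0], 1), kernel) + bias)
print(np.allclose(manual.numpy(), output.numpy()))   # 期望输出 True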

tf.contrib.rnn.BasicLSTMCell

先回顾下 LSTM:

自己试着手敲公式~ 看着图还是简单,不看图是否也可以呢?

三个gate:遗忘门,输入/更新门,输出门

$$f_t=\sigma(W^{f}x_t + U^{f}h_{t-1})$$

$$i_t=\sigma(W^{i}x_t + U^{i}h_{t-1})$$

$$o_t=\sigma(W^{o}x_t + U^{o}h_{t-1})$$

new memory cell:

$$\hat c=tanh(W^cx_t + U^ch_{t-1})$$

输入门作用于新的记忆细胞,遗忘门作用于上一个记忆细胞,并得到最终的记忆细胞:

$$c_t=f_t\circ c_{t-1} + i_t\circ\hat c$$

用新的memory cell 和输出门得到新的隐藏状态:

$$h_t = o_t\circ \tanh(c_t)$$


class BasicLSTMCell(LayerRNNCell):
  """Basic LSTM recurrent network cell.

  The implementation is based on: http://arxiv.org/abs/1409.2329.

  We add forget_bias (default: 1) to the biases of the forget gate in order to
  reduce the scale of forgetting in the beginning of the training.

  It does not allow cell clipping, a projection layer, and does not
  use peep-hole connections: it is the basic baseline.

  For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell}
  that follows.
  """

  def __init__(self,
               num_units,
               forget_bias=1.0,
               state_is_tuple=True,
               activation=None,
               reuse=None,
               name=None,
               dtype=None):
    """Initialize the basic LSTM cell.

    Args:
      num_units: int, The number of units in the LSTM cell.
      forget_bias: float, The bias added to forget gates (see above).
        Must set to `0.0` manually when restoring from CudnnLSTM-trained
        checkpoints.
      state_is_tuple: If True, accepted and returned states are 2-tuples of
        the `c_state` and `m_state`.  If False, they are concatenated
        along the column axis.  The latter behavior will soon be deprecated.
      activation: Activation function of the inner states.  Default: `tanh`.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope.  If not `True`, and the existing scope already has
        the given variables, an error is raised.
      name: String, the name of the layer. Layers with the same name will
        share weights, but to avoid mistakes we require reuse=True in such
        cases.
      dtype: Default dtype of the layer (default of `None` means use the type
        of the first input). Required when `build` is called before `call`.

      When restoring from CudnnLSTM-trained checkpoints, must use
      `CudnnCompatibleLSTMCell` instead.
    """
    super(BasicLSTMCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)
    if not state_is_tuple:
      logging.warn("%s: Using a concatenated state is slower and will soon be "
                   "deprecated.  Use state_is_tuple=True.", self)

    # Inputs must be 2-dimensional.
    self.input_spec = base_layer.InputSpec(ndim=2)

    self._num_units = num_units
    self._forget_bias = forget_bias
    self._state_is_tuple = state_is_tuple
    self._activation = activation or math_ops.tanh

  @property
  def state_size(self):
    return (LSTMStateTuple(self._num_units, self._num_units)
            if self._state_is_tuple else 2 * self._num_units)

  @property
  def output_size(self):
    return self._num_units

  def build(self, inputs_shape):
    if inputs_shape[1].value is None:
      raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"
                       % inputs_shape)

    input_depth = inputs_shape[1].value
    h_depth = self._num_units
    self._kernel = self.add_variable(
        _WEIGHTS_VARIABLE_NAME,
        shape=[input_depth + h_depth, 4 * self._num_units])
    self._bias = self.add_variable(
        _BIAS_VARIABLE_NAME,
        shape=[4 * self._num_units],
        initializer=init_ops.zeros_initializer(dtype=self.dtype))

    self.built = True

  def call(self, inputs, state):
    """Long short-term memory cell (LSTM).

    Args:
      inputs: `2-D` tensor with shape `[batch_size, input_size]`.
      state: An `LSTMStateTuple` of state tensors, each shaped
        `[batch_size, num_units]`, if `state_is_tuple` has been set to
        `True`.  Otherwise, a `Tensor` shaped
        `[batch_size, 2 * num_units]`.

    Returns:
      A pair containing the new hidden state, and the new state (either a
        `LSTMStateTuple` or a concatenated state, depending on
        `state_is_tuple`).
    """
    sigmoid = math_ops.sigmoid
    one = constant_op.constant(1, dtype=dtypes.int32)
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
      c, h = state
    else:
      c, h = array_ops.split(value=state, num_or_size_splits=2, axis=one)

    gate_inputs = math_ops.matmul(
        array_ops.concat([inputs, h], 1), self._kernel)
    gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(
        value=gate_inputs, num_or_size_splits=4, axis=one)

    forget_bias_tensor = constant_op.constant(self._forget_bias, dtype=f.dtype)
    # Note that using `add` and `multiply` instead of `+` and `*` gives a
    # performance improvement. So using those at the cost of readability.
    add = math_ops.add
    multiply = math_ops.multiply
    new_c = add(multiply(c, sigmoid(add(f, forget_bias_tensor))),
                multiply(sigmoid(i), self._activation(j)))
    new_h = multiply(self._activation(new_c), sigmoid(o))

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    return new_h, new_state

阅读源码可以发现具体实现与上面的公式还是有点差别的。

  • 先 concat[input, h], 然后 gate_input = matmul(concat[input, h], self._kernel)+self._bias,多了偏置项,这里的矩阵维度 [input_depth + h_depth,4*num_units]. 然后 i,j,f,o = split(gate_input, 4, axis=1). 其中 j 表示 new memory cell. 然后计算 new_c,其中 i,f,o 对应的激活函数确定是 sigmoid,因为其范围只能在(0,1)之间。但是 j 的激活函数self._activation 可以选择,默认是 tanh.

  • 与公式的差别之二在于 self._forget_bias.遗忘门在激活函数 $\sigma$ 之前加了偏置,目的是避免在训练初期丢失太多信息。

  • 要注意 state 的形式,取决于参数 self._state_is_tuple. 其中 c,h=state,表示 $c_{t-1},h_{t-1}$
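结合上面几点,可以用一个小的 eager 草图把『拼接 kernel、一次 matmul、split 成 i/j/f/o』的实现方式复现一遍(维度和权重都是随手造的,不是取自真实 cell 的参数):

batch, input_depth, num_units = 2, 3, 4
forget_bias = 1.0

x = tf.random_normal([batch, input_depth])
c = tf.zeros([batch, num_units])
h = tf.zeros([batch, num_units])
kernel = tf.random_normal([input_depth + num_units, 4 * num_units])
bias = tf.zeros([4 * num_units])

gate_inputs = tf.matmul(tf.concat([x, h], 1), kernel) + bias
i, j, f, o = tf.split(gate_inputs, num_or_size_splits=4, axis=1)

new_c = c * tf.sigmoid(f + forget_bias) + tf.sigmoid(i) * tf.tanh(j)
new_h = tf.tanh(new_c) * tf.sigmoid(o)
print(new_c.shape, new_h.shape)   # (2, 4) (2, 4)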


lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=128, forget_bias=1.0, state_is_tuple=True)

WARNING:tensorflow:From <ipython-input-9-3f4ca183c5d7>:1: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').

提示要更新了,那就换成最新的吧


lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=128, forget_bias=1.0, state_is_tuple=True)


lstm_cell.output_size, lstm_cell.state_size

(128, LSTMStateTuple(c=128, h=128))

h0 = lstm_cell.zero_state(batch_size=30, dtype=tf.float32)

我们发现 lstm 的状态 state 是一个tuple,分别对应 c_t 和 h_t.


class LSTMStateTuple(_LSTMStateTuple):
  """Tuple used by LSTM Cells for `state_size`, `zero_state`, and output state.

  Stores two elements: `(c, h)`, in that order.  Where `c` is the hidden state
  and `h` is the output.
  """

这里的解释感觉是有点问题的,c is the hidden state and h is the output. 看源码


new_c = add(multiply(c, sigmoid(add(f, forget_bias_tensor))),
            multiply(sigmoid(i), self._activation(j)))
new_h = multiply(self._activation(new_c), sigmoid(o))

if self._state_is_tuple:
  new_state = LSTMStateTuple(new_c, new_h)
else:
  new_state = array_ops.concat([new_c, new_h], 1)
return new_h, new_state

发现 c 表示的就是 new memory cell, 而 h 表示的是最后的隐藏状态。


# 计算下一步的 output 和 state
inputs = tf.random_normal(shape=[30, 100], dtype=tf.float32)
output, state = lstm_cell(inputs, h0)


output.shape, state[0].shape, state[1].shape

(TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(128)]))

state.c, state.h # c 和 h 的值是不一样的

(<tf.Tensor: id=108, shape=(30, 128), dtype=float32, numpy=

 array([[ 0.08166471,  0.14020835,  0.07970127, ..., -0.1540019 ,

          0.38848224, -0.0842322 ],

        [-0.03643086, -0.20558938,  0.1503458 , ...,  0.01846285,

          0.15610473,  0.04408235],

        [-0.0933667 ,  0.03454542, -0.09073547, ..., -0.12701994,

         -0.34669587,  0.09373946],

        ...,

        [-0.00752909,  0.22412673, -0.270195  , ...,  0.09341058,

         -0.20986181, -0.18622127],

        [ 0.18778914,  0.37687936, -0.24727295, ..., -0.06409463,

          0.00218048,  0.5940756 ],

        [ 0.04073388, -0.08431841,  0.35944715, ...,  0.14135318,

          0.08472287, -0.11058106]], dtype=float32)>,

 <tf.Tensor: id=111, shape=(30, 128), dtype=float32, numpy=

 array([[ 0.04490132,  0.07412361,  0.03662094, ..., -0.07611651,

          0.17290959, -0.0277745 ],

        [-0.02212535, -0.13554382,  0.08272093, ...,  0.00918258,

          0.0861209 ,  0.02614526],

        [-0.05723168,  0.01372226, -0.02919216, ..., -0.06374882,

         -0.1918035 ,  0.03912015],

        ...,

        [-0.00377504,  0.15181372, -0.14555399, ...,  0.06073361,

         -0.09804281, -0.07492835],

        [ 0.10244624,  0.17440473, -0.09896267, ..., -0.03794969,

          0.00123257,  0.21985768],

        [ 0.01832823, -0.03795732,  0.1654894 , ...,  0.05827027,

          0.02769112, -0.05957894]], dtype=float32)>)

tf.nn.rnn_cell.GRUCell

先回顾下 GRU.

手敲 GRU 公式:

$$r_t=\sigma(W^rx_t + U^rh_{t-1})$$

$$z_t=\sigma(W^zx_t + U^zh_{t-1})$$

$$\tilde h_t = \tanh(Wx_t + U(r_t\circ h_{t-1}))$$

$$h_t=(1-z_t)\circ\tilde h_t + z_t\circ h_{t-1}$$


@tf_export("nn.rnn_cell.GRUCell")

class GRUCell(LayerRNNCell):

"""Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078).

Args:

num_units: int, The number of units in the GRU cell.

activation: Nonlinearity to use. Default: `tanh`.

reuse: (optional) Python boolean describing whether to reuse variables

in an existing scope. If not `True`, and the existing scope already has

the given variables, an error is raised.

kernel_initializer: (optional) The initializer to use for the weight and

projection matrices.

bias_initializer: (optional) The initializer to use for the bias.

name: String, the name of the layer. Layers with the same name will

share weights, but to avoid mistakes we require reuse=True in such

cases.

dtype: Default dtype of the layer (default of `None` means use the type

of the first input). Required when `build` is called before `call`.

"""



def __init__(self,

num_units,

activation=None,

reuse=None,

kernel_initializer=None,

bias_initializer=None,

name=None,

dtype=None):

super(GRUCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)



# Inputs must be 2-dimensional.

self.input_spec = base_layer.InputSpec(ndim=2)



self._num_units = num_units

self._activation = activation or math_ops.tanh

self._kernel_initializer = kernel_initializer

self._bias_initializer = bias_initializer



@property

def state_size(self):

return self._num_units



@property

def output_size(self):

return self._num_units



def build(self, inputs_shape):

if inputs_shape[1].value is None:

raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"

% inputs_shape)



input_depth = inputs_shape[1].value

self._gate_kernel = self.add_variable(

"gates/%s" % _WEIGHTS_VARIABLE_NAME,

shape=[input_depth + self._num_units, 2 * self._num_units],

initializer=self._kernel_initializer)

self._gate_bias = self.add_variable(

"gates/%s" % _BIAS_VARIABLE_NAME,

shape=[2 * self._num_units],

initializer=(

self._bias_initializer

if self._bias_initializer is not None

else init_ops.constant_initializer(1.0, dtype=self.dtype)))

self._candidate_kernel = self.add_variable(

"candidate/%s" % _WEIGHTS_VARIABLE_NAME,

shape=[input_depth + self._num_units, self._num_units],

initializer=self._kernel_initializer)

self._candidate_bias = self.add_variable(

"candidate/%s" % _BIAS_VARIABLE_NAME,

shape=[self._num_units],

initializer=(

self._bias_initializer

if self._bias_initializer is not None

else init_ops.zeros_initializer(dtype=self.dtype)))



self.built = True



def call(self, inputs, state):

"""Gated recurrent unit (GRU) with nunits cells."""



gate_inputs = math_ops.matmul(

array_ops.concat([inputs, state], 1), self._gate_kernel)

gate_inputs = nn_ops.bias_add(gate_inputs, self._gate_bias)



value = math_ops.sigmoid(gate_inputs)

r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)



r_state = r * state



candidate = math_ops.matmul(

array_ops.concat([inputs, r_state], 1), self._candidate_kernel)

candidate = nn_ops.bias_add(candidate, self._candidate_bias)



c = self._activation(candidate)

new_h = u * state + (1 - u) * c

return new_h, new_h





_LSTMStateTuple = collections.namedtuple("LSTMStateTuple", ("c", "h"))

仔细阅读源码可以发现,在 $\sigma$ 计算 gate 以及 tanh 计算 candidate 之前都有偏置项,只是上面的公式里没写出来。而且在不显式指定 bias 初始值时,GRU 中 gate bias 默认初始化为 1,而 LSTM 中 gate bias 默认初始化为 0(再通过 forget_bias 单独给遗忘门加偏置)。
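同样可以把 GRU 的一步用 eager 复现一下,对照上面的 call(权重是随机造的,只看 shape 和计算流程):

batch, input_depth, num_units = 2, 3, 4

x = tf.random_normal([batch, input_depth])
h = tf.zeros([batch, num_units])
gate_kernel = tf.random_normal([input_depth + num_units, 2 * num_units])
gate_bias = tf.ones([2 * num_units])          # 对应默认初始化为 1 的 gate bias
candidate_kernel = tf.random_normal([input_depth + num_units, num_units])
candidate_bias = tf.zeros([num_units])

value = tf.sigmoid(tf.matmul(tf.concat([x, h], 1), gate_kernel) + gate_bias)
r, u = tf.split(value, num_or_size_splits=2, axis=1)   # reset gate 和 update gate

candidate = tf.tanh(
    tf.matmul(tf.concat([x, r * h], 1), candidate_kernel) + candidate_bias)
new_h = u * h + (1 - u) * candidate
print(new_h.shape)   # (2, 4)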


gru_cell = tf.nn.rnn_cell.GRUCell(num_units=128)


gru_cell.state_size, gru_cell.output_size

(128, 128)

h0 = gru_cell.zero_state(batch_size=30, dtype=tf.float32)


inputs = tf.random_normal(shape=[30, 100], dtype=tf.float32)
output, state = gru_cell(inputs, h0)
output.shape, state.shape

(TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(128)]))

A curious thing happens if I instead write:

```python
output, state = gru_cell.call(inputs, h0)  # on a fresh cell this raises an error:
                                           # gru_cell has no attribute _gate_kernel
```

The reason is that calling `.call()` directly bypasses the `Layer.__call__` machinery, and it is `__call__` that invokes `build()` and creates `_gate_kernel` and the other variables. Because the earlier `gru_cell(inputs, h0)` had already built the cell, no error shows up here.
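A minimal eager-mode sketch of the difference, assuming the session above (where `inputs` has shape `[30, 100]`):

```python
# __call__ goes through the Layer machinery: it runs build() first,
# which creates _gate_kernel / _gate_bias, then dispatches to call().
fresh_cell = tf.nn.rnn_cell.GRUCell(num_units=8)

# fresh_cell.call(inputs, fresh_cell.zero_state(30, tf.float32))
#   -> AttributeError: 'GRUCell' object has no attribute '_gate_kernel'

out, st = fresh_cell(inputs, fresh_cell.zero_state(30, tf.float32))  # builds, then calls
out2, st2 = fresh_cell.call(inputs, st)  # fine now: the variables already exist
```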

tf.nn.rnn_cell.LSTMCell, tf.contrib.rnn.LSTMCell


```python
@tf_export("nn.rnn_cell.LSTMCell")
class LSTMCell(LayerRNNCell):
  """Long short-term memory unit (LSTM) recurrent network cell.

  The default non-peephole implementation is based on:
    http://www.bioinf.jku.at/publications/older/2604.pdf
  S. Hochreiter and J. Schmidhuber.
  "Long Short-Term Memory". Neural Computation, 9(8):1735-1780, 1997.

  The peephole implementation is based on:
    https://research.google.com/pubs/archive/43905.pdf
  Hasim Sak, Andrew Senior, and Francoise Beaufays.
  "Long short-term memory recurrent neural network architectures for
   large scale acoustic modeling." INTERSPEECH, 2014.

  The class uses optional peep-hole connections, optional cell clipping, and
  an optional projection layer.
  """

  def __init__(self, num_units,
               use_peepholes=False, cell_clip=None,
               initializer=None, num_proj=None, proj_clip=None,
               num_unit_shards=None, num_proj_shards=None,
               forget_bias=1.0, state_is_tuple=True,
               activation=None, reuse=None, name=None, dtype=None):
    """Initialize the parameters for an LSTM cell."""
    super(LSTMCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)
```

Compared with BasicLSTMCell, LSTMCell adds these four parameters:


```python
    Args:
      use_peepholes: bool, set True to enable diagonal/peephole connections.
      cell_clip: (optional) A float value, if provided the cell state is clipped
        by this value prior to the cell output activation.
      num_proj: (optional) int, The output dimensionality for the projection
        matrices. If None, no projection is performed.
      proj_clip: (optional) A float value. If `num_proj > 0` and `proj_clip` is
        provided, then the projected values are clipped elementwise to within
        `[-proj_clip, proj_clip]`.
```



cell_clip is easy to understand: it clips the cell state elementwise to `[-cell_clip, cell_clip]` before the output activation, which bounds what flows into the output and the state. But what about num_proj?

```python
if num_proj:
  self._state_size = (
      LSTMStateTuple(num_units, num_proj)
      if state_is_tuple else num_units + num_proj)
  self._output_size = num_proj
else:
  self._state_size = (
      LSTMStateTuple(num_units, num_units)
      if state_is_tuple else 2 * num_units)
  self._output_size = num_units
```



From the source you can see that with num_proj set, the hidden state h is passed through an extra fully connected (projection) layer: state_size becomes `LSTMStateTuple(num_units, num_proj)` (or `num_units + num_proj` when `state_is_tuple=False`) and output_size becomes num_proj. proj_clip then clips the output of this projection to `[-proj_clip, proj_clip]`.
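Roughly, writing $o_t$ for the output gate (my notation, paraphrasing the source; the clip only happens when `proj_clip` is given), the emitted hidden state becomes:

$$h_t=\mathrm{clip}\big(W_{proj}\,(o_t\odot\tanh(c_t)),\,-\text{proj\_clip},\,\text{proj\_clip}\big)$$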



The remaining differences between BasicLSTMCell and LSTMCell are the optional peephole connections and cell_clip.

![](https://img-blog.csdn.net/20171201095120010?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvYWJjbGhxMjAwNQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)



The input/output shapes and the state layout are the same; only the internal computation is richer. "Peephole" means that when computing the gates, the cell state is also taken into account: the input and forget gates see $c_{t-1}$, and the output gate sees $c_t$, as in the equations below.
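For reference (my notation; $w_{ci}, w_{cf}, w_{co}$ are the diagonal peephole weights and $[\cdot;\cdot]$ is concatenation), the peephole gates from Sak et al. look roughly like:

$$i_t=\sigma(W_i[x_t;h_{t-1}]+w_{ci}\odot c_{t-1}+b_i),\qquad f_t=\sigma(W_f[x_t;h_{t-1}]+w_{cf}\odot c_{t-1}+b_f)$$

$$o_t=\sigma(W_o[x_t;h_{t-1}]+w_{co}\odot c_t+b_o)$$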





```python
cell = tf.nn.rnn_cell.LSTMCell(num_units=64, cell_clip=0.000000001,
                               num_proj=128, proj_clip=0.001)

cell.state_size, cell.output_size
# (LSTMStateTuple(c=64, h=128), 128)
```

Note that the h part of state_size has changed: it is as if a fully connected layer were appended to state.h at every time step. In a decoder you could set num_proj to the vocabulary size, so the output at each step is a score over the vocabulary and a softmax on top gives the distribution. But then the hidden state $h_{t-1}$ fed back into the next step would also have vocabulary-size dimension, so it is probably better not to use this parameter for that.
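A hedged alternative sketch (the names `vocab_size`, `decoder_cell`, and `projection` are my own illustrative choices, not from the post): keep the cell un-projected and project only the outputs, so the recurrent state keeps its original width.

```python
vocab_size = 10000  # illustrative value
decoder_cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
projection = tf.layers.Dense(vocab_size, use_bias=False)

x = tf.random_normal(shape=[30, 100], dtype=tf.float32)
out, st = decoder_cell(x, decoder_cell.zero_state(30, tf.float32))
logits = projection(out)      # shape [30, vocab_size]
# st.h keeps shape [30, 128], unlike the num_proj variant above.
```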

```python
h0 = cell.zero_state(batch_size=30, dtype=tf.float32)
h0.c.shape, h0.h.shape
# (TensorShape([Dimension(30), Dimension(64)]),
#  TensorShape([Dimension(30), Dimension(128)]))

inputs = tf.ones(shape=[30, 50])

output, state = cell(inputs=inputs, state=h0)

output.shape, state.c.shape, state.h.shape
# (TensorShape([Dimension(30), Dimension(128)]),
#  TensorShape([Dimension(30), Dimension(64)]),
#  TensorShape([Dimension(30), Dimension(128)]))
```

Other RNN wrapper components

Core RNN Cell wrappers (RNNCells that wrap other RNNCells)

  • tf.contrib.rnn.MultiRNNCell

  • tf.contrib.rnn.LSTMBlockWrapper

  • tf.contrib.rnn.DropoutWrapper

  • tf.contrib.rnn.EmbeddingWrapper

  • tf.contrib.rnn.InputProjectionWrapper

  • tf.contrib.rnn.OutputProjectionWrapper

  • tf.contrib.rnn.DeviceWrapper

  • tf.contrib.rnn.ResidualWrapper

We will mainly look at tf.contrib.rnn.MultiRNNCell and tf.contrib.rnn.DropoutWrapper; the others are wrapped so tightly that they are not very flexible, and they are rarely used in practice.

tf.contrib.rnn.MultiRNNCell


```python
class MultiRNNCell(RNNCell):
  """RNN cell composed sequentially of multiple simple cells."""

  def __init__(self, cells, state_is_tuple=True):
    """Create a RNN cell composed sequentially of a number of RNNCells.

    Args:
      cells: list of RNNCells that will be composed in this order.
      state_is_tuple: If True, accepted and returned states are n-tuples, where
        `n = len(cells)`. If False, the states are all
        concatenated along the column axis. This latter behavior will soon be
        deprecated.

    Raises:
      ValueError: if cells is empty (not allowed), or at least one of the cells
        returns a state tuple but the flag `state_is_tuple` is `False`.
    """
    super(MultiRNNCell, self).__init__()
    if not cells:
      raise ValueError("Must specify at least one cell for MultiRNNCell.")
    if not nest.is_sequence(cells):
      raise TypeError(
          "cells must be a list or tuple, but saw: %s." % cells)

    self._cells = cells
    for cell_number, cell in enumerate(self._cells):
      # Add Checkpointable dependencies on these cells so their variables get
      # saved with this object when using object-based saving.
      if isinstance(cell, checkpointable.CheckpointableBase):
        # TODO(allenl): Track down non-Checkpointable callers.
        self._track_checkpointable(cell, name="cell-%d" % (cell_number,))
    self._state_is_tuple = state_is_tuple
    if not state_is_tuple:
      if any(nest.is_sequence(c.state_size) for c in self._cells):
        raise ValueError("Some cells return tuples of states, but the flag "
                         "state_is_tuple is not set. State sizes are: %s"
                         % str([c.state_size for c in self._cells]))

  @property
  def state_size(self):
    if self._state_is_tuple:
      return tuple(cell.state_size for cell in self._cells)
    else:
      return sum([cell.state_size for cell in self._cells])

  @property
  def output_size(self):
    return self._cells[-1].output_size

  def zero_state(self, batch_size, dtype):
    with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
      if self._state_is_tuple:
        return tuple(cell.zero_state(batch_size, dtype) for cell in self._cells)
      else:
        # We know here that state_size of each cell is not a tuple and
        # presumably does not contain TensorArrays or anything else fancy
        return super(MultiRNNCell, self).zero_state(batch_size, dtype)

  def call(self, inputs, state):
    """Run this multi-layer cell on inputs, starting from state."""
    cur_state_pos = 0
    cur_inp = inputs
    new_states = []
    for i, cell in enumerate(self._cells):
      with vs.variable_scope("cell_%d" % i):
        if self._state_is_tuple:
          if not nest.is_sequence(state):
            raise ValueError(
                "Expected state to be a tuple of length %d, but received: %s" %
                (len(self.state_size), state))
          cur_state = state[i]
        else:
          cur_state = array_ops.slice(state, [0, cur_state_pos],
                                      [-1, cell.state_size])
          cur_state_pos += cell.state_size
        cur_inp, new_state = cell(cur_inp, cur_state)
        new_states.append(new_state)

    new_states = (tuple(new_states) if self._state_is_tuple else
                  array_ops.concat(new_states, 1))

    return cur_inp, new_states
```

The `cells` argument is a list or tuple of RNNCell objects. As before, this still only computes the state for a single time step.

For now, ignore bidirectional RNNs and only consider depth, i.e. stacking cells with MultiRNNCell.

```python
num_units = [64, 128]
stack_rnns = [tf.nn.rnn_cell.BasicLSTMCell(num_units=i) for i in num_units]
stack_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(stack_rnns)

# one zero state per layer, matching the list of cells defined above
h0 = [cell.zero_state(batch_size=32, dtype=tf.float32) for cell in stack_rnns]

inputs = tf.random_normal(shape=[32, 100], dtype=tf.float32)
output, state = stack_rnn_cell(inputs=inputs, state=h0)

output.shape
# TensorShape([Dimension(32), Dimension(128)])

state[0].c.shape, state[0].h.shape
# (TensorShape([Dimension(32), Dimension(64)]),
#  TensorShape([Dimension(32), Dimension(64)]))

state[1].c.shape, state[1].h.shape
# (TensorShape([Dimension(32), Dimension(128)]),
#  TensorShape([Dimension(32), Dimension(128)]))
```

The relevant fragment of the source:

```python
cur_inp, new_state = cell(cur_inp, cur_state)
new_states.append(new_state)
```

As the source shows, the state of every layer is collected, while each layer's output becomes the input of the next layer; the output that is finally returned is that of the last layer.

tf.contrib.rnn.DropoutWrapper

Reference paper: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks


```python
@tf_export("nn.rnn_cell.DropoutWrapper")
class DropoutWrapper(RNNCell):
  """Operator adding dropout to inputs and outputs of the given cell."""

  def __init__(self, cell, input_keep_prob=1.0, output_keep_prob=1.0,
               state_keep_prob=1.0, variational_recurrent=False,
               input_size=None, dtype=None, seed=None,
               dropout_state_filter_visitor=None):
    """Create a cell with added input, state, and/or output dropout.

    If `variational_recurrent` is set to `True` (**NOT** the default behavior),
    then the same dropout mask is applied at every step, as described in:
    Y. Gal, Z Ghahramani. "A Theoretically Grounded Application of Dropout in
    Recurrent Neural Networks". https://arxiv.org/abs/1512.05287
```



If variational_recurrent is set to True, the *same* dropout mask is reused at every time step of a run, instead of a new mask being sampled at each step.





```python
    Otherwise a different dropout mask is applied at every time step.

    Note, by default (unless a custom `dropout_state_filter` is provided),
    the memory state (`c` component of any `LSTMStateTuple`) passing through
    a `DropoutWrapper` is never modified. This behavior is described in the
    above article.

    Args:
      cell: an RNNCell, a projection to output_size is added to it.
      input_keep_prob: unit Tensor or float between 0 and 1, input keep
        probability; if it is constant and 1, no input dropout will be added.
      output_keep_prob: unit Tensor or float between 0 and 1, output keep
        probability; if it is constant and 1, no output dropout will be added.
      state_keep_prob: unit Tensor or float between 0 and 1, output keep
        probability; if it is constant and 1, no output dropout will be added.
        State dropout is performed on the outgoing states of the cell.
        **Note** the state components to which dropout is applied when
        `state_keep_prob` is in `(0, 1)` are also determined by
        the argument `dropout_state_filter_visitor` (e.g. by default dropout
        is never applied to the `c` component of an `LSTMStateTuple`).
```



The three parameters above control dropout on the input, output, and state respectively. Note that they are *keep* probabilities, not dropout rates: a constant 1.0 means no dropout at all.



```python
      variational_recurrent: Python bool. If `True`, then the same
        dropout pattern is applied across all time steps per run call.
        If this parameter is set, `input_size` **must** be provided.
```



Again: when this flag is True, one dropout pattern is sampled and shared across all time steps of a run, rather than dropout being re-sampled at each step; `input_size` must then be provided.



```python
      input_size: (optional) (possibly nested tuple of) `TensorShape` objects
        containing the depth(s) of the input tensors expected to be passed in to
        the `DropoutWrapper`. Required and used **iff**
        `variational_recurrent = True` and `input_keep_prob < 1`.
      dtype: (optional) The `dtype` of the input, state, and output tensors.
        Required and used **iff** `variational_recurrent = True`.
      seed: (optional) integer, the randomness seed.
      dropout_state_filter_visitor: (optional), default: (see below). Function
        that takes any hierarchical level of the state and returns
        a scalar or depth=1 structure of Python booleans describing
        which terms in the state should be dropped out. In addition, if the
        function returns `True`, dropout is applied across this sublevel. If
        the function returns `False`, dropout is not applied across this entire
        sublevel.
        Default behavior: perform dropout on all terms except the memory (`c`)
        state of `LSTMCellState` objects, and don't try to apply dropout to
        `TensorArray` objects:

    Raises:
      TypeError: if `cell` is not an `RNNCell`, or `keep_state_fn` is provided
        but not `callable`.
      ValueError: if any of the keep_probs are not between 0 and 1.
    """
```

```python
cell = tf.nn.rnn_cell.DropoutWrapper(cell=tf.nn.rnn_cell.LSTMCell(num_units=128),
                                     input_keep_prob=1.0,
                                     output_keep_prob=1.0,
                                     state_keep_prob=1.0)

cell.state_size, cell.output_size
# (LSTMStateTuple(c=128, h=128), 128)
```
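A hedged sketch of the variational variant described in the docstring above (the 0.8 keep probabilities and the 100-d input depth are my own illustrative choices):

```python
base_cell = tf.nn.rnn_cell.LSTMCell(num_units=64)
var_dropout_cell = tf.nn.rnn_cell.DropoutWrapper(
    base_cell,
    input_keep_prob=0.8,
    output_keep_prob=0.8,
    state_keep_prob=0.8,
    variational_recurrent=True,          # reuse one dropout mask for all steps
    input_size=tf.TensorShape([100]),    # depth of the per-step inputs
    dtype=tf.float32)
```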
```python
# multi-layer rnn with dropout on each layer's output
from tensorflow.nn.rnn_cell import *

NUM_UNITS = [32, 64, 128]
rnn = MultiRNNCell([DropoutWrapper(LSTMCell(num_units=n), output_keep_prob=0.8)
                    for n in NUM_UNITS])

rnn.output_size, rnn.state_size
# (128,
#  (LSTMStateTuple(c=32, h=32),
#   LSTMStateTuple(c=64, h=64),
#   LSTMStateTuple(c=128, h=128)))
```

tf.nn.dynamic_rnn

All the classes above only define how to compute the output and state of a single time step, but RNNs process sequences. So how do we unroll these cell objects over a whole sequence? That is what tf.nn.dynamic_rnn does.

```python
def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None,
                dtype=None, parallel_iterations=None, swap_memory=False,
                time_major=False, scope=None):
  """Creates a recurrent neural network specified by RNNCell `cell`.

  Performs fully dynamic unrolling of `inputs`.
  """
```

```python
rnn_layers = [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(num_units=n))
              for n in [32, 64]]
cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)

inputs = tf.random_normal(shape=[30, 10, 100])   # [batch, max_time, depth]

initial_state = cell.zero_state(batch_size=30, dtype=tf.float32)

output, state = tf.nn.dynamic_rnn(cell, inputs=inputs,
                                  initial_state=initial_state, dtype=tf.float32)

output.shape
# TensorShape([Dimension(30), Dimension(10), Dimension(64)])

cell.state_size
# (LSTMStateTuple(c=32, h=32), LSTMStateTuple(c=64, h=64))
```
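One extra hedged sketch, reusing `cell`, `inputs`, and `initial_state` from above: the `sequence_length` argument (the lengths below are made up for illustration) masks out steps past each sequence's true length.

```python
seq_len = tf.constant([10, 7, 3] + [10] * 27)   # one length per batch element
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs=inputs, sequence_length=seq_len,
    initial_state=initial_state, dtype=tf.float32)
# For batch element 1 (length 7), outputs are zero after its 7th step, and its
# final_state is the state produced at that last valid step, not at step 10.
```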

So far so good. Next up: wrapping these RNN cells with attention.