Reading the TensorFlow RNN API Source Code

After interning at Samsung Research for a while, I've noticed that writing code in industry is quite different from school. First, industry cares about efficiency, so you lean heavily on the official, well-packaged APIs, whereas at school, in order to understand the internals, you mostly reinvent the wheel and end up unfamiliar with many APIs; in practice the official APIs beat hand-rolled code in both speed and completeness. Second, industry demands much more reusability: for every model version you must keep the corresponding parameters around so the model can be run again at any time instead of being retrained, which makes saving models and parameters essential. Likewise, the evaluation metrics on the test set need to be fully covered in code, rather than just eyeballing the loss and accuracy.

This post walks through the RNN API in TensorFlow in detail, digging in step by step following the order of the RNN and Cells (contrib) documentation.

First, a quick review of RNN/LSTM/GRU:

See my earlier cs224d notes:
- cs224d-lecture9: machine translation
- cs224d-lecture8: RNN (I found a few small mistakes in it, but they don't hurt the review)

The basic RNN:

\[h_t = \sigma(W_{hh}h_{t-1}+W_{hx}x_t)\]
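
As a reminder of what a single step looks like, here is a minimal numpy sketch of this update (my own illustration, not TensorFlow code; the weights are random placeholders and tanh stands in for the nonlinearity \(\sigma\)):

import numpy as np

def basic_rnn_step(x_t, h_prev, W_hh, W_hx):
    # h_t = sigma(W_hh h_{t-1} + W_hx x_t), using tanh as the nonlinearity
    return np.tanh(h_prev @ W_hh.T + x_t @ W_hx.T)

batch, input_dim, hidden_dim = 2, 5, 3
rng = np.random.RandomState(0)
W_hh = rng.randn(hidden_dim, hidden_dim)
W_hx = rng.randn(hidden_dim, input_dim)
x_t = rng.randn(batch, input_dim)
h_prev = np.zeros((batch, hidden_dim))
print(basic_rnn_step(x_t, h_prev, W_hh, W_hx).shape)   # (2, 3)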

Let's start with tf.contrib.rnn.RNNCell:

https://github.com/tensorflow/tensorflow/blob/4dcfddc5d12018a5a0fdca652b9221ed95e9eb23/tensorflow/python/ops/rnn_cell_impl.py#L170

@tf_export("nn.rnn_cell.RNNCell")
class RNNCell(base_layer.Layer):
"""Abstract object representing an RNN cell.
Every `RNNCell` must have the properties below and implement `call` with
the signature `(output, next_state) = call(input, state)`.

RNNCell 是一个抽象的父类,之后更复杂的 RNN/LSTM/GRU 都是重新实现 call 函数,也就是更新隐藏状态
的方式改变了。

The optional third input argument, `scope`, is allowed for backwards compatibility
purposes; but should be left off for new subclasses.

scope 这个参数管理变量,在反向传播中变量是否可训练。

This definition of cell differs from the definition used in the literature.
In the literature, 'cell' refers to an object with a single scalar output.
This definition refers to a horizontal array of such units.

这里的 cell 的概念和一些论文中是不一样的。在论文中,cell 表示一个神经元,也就是单个值。而这里表示的是
一组神经元,比如隐藏状态[batch, num_units].

An RNN cell, in the most abstract setting, is anything that has
a state and performs some operation that takes a matrix of inputs.
This operation results in an output matrix with `self.output_size` columns.
If `self.state_size` is an integer, this operation also results in a new
state matrix with `self.state_size` columns. If `self.state_size` is a
(possibly nested tuple of) TensorShape object(s), then it should return a
matching structure of Tensors having shape `[batch_size].concatenate(s)`
for each `s` in `self.batch_size`.

rnn cell 的输入是一个状态 state 和 input 矩阵,参数有 self.output_size 和 self.state_size.
分别表示输出层和隐藏层的维度。其中 state_size 可能是 tuple,这个之后在看。
"""

  def __call__(self, inputs, state, scope=None):
    """Run this RNN cell on inputs, starting from the given state.
    Args:
      inputs: `2-D` tensor with shape `[batch_size, input_size]`.
      state: if `self.state_size` is an integer, this should be a `2-D Tensor`
        with shape `[batch_size, self.state_size]`. Otherwise, if
        `self.state_size` is a tuple of integers, this should be a tuple
        with shapes `[batch_size, s] for s in self.state_size`.
      scope: VariableScope for the created subgraph; defaults to class name.
    Returns:
      A pair containing:
      - Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.
      - New state: Either a single `2-D` tensor, or a tuple of tensors matching
        the arity and shapes of `state`.
    """
    if scope is not None:
      with vs.variable_scope(scope,
                             custom_getter=self._rnn_get_variable) as scope:
        return super(RNNCell, self).__call__(inputs, state, scope=scope)
    else:
      scope_attrname = "rnncell_scope"
      scope = getattr(self, scope_attrname, None)
      if scope is None:
        scope = vs.variable_scope(vs.get_variable_scope(),
                                  custom_getter=self._rnn_get_variable)
        setattr(self, scope_attrname, scope)
      with scope:
        return super(RNNCell, self).__call__(inputs, state)

  def _rnn_get_variable(self, getter, *args, **kwargs):
    variable = getter(*args, **kwargs)
    if context.executing_eagerly():
      trainable = variable._trainable  # pylint: disable=protected-access
    else:
      trainable = (
          variable in tf_variables.trainable_variables() or
          (isinstance(variable, tf_variables.PartitionedVariable) and
           list(variable)[0] in tf_variables.trainable_variables()))
    if trainable and variable not in self._trainable_weights:
      self._trainable_weights.append(variable)
    elif not trainable and variable not in self._non_trainable_weights:
      self._non_trainable_weights.append(variable)
    return variable

  @property
  def state_size(self):
    """size(s) of state(s) used by this cell.
    It can be represented by an Integer, a TensorShape or a tuple of Integers
    or TensorShapes.
    """
    raise NotImplementedError("Abstract method")

  @property
  def output_size(self):
    """Integer or TensorShape: size of outputs produced by this cell."""
    raise NotImplementedError("Abstract method")

  def build(self, _):
    # This tells the parent Layer object that it's OK to call
    # self.add_variable() inside the call() method.
    pass

  def zero_state(self, batch_size, dtype):
    """Return zero-filled state tensor(s).
    Args:
      batch_size: int, float, or unit Tensor representing the batch size.
      dtype: the data type to use for the state.
    Returns:
      If `state_size` is an int or TensorShape, then the return value is a
      `N-D` tensor of shape `[batch_size, state_size]` filled with zeros.
      If `state_size` is a nested list or tuple, then the return value is
      a nested list or tuple (of the same structure) of `2-D` tensors with
      the shapes `[batch_size, s]` for each s in `state_size`.
    """
    # Try to use the last cached zero_state. This is done to avoid recreating
    # zeros, especially when eager execution is enabled.
    state_size = self.state_size
    is_eager = context.executing_eagerly()
    if is_eager and hasattr(self, "_last_zero_state"):
      (last_state_size, last_batch_size, last_dtype,
       last_output) = getattr(self, "_last_zero_state")
      if (last_batch_size == batch_size and
          last_dtype == dtype and
          last_state_size == state_size):
        return last_output
    with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
      output = _zero_state_tensors(state_size, batch_size, dtype)
    if is_eager:
      self._last_zero_state = (state_size, batch_size, dtype, output)
    return output

The two properties output_size and state_size give the dimensionality of the output and of the hidden state respectively. The call method computes the hidden state and output for the next time step, and zero_state initializes the initial state with zeros. state_size comes in two flavors: if it is an int or TensorShape, the zero state is a single tensor of shape [batch, state_size]; for nested/multi-layer RNNs it is a matching structure of tensors of shape [batch, s] for each s in state_size.

class LayerRNNCell(RNNCell):
  """Subclass of RNNCells that act like proper `tf.Layer` objects."""

  def __call__(self, inputs, state, scope=None, *args, **kwargs):
    """Run this RNN cell on inputs, starting from the given state.
    Args:
      inputs: `2-D` tensor with shape `[batch_size, input_size]`.
      state: if `self.state_size` is an integer, this should be a `2-D Tensor`
        with shape `[batch_size, self.state_size]`. Otherwise, if
        `self.state_size` is a tuple of integers, this should be a tuple
        with shapes `[batch_size, s] for s in self.state_size`.
      scope: optional cell scope.
      *args: Additional positional arguments.
      **kwargs: Additional keyword arguments.
    Returns:
      A pair containing:
      - Output: A `2-D` tensor with shape `[batch_size, self.output_size]`.
      - New state: Either a single `2-D` tensor, or a tuple of tensors matching
        the arity and shapes of `state`.
    """
    # Bypass RNNCell's variable capturing semantics for LayerRNNCell.
    # Instead, it is up to subclasses to provide a proper build
    # method. See the class docstring for more details.
    return base_layer.Layer.__call__(self, inputs, state, scope=scope,
                                     *args, **kwargs)

Next, the Core RNN Cells:

  • tf.contrib.rnn.BasicRNNCell
  • tf.contrib.rnn.BasicLSTMCell
  • tf.contrib.rnn.GRUCell
  • tf.contrib.rnn.LSTMCell
  • tf.contrib.rnn.LayerNormBasicLSTMCell

tf.contrib.rnn.BasicRNNCell

Straight to the source:

@tf_export("nn.rnn_cell.BasicRNNCell")
class BasicRNNCell(LayerRNNCell):
"""The most basic RNN cell.
Args:
num_units: int, The number of units in the RNN cell.
activation: Nonlinearity to use. Default: `tanh`.
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not `True`, and the existing scope already has
the given variables, an error is raised.
name: String, the name of the layer. Layers with the same name will
share weights, but to avoid mistakes we require reuse=True in such
cases.
dtype: Default dtype of the layer (default of `None` means use the type
of the first input). Required when `build` is called before `call`.
"""

def __init__(self,
num_units,
activation=None,
reuse=None,
name=None,
dtype=None):
super(BasicRNNCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)

# Inputs must be 2-dimensional.
self.input_spec = base_layer.InputSpec(ndim=2)

self._num_units = num_units
self._activation = activation or math_ops.tanh

@property
def state_size(self):
return self._num_units

@property
def output_size(self):
return self._num_units

def build(self, inputs_shape):
if inputs_shape[1].value is None:
raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"
% inputs_shape)

input_depth = inputs_shape[1].value
self._kernel = self.add_variable(
_WEIGHTS_VARIABLE_NAME,
shape=[input_depth + self._num_units, self._num_units])
self._bias = self.add_variable(
_BIAS_VARIABLE_NAME,
shape=[self._num_units],
initializer=init_ops.zeros_initializer(dtype=self.dtype))

self.built = True

def call(self, inputs, state):
"""Most basic RNN: output = new_state = act(W * input + U * state + B)."""

gate_inputs = math_ops.matmul(
array_ops.concat([inputs, state], 1), self._kernel)
gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
output = self._activation(gate_inputs)
return output, output

So state_size = output_size = num_units, and the output is exactly the next hidden state: output = new_state = act(W * input + U * state + B) = act(W [input, state] + b), where self._kernel plays the role of the combined W and has shape [input_depth + num_units, num_units] (see the verification sketch after the usage example below).

import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()
print(tfe.executing_eagerly())
True
cell = tf.contrib.rnn.BasicRNNCell(num_units=128, activation=None)
print(cell.state_size, cell.output_size)
128 128
inputs = tf.random_normal(shape=[32, 100], dtype=tf.float32)
h0 = cell.zero_state(batch_size=32, dtype=tf.float32)
output, state = cell(inputs=inputs, state=h0)
print(output.shape, state.shape)
(32, 128) (32, 128)
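
Continuing the eager session above (reusing cell, inputs, h0 and output), we can redo the cell's computation by hand with the private _kernel and _bias attributes shown in the source, just to confirm the formula; a small sketch:

# The kernel is [input_depth + num_units, num_units] = [100 + 128, 128]
print(cell._kernel.shape)                              # (228, 128)
# Redo call(): act(concat([inputs, h], 1) @ kernel + bias)
manual = tf.tanh(tf.matmul(tf.concat([inputs, h0], 1), cell._kernel) + cell._bias)
print(float(tf.reduce_max(tf.abs(output - manual))))   # ~0.0 up to float rounding; output is also the new state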

tf.contrib.rnn.BasicLSTMCell

First, a quick review of the LSTM:

I'll try to type out the formulas from memory; they look simple with the diagram in front of me, but can I still do it without it?

Three gates: the forget gate, the input/update gate, and the output gate:

\[f_t=\sigma(W^{f}x_t + U^{f}h_{t-1})\] \[i_t=\sigma(W^{i}x_t + U^{i}h_{t-1})\] \[o_t=\sigma(W^{o}x_t + U^{o}h_{t-1})\]

new memory cell:

\[\hat c=tanh(W^cx_t + U^ch_{t-1})\]

The input gate acts on the new memory cell and the forget gate on the previous memory cell, giving the final memory cell:

\[c_t=f_t\circ c_{t-1} + i_t\circ\hat c\]

The new memory cell and the output gate give the new hidden state:

\[h_t = o_t\circ tanh(c_t)\]

class BasicLSTMCell(LayerRNNCell):
  """Basic LSTM recurrent network cell.
  The implementation is based on: http://arxiv.org/abs/1409.2329.
  We add forget_bias (default: 1) to the biases of the forget gate in order to
  reduce the scale of forgetting in the beginning of the training.
  It does not allow cell clipping, a projection layer, and does not
  use peep-hole connections: it is the basic baseline.
  For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell}
  that follows.
  """

  def __init__(self,
               num_units,
               forget_bias=1.0,
               state_is_tuple=True,
               activation=None,
               reuse=None,
               name=None,
               dtype=None):
    """Initialize the basic LSTM cell.
    Args:
      num_units: int, The number of units in the LSTM cell.
      forget_bias: float, The bias added to forget gates (see above).
        Must set to `0.0` manually when restoring from CudnnLSTM-trained
        checkpoints.
      state_is_tuple: If True, accepted and returned states are 2-tuples of
        the `c_state` and `m_state`. If False, they are concatenated
        along the column axis. The latter behavior will soon be deprecated.
      activation: Activation function of the inner states. Default: `tanh`.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
      name: String, the name of the layer. Layers with the same name will
        share weights, but to avoid mistakes we require reuse=True in such
        cases.
      dtype: Default dtype of the layer (default of `None` means use the type
        of the first input). Required when `build` is called before `call`.
        When restoring from CudnnLSTM-trained checkpoints, must use
        `CudnnCompatibleLSTMCell` instead.
    """
    super(BasicLSTMCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)
    if not state_is_tuple:
      logging.warn("%s: Using a concatenated state is slower and will soon be "
                   "deprecated. Use state_is_tuple=True.", self)

    # Inputs must be 2-dimensional.
    self.input_spec = base_layer.InputSpec(ndim=2)

    self._num_units = num_units
    self._forget_bias = forget_bias
    self._state_is_tuple = state_is_tuple
    self._activation = activation or math_ops.tanh

  @property
  def state_size(self):
    return (LSTMStateTuple(self._num_units, self._num_units)
            if self._state_is_tuple else 2 * self._num_units)

  @property
  def output_size(self):
    return self._num_units

  def build(self, inputs_shape):
    if inputs_shape[1].value is None:
      raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"
                       % inputs_shape)

    input_depth = inputs_shape[1].value
    h_depth = self._num_units
    self._kernel = self.add_variable(
        _WEIGHTS_VARIABLE_NAME,
        shape=[input_depth + h_depth, 4 * self._num_units])
    self._bias = self.add_variable(
        _BIAS_VARIABLE_NAME,
        shape=[4 * self._num_units],
        initializer=init_ops.zeros_initializer(dtype=self.dtype))

    self.built = True

  def call(self, inputs, state):
    """Long short-term memory cell (LSTM).
    Args:
      inputs: `2-D` tensor with shape `[batch_size, input_size]`.
      state: An `LSTMStateTuple` of state tensors, each shaped
        `[batch_size, num_units]`, if `state_is_tuple` has been set to
        `True`. Otherwise, a `Tensor` shaped
        `[batch_size, 2 * num_units]`.
    Returns:
      A pair containing the new hidden state, and the new state (either a
        `LSTMStateTuple` or a concatenated state, depending on
        `state_is_tuple`).
    """
    sigmoid = math_ops.sigmoid
    one = constant_op.constant(1, dtype=dtypes.int32)
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
      c, h = state
    else:
      c, h = array_ops.split(value=state, num_or_size_splits=2, axis=one)

    gate_inputs = math_ops.matmul(
        array_ops.concat([inputs, h], 1), self._kernel)
    gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(
        value=gate_inputs, num_or_size_splits=4, axis=one)

    forget_bias_tensor = constant_op.constant(self._forget_bias, dtype=f.dtype)
    # Note that using `add` and `multiply` instead of `+` and `*` gives a
    # performance improvement. So using those at the cost of readability.
    add = math_ops.add
    multiply = math_ops.multiply
    new_c = add(multiply(c, sigmoid(add(f, forget_bias_tensor))),
                multiply(sigmoid(i), self._activation(j)))
    new_h = multiply(self._activation(new_c), sigmoid(o))

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    return new_h, new_state

Reading the source, the implementation differs slightly from the formulas above (see the numpy sketch after this list):
- First [input, h] are concatenated, then gate_input = matmul(concat([input, h], 1), self._kernel) + self._bias, so there is an extra bias term, and the kernel has shape [input_depth + h_depth, 4 * num_units]. Then i, j, f, o = split(gate_input, 4, axis=1), where j is the new memory cell candidate, and new_c is computed from them. The activations of i, f, o are fixed to sigmoid, since gate values must lie in (0, 1), while the activation for j (self._activation) is configurable and defaults to tanh.
- The second difference is self._forget_bias: a bias is added to the forget gate before the \(\sigma\), so that not too much information is forgotten early in training.
- Also mind the form of state, which depends on self._state_is_tuple; c, h = state corresponds to \(c_{t-1}, h_{t-1}\).
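
Putting the formulas and the implementation together, here is a minimal numpy sketch of one BasicLSTMCell step (my own illustration, not TF code; the kernel/bias arrays are made up and sigmoid is defined locally):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def basic_lstm_step(x, c, h, kernel, bias, forget_bias=1.0):
    # kernel: [input_depth + num_units, 4 * num_units], as in build() above
    gate_inputs = np.concatenate([x, h], axis=1) @ kernel + bias
    i, j, f, o = np.split(gate_inputs, 4, axis=1)     # input gate, candidate, forget gate, output gate
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_h, (new_c, new_h)                      # output, LSTMStateTuple(c, h)

batch, input_depth, num_units = 2, 5, 3
rng = np.random.RandomState(0)
kernel = rng.randn(input_depth + num_units, 4 * num_units)
bias = np.zeros(4 * num_units)
x = rng.randn(batch, input_depth)
c = h = np.zeros((batch, num_units))
out, (new_c, new_h) = basic_lstm_step(x, c, h, kernel, bias)
print(out.shape, new_c.shape)   # (2, 3) (2, 3)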

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=128, forget_bias=1.0, state_is_tuple=True)
WARNING:tensorflow:From <ipython-input-9-3f4ca183c5d7>:1: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').

The warning says this class is deprecated, so let's switch to the recommended one:

lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=128, forget_bias=1.0, state_is_tuple=True)
lstm_cell.output_size, lstm_cell.state_size
(128, LSTMStateTuple(c=128, h=128))
h0 = lstm_cell.zero_state(batch_size=30, dtype=tf.float32)

So the LSTM state is a tuple, holding \(c_t\) and \(h_t\).

class LSTMStateTuple(_LSTMStateTuple):
  """Tuple used by LSTM Cells for `state_size`, `zero_state`, and output state.
  Stores two elements: `(c, h)`, in that order. Where `c` is the hidden state
  and `h` is the output.
  """
This wording feels slightly off ("c is the hidden state and h is the output"). Look at the source:

    new_c = add(multiply(c, sigmoid(add(f, forget_bias_tensor))),
                multiply(sigmoid(i), self._activation(j)))
    new_h = multiply(self._activation(new_c), sigmoid(o))

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    return new_h, new_state

So c is really the (new) memory cell, and h is the final hidden state.

# compute the output and state for the next time step
inputs = tf.random_normal(shape=[30, 100], dtype=tf.float32)
output, state = lstm_cell(inputs, h0)
output.shape, state[0].shape, state[1].shape
(TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(128)]))
state.c, state.h  # c and h have different values
(<tf.Tensor: id=108, shape=(30, 128), dtype=float32, numpy=
 array([[ 0.08166471,  0.14020835,  0.07970127, ..., -0.1540019 ,
          0.38848224, -0.0842322 ],
        [-0.03643086, -0.20558938,  0.1503458 , ...,  0.01846285,
          0.15610473,  0.04408235],
        [-0.0933667 ,  0.03454542, -0.09073547, ..., -0.12701994,
         -0.34669587,  0.09373946],
        ...,
        [-0.00752909,  0.22412673, -0.270195  , ...,  0.09341058,
         -0.20986181, -0.18622127],
        [ 0.18778914,  0.37687936, -0.24727295, ..., -0.06409463,
          0.00218048,  0.5940756 ],
        [ 0.04073388, -0.08431841,  0.35944715, ...,  0.14135318,
          0.08472287, -0.11058106]], dtype=float32)>,
 <tf.Tensor: id=111, shape=(30, 128), dtype=float32, numpy=
 array([[ 0.04490132,  0.07412361,  0.03662094, ..., -0.07611651,
          0.17290959, -0.0277745 ],
        [-0.02212535, -0.13554382,  0.08272093, ...,  0.00918258,
          0.0861209 ,  0.02614526],
        [-0.05723168,  0.01372226, -0.02919216, ..., -0.06374882,
         -0.1918035 ,  0.03912015],
        ...,
        [-0.00377504,  0.15181372, -0.14555399, ...,  0.06073361,
         -0.09804281, -0.07492835],
        [ 0.10244624,  0.17440473, -0.09896267, ..., -0.03794969,
          0.00123257,  0.21985768],
        [ 0.01832823, -0.03795732,  0.1654894 , ...,  0.05827027,
          0.02769112, -0.05957894]], dtype=float32)>)

tf.nn.rnn_cell.GRUCell

First, a quick review of the GRU.

The GRU formulas from memory: \[r_t=\sigma(W^rx_t + U^rh_{t-1})\] \[z_t=\sigma(W^zx_t + U^zh_{t-1})\] \[\tilde h_t = tanh(Wx_t + r_t\circ Uh_{t-1})\] \[h_t=(1-z_t)\circ\tilde h_t + z_t\circ h_{t-1}\]

@tf_export("nn.rnn_cell.GRUCell")
class GRUCell(LayerRNNCell):
"""Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078).
Args:
num_units: int, The number of units in the GRU cell.
activation: Nonlinearity to use. Default: `tanh`.
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not `True`, and the existing scope already has
the given variables, an error is raised.
kernel_initializer: (optional) The initializer to use for the weight and
projection matrices.
bias_initializer: (optional) The initializer to use for the bias.
name: String, the name of the layer. Layers with the same name will
share weights, but to avoid mistakes we require reuse=True in such
cases.
dtype: Default dtype of the layer (default of `None` means use the type
of the first input). Required when `build` is called before `call`.
"""

def __init__(self,
num_units,
activation=None,
reuse=None,
kernel_initializer=None,
bias_initializer=None,
name=None,
dtype=None):
super(GRUCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)

# Inputs must be 2-dimensional.
self.input_spec = base_layer.InputSpec(ndim=2)

self._num_units = num_units
self._activation = activation or math_ops.tanh
self._kernel_initializer = kernel_initializer
self._bias_initializer = bias_initializer

@property
def state_size(self):
return self._num_units

@property
def output_size(self):
return self._num_units

def build(self, inputs_shape):
if inputs_shape[1].value is None:
raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"
% inputs_shape)

input_depth = inputs_shape[1].value
self._gate_kernel = self.add_variable(
"gates/%s" % _WEIGHTS_VARIABLE_NAME,
shape=[input_depth + self._num_units, 2 * self._num_units],
initializer=self._kernel_initializer)
self._gate_bias = self.add_variable(
"gates/%s" % _BIAS_VARIABLE_NAME,
shape=[2 * self._num_units],
initializer=(
self._bias_initializer
if self._bias_initializer is not None
else init_ops.constant_initializer(1.0, dtype=self.dtype)))
self._candidate_kernel = self.add_variable(
"candidate/%s" % _WEIGHTS_VARIABLE_NAME,
shape=[input_depth + self._num_units, self._num_units],
initializer=self._kernel_initializer)
self._candidate_bias = self.add_variable(
"candidate/%s" % _BIAS_VARIABLE_NAME,
shape=[self._num_units],
initializer=(
self._bias_initializer
if self._bias_initializer is not None
else init_ops.zeros_initializer(dtype=self.dtype)))

self.built = True

def call(self, inputs, state):
"""Gated recurrent unit (GRU) with nunits cells."""

gate_inputs = math_ops.matmul(
array_ops.concat([inputs, state], 1), self._gate_kernel)
gate_inputs = nn_ops.bias_add(gate_inputs, self._gate_bias)

value = math_ops.sigmoid(gate_inputs)
r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)

r_state = r * state

candidate = math_ops.matmul(
array_ops.concat([inputs, r_state], 1), self._candidate_kernel)
candidate = nn_ops.bias_add(candidate, self._candidate_bias)

c = self._activation(candidate)
new_h = u * state + (1 - u) * c
return new_h, new_h


_LSTMStateTuple = collections.namedtuple("LSTMStateTuple", ("c", "h"))

Reading the source carefully, there is a bias term both before the \(\sigma\) that computes the gates and before the tanh that computes the candidate, even though neither appears in the formulas above. Also, when no bias initializer is given, the GRU's gate bias defaults to 1, whereas the LSTM's bias defaults to 0 (with forget_bias added to the forget gate at call time). A sketch of the step as implemented follows.
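
To pin down the conventions, here is a minimal numpy sketch of one GRUCell step exactly as written in call() above (my own illustration; the weights are made up and the gate bias starts at 1.0 as in the default initializer):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, gate_kernel, gate_bias, cand_kernel, cand_bias):
    # note new_h = u * h + (1 - u) * candidate, with u the update gate
    value = sigmoid(np.concatenate([x, h], axis=1) @ gate_kernel + gate_bias)
    r, u = np.split(value, 2, axis=1)                  # reset gate, update gate
    candidate = np.tanh(np.concatenate([x, r * h], axis=1) @ cand_kernel + cand_bias)
    return u * h + (1 - u) * candidate                 # output == new state

batch, input_depth, num_units = 2, 5, 3
rng = np.random.RandomState(0)
gate_kernel = rng.randn(input_depth + num_units, 2 * num_units)
gate_bias = np.ones(2 * num_units)                     # default gate bias initializer is 1.0
cand_kernel = rng.randn(input_depth + num_units, num_units)
cand_bias = np.zeros(num_units)
x = rng.randn(batch, input_depth)
h = np.zeros((batch, num_units))
print(gru_step(x, h, gate_kernel, gate_bias, cand_kernel, cand_bias).shape)   # (2, 3)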

gru_cell = tf.nn.rnn_cell.GRUCell(num_units=128)
gru_cell.state_size, gru_cell.output_size
(128, 128)
h0 = gru_cell.zero_state(batch_size=30, dtype=tf.float32)
inputs = tf.random_normal(shape=[30, 100], dtype=tf.float32)
output, state = gru_cell(inputs, h0)
output.shape, state.shape
(TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(128)]))

Something curious happens if I write it like this:

output, state = gru_cell.call(inputs, h0)  # on a fresh cell this raises an error: no self._gate_kernel attribute.
# The reason: calling call() directly bypasses Layer.__call__, which is what runs build() and creates the
# variables. It only works here because the cell was already built by the gru_cell(inputs, h0) call above.
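
A small sketch to illustrate, reusing inputs and h0 from above (fresh_cell is just a name I made up): on a cell that has never been called, call() fails because build() has not run yet, whereas __call__ builds the variables first.

fresh_cell = tf.nn.rnn_cell.GRUCell(num_units=128)
print(fresh_cell.built)        # False: no variables created yet
# fresh_cell.call(inputs, h0)  # would raise AttributeError: no _gate_kernel
_ = fresh_cell(inputs, h0)     # Layer.__call__ runs build() before call()
print(fresh_cell.built)        # True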

tf.nn.rnn_cell.LSTMCell, tf.contrib.rnn.LSTMCell

@tf_export("nn.rnn_cell.LSTMCell")
class LSTMCell(LayerRNNCell):
"""Long short-term memory unit (LSTM) recurrent network cell.
The default non-peephole implementation is based on:
http://www.bioinf.jku.at/publications/older/2604.pdf
S. Hochreiter and J. Schmidhuber.
"Long Short-Term Memory". Neural Computation, 9(8):1735-1780, 1997.
The peephole implementation is based on:
https://research.google.com/pubs/archive/43905.pdf
Hasim Sak, Andrew Senior, and Francoise Beaufays.
"Long short-term memory recurrent neural network architectures for
large scale acoustic modeling." INTERSPEECH, 2014.
The class uses optional peep-hole connections, optional cell clipping, and
an optional projection layer.
"""

def __init__(self, num_units,
use_peepholes=False, cell_clip=None,
initializer=None, num_proj=None, proj_clip=None,
num_unit_shards=None, num_proj_shards=None,
forget_bias=1.0, state_is_tuple=True,
activation=None, reuse=None, name=None, dtype=None):
"""Initialize the parameters for an LSTM cell.
"""
super(LSTMCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)

Compared with BasicLSTMCell, there are these four additional arguments:

    Args:
      use_peepholes: bool, set True to enable diagonal/peephole connections.
      cell_clip: (optional) A float value, if provided the cell state is clipped
        by this value prior to the cell output activation.
      num_proj: (optional) int, The output dimensionality for the projection
        matrices. If None, no projection is performed.
      proj_clip: (optional) A float value. If `num_proj > 0` and `proj_clip` is
        provided, then the projected values are clipped elementwise to within
        `[-proj_clip, proj_clip]`.

cell_clip is straightforward: it bounds the cell state, and hence the output and state. But what does num_proj do?

    if num_proj:
      self._state_size = (
          LSTMStateTuple(num_units, num_proj)
          if state_is_tuple else num_units + num_proj)
      self._output_size = num_proj
    else:
      self._state_size = (
          LSTMStateTuple(num_units, num_units)
          if state_is_tuple else 2 * num_units)
      self._output_size = num_units

From the source: when num_proj is set, an extra fully connected projection is applied to the state's h, and state_size becomes num_units + num_proj (or LSTMStateTuple(num_units, num_proj) in the tuple case), while proj_clip clips the output of that projection.

BasicLSTMCell and LSTMCell also differ in that the latter adds peephole connections and cell_clip.

The input/output shapes and the state layout are the same; only the internal computation becomes more involved. Peepholes mean that \(c_{t-1}\) (and \(c_t\)) are also taken into account when computing the gates.

cell = tf.nn.rnn_cell.LSTMCell(num_units=64,cell_clip=0.000000001, num_proj=128, proj_clip=0.001)
cell.state_size, cell.output_size
(LSTMStateTuple(c=64, h=128), 128)

Note that the h dimension of state_size has changed: it is as if a fully connected layer were appended to state.h at every time step. In a decoder one could set num_proj to the vocabulary size, so that each step's output is already a distribution over the vocabulary (with a softmax on top, presumably?). But then the hidden state \(h_{t-1}\) fed into the next step would also have vocabulary-sized dimensionality... so it's probably better not to use the parameter that way.

h0 = cell.zero_state(batch_size=30, dtype=tf.float32)
h0.c.shape, h0.h.shape
(TensorShape([Dimension(30), Dimension(64)]),
 TensorShape([Dimension(30), Dimension(128)]))
inputs = tf.ones(shape=[30,50])
output, state = cell(inputs=inputs, state=h0)
output.shape, state.c.shape, state.h.shape
(TensorShape([Dimension(30), Dimension(128)]),
 TensorShape([Dimension(30), Dimension(64)]),
 TensorShape([Dimension(30), Dimension(128)]))

Other components that wrap RNN cells

Core RNN Cell wrappers (RNNCells that wrap other RNNCells)

  • tf.contrib.rnn.MultiRNNCell
  • tf.contrib.rnn.LSTMBlockWrapper
  • tf.contrib.rnn.DropoutWrapper
  • tf.contrib.rnn.EmbeddingWrapper
  • tf.contrib.rnn.InputProjectionWrapper
  • tf.contrib.rnn.OutputProjectionWrapper
  • tf.contrib.rnn.DeviceWrapper
  • tf.contrib.rnn.ResidualWrapper

We will mainly look at `tf.contrib.rnn.MultiRNNCell` and `tf.contrib.rnn.DropoutWrapper`; the others are so heavily encapsulated that they are not that useful in practice and are rarely used anyway.

tf.contrib.rnn.MultiRNNCell

class MultiRNNCell(RNNCell):
  """RNN cell composed sequentially of multiple simple cells."""

  def __init__(self, cells, state_is_tuple=True):
    """Create a RNN cell composed sequentially of a number of RNNCells.
    Args:
      cells: list of RNNCells that will be composed in this order.
      state_is_tuple: If True, accepted and returned states are n-tuples, where
        `n = len(cells)`. If False, the states are all
        concatenated along the column axis. This latter behavior will soon be
        deprecated.
    Raises:
      ValueError: if cells is empty (not allowed), or at least one of the cells
        returns a state tuple but the flag `state_is_tuple` is `False`.
    """
    super(MultiRNNCell, self).__init__()
    if not cells:
      raise ValueError("Must specify at least one cell for MultiRNNCell.")
    if not nest.is_sequence(cells):
      raise TypeError(
          "cells must be a list or tuple, but saw: %s." % cells)

    self._cells = cells
    for cell_number, cell in enumerate(self._cells):
      # Add Checkpointable dependencies on these cells so their variables get
      # saved with this object when using object-based saving.
      if isinstance(cell, checkpointable.CheckpointableBase):
        # TODO(allenl): Track down non-Checkpointable callers.
        self._track_checkpointable(cell, name="cell-%d" % (cell_number,))
    self._state_is_tuple = state_is_tuple
    if not state_is_tuple:
      if any(nest.is_sequence(c.state_size) for c in self._cells):
        raise ValueError("Some cells return tuples of states, but the flag "
                         "state_is_tuple is not set. State sizes are: %s"
                         % str([c.state_size for c in self._cells]))

  @property
  def state_size(self):
    if self._state_is_tuple:
      return tuple(cell.state_size for cell in self._cells)
    else:
      return sum([cell.state_size for cell in self._cells])

  @property
  def output_size(self):
    return self._cells[-1].output_size

  def zero_state(self, batch_size, dtype):
    with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
      if self._state_is_tuple:
        return tuple(cell.zero_state(batch_size, dtype) for cell in self._cells)
      else:
        # We know here that state_size of each cell is not a tuple and
        # presumably does not contain TensorArrays or anything else fancy
        return super(MultiRNNCell, self).zero_state(batch_size, dtype)

  def call(self, inputs, state):
    """Run this multi-layer cell on inputs, starting from state."""
    cur_state_pos = 0
    cur_inp = inputs
    new_states = []
    for i, cell in enumerate(self._cells):
      with vs.variable_scope("cell_%d" % i):
        if self._state_is_tuple:
          if not nest.is_sequence(state):
            raise ValueError(
                "Expected state to be a tuple of length %d, but received: %s" %
                (len(self.state_size), state))
          cur_state = state[i]
        else:
          cur_state = array_ops.slice(state, [0, cur_state_pos],
                                      [-1, cell.state_size])
          cur_state_pos += cell.state_size
        cur_inp, new_state = cell(cur_inp, cur_state)
        new_states.append(new_state)

    new_states = (tuple(new_states) if self._state_is_tuple else
                  array_ops.concat(new_states, 1))

    return cur_inp, new_states

The cells argument is a list or tuple whose elements are RNNCell objects. Note that MultiRNNCell itself still only computes the state for a single time step.

For now, ignore the bidirectional case and only consider depth, i.e. MultiRNNCell.

num_units = [64, 128]
stack_rnns = [tf.nn.rnn_cell.BasicLSTMCell(num_units=i) for i in num_units]
stack_rnn_cell = tf.nn.rnn_cell.MultiRNNCell(stack_rnns)
h0 = [cell.zero_state(batch_size=32, dtype=tf.float32) for cell in stack_rnns]  # or simply stack_rnn_cell.zero_state(32, tf.float32)
inputs = tf.random_normal(shape=[32, 100], dtype=tf.float32)
output, state = stack_rnn_cell(inputs=inputs, state=h0)
output.shape
TensorShape([Dimension(32), Dimension(128)])
state[0].c.shape, state[0].h.shape
(TensorShape([Dimension(32), Dimension(64)]),
 TensorShape([Dimension(32), Dimension(64)]))
state[1].c.shape, state[1].h.shape
(TensorShape([Dimension(32), Dimension(128)]),
 TensorShape([Dimension(32), Dimension(128)]))

One part of the source again:

cur_inp, new_state = cell(cur_inp, cur_state)
new_states.append(new_state)

As the source shows, the state of every layer is collected, the output of each layer is fed as the input to the next, and the final output is that of the last layer.
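
A quick check in the eager session above (reusing output and state from the MultiRNNCell call): there is one state per layer, and the output is the h of the top layer's state.

print(len(state))                                          # 2, one LSTMStateTuple per layer
print(bool(tf.reduce_all(tf.equal(output, state[-1].h))))  # True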

tf.contrib.rnn.DropoutWrapper

Reference paper: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.

@tf_export("nn.rnn_cell.DropoutWrapper")
class DropoutWrapper(RNNCell):
"""Operator adding dropout to inputs and outputs of the given cell."""

def __init__(self, cell, input_keep_prob=1.0, output_keep_prob=1.0,
state_keep_prob=1.0, variational_recurrent=False,
input_size=None, dtype=None, seed=None,
dropout_state_filter_visitor=None):
"""Create a cell with added input, state, and/or output dropout.
If `variational_recurrent` is set to `True` (**NOT** the default behavior),
then the same dropout mask is applied at every step, as described in:
Y. Gal, Z Ghahramani. "A Theoretically Grounded Application of Dropout in
Recurrent Neural Networks". https://arxiv.org/abs/1512.05287

如果参数 variational_recurrent 设置为 True,那么 dropout 在每一个时间步都会执行 dropout,


Otherwise a different dropout mask is applied at every time step.
Note, by default (unless a custom `dropout_state_filter` is provided),
the memory state (`c` component of any `LSTMStateTuple`) passing through
a `DropoutWrapper` is never modified. This behavior is described in the
above article.
Args:
cell: an RNNCell, a projection to output_size is added to it.


input_keep_prob: unit Tensor or float between 0 and 1, input keep
probability; if it is constant and 1, no input dropout will be added.
output_keep_prob: unit Tensor or float between 0 and 1, output keep
probability; if it is constant and 1, no output dropout will be added.
state_keep_prob: unit Tensor or float between 0 and 1, output keep
probability; if it is constant and 1, no output dropout will be added.
State dropout is performed on the outgoing states of the cell.
**Note** the state components to which dropout is applied when
`state_keep_prob` is in `(0, 1)` are also determined by
the argument `dropout_state_filter_visitor` (e.g. by default dropout
is never applied to the `c` component of an `LSTMStateTuple`).

上面三个参数分别表示 input,output,state 是否 dropout,以及 dropout 率。

variational_recurrent: Python bool. If `True`, then the same
dropout pattern is applied across all time steps per run call.
If this parameter is set, `input_size` **must** be provided.

这个参数如果为 True,那么每一个时间步都需要 dropout.

input_size: (optional) (possibly nested tuple of) `TensorShape` objects
containing the depth(s) of the input tensors expected to be passed in to
the `DropoutWrapper`. Required and used **iff**
`variational_recurrent = True` and `input_keep_prob < 1`.


dtype: (optional) The `dtype` of the input, state, and output tensors.
Required and used **iff** `variational_recurrent = True`.
seed: (optional) integer, the randomness seed.
dropout_state_filter_visitor: (optional), default: (see below). Function
that takes any hierarchical level of the state and returns
a scalar or depth=1 structure of Python booleans describing
which terms in the state should be dropped out. In addition, if the
function returns `True`, dropout is applied across this sublevel. If
the function returns `False`, dropout is not applied across this entire
sublevel.
Default behavior: perform dropout on all terms except the memory (`c`)
state of `LSTMCellState` objects, and don't try to apply dropout to
`TensorArray` objects:
Raises:
TypeError: if `cell` is not an `RNNCell`, or `keep_state_fn` is provided
but not `callable`.
ValueError: if any of the keep_probs are not between 0 and 1.
"""
cell = tf.nn.rnn_cell.DropoutWrapper(cell=tf.nn.rnn_cell.LSTMCell(num_units=128),
input_keep_prob=1.0,
output_keep_prob=1.0,
state_keep_prob=1.0)
cell.state_size, cell.output_size
(LSTMStateTuple(c=128, h=128), 128)
# multi-layer RNN
from tensorflow.nn.rnn_cell import *
NUM_UNITS = [32,64, 128]
rnn = MultiRNNCell([DropoutWrapper(LSTMCell(num_units=n), output_keep_prob=0.8) for n in NUM_UNITS])
rnn.output_size, rnn.state_size
(128,
 (LSTMStateTuple(c=32, h=32),
  LSTMStateTuple(c=64, h=64),
  LSTMStateTuple(c=128, h=128)))
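
Per the docstring above, variational (per-sequence) dropout reuses one mask across all time steps; a hedged sketch of how it would be configured (I haven't tuned these values; input_size and dtype are required here because variational_recurrent=True and input_keep_prob < 1):

var_cell = tf.nn.rnn_cell.DropoutWrapper(
    cell=tf.nn.rnn_cell.LSTMCell(num_units=128),
    input_keep_prob=0.8,
    output_keep_prob=0.8,
    state_keep_prob=0.8,
    variational_recurrent=True,
    input_size=tf.TensorShape([100]),   # depth of the inputs we plan to feed
    dtype=tf.float32)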

tf.nn.dynamic_rnn

Finally, all of the classes above only define how to compute the output and state for one time step, whereas an RNN processes a whole sequence. So how do we wrap these cell objects into a sequence-level RNN? That is what tf.nn.dynamic_rnn does.

def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None,
                dtype=None, parallel_iterations=None, swap_memory=False,
                time_major=False, scope=None):
  """Creates a recurrent neural network specified by RNNCell `cell`.
  Performs fully dynamic unrolling of `inputs`.
  """
rnn_layers = [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(num_units=n)) for n in [32, 64]]
cell = tf.nn.rnn_cell.MultiRNNCell(rnn_layers)
inputs = tf.random_normal(shape=[30, 10, 100])
initial_state = cell.zero_state(batch_size=30, dtype=tf.float32)
output, state = tf.nn.dynamic_rnn(cell,inputs=inputs, initial_state=initial_state, dtype=tf.float32)
output.shape
TensorShape([Dimension(30), Dimension(10), Dimension(64)])
cell.state_size
(LSTMStateTuple(c=32, h=32), LSTMStateTuple(c=64, h=64))
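
dynamic_rnn also handles variable-length sequences via sequence_length; a small sketch continuing the session above (the lengths are made up): outputs past each sequence's length come back as zeros, and the returned state is taken from the last valid step.

seq_len = tf.constant([10] * 15 + [6] * 15)   # one length per batch element (batch_size=30)
outputs2, state2 = tf.nn.dynamic_rnn(cell, inputs=inputs, sequence_length=seq_len,
                                     initial_state=initial_state, dtype=tf.float32)
print(outputs2.shape)   # (30, 10, 64); rows beyond a sequence's length are zeros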

So far, so good. Next up: wrapping these RNN cells with attention.