Paper Notes: Batch, Layer, and Weight Normalization

Papers:
- Batch Normalization
- Layer Normalization
- Weight Normalization

Batch Normalization

Already covered in detail in an earlier note: Deep Learning - Batch Normalization.

Layer Normalization

Motivation

batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.
Regarding batch normalization:

Here is a figure taken from Andrew Ng's course. A fully connected layer is easier to reason about than a convolutional layer, but the form is the same.
The number of samples is m, and the activations of layer l become the inputs of layer l+1. For the i-th neuron:

Summed (linear) input: \(a_i^l = {w_i^l}^T h^l\)
Nonlinear output: \(h_i^{l+1} = f(a_i^l + b_i^l)\)

Here f is a nonlinear activation function, and the outputs \(h^{l+1}\) become the summed inputs of the next layer. If these outputs change in a highly correlated way, the gradients of the next layer's weights \(w^{l+1}\) change just as much, because in backpropagation the gradient of \(w^{l+1}\) contains \(h^{l+1}\) as a factor.
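To make that last claim concrete, here is the usual fully connected gradient written out (my notation, not the paper's: \(\delta_j^{l+1}\) is the error signal at the summed input of neuron j in layer l+1):

\[ \frac{\partial L}{\partial w_{ij}^{l+1}} = \delta_j^{l+1}\, h_i^{l+1}, \qquad \delta_j^{l+1} = \frac{\partial L}{\partial a_j^{l+1}} \]

so any large, correlated shift in the previous layer's outputs shows up directly in the weight gradients of the next layer.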

Batch normalization normalizes this summed (linear) input.

In the paper's notation:

\[ \mu_i^l = \mathbb{E}_{x \sim P(x)}\big[a_i^l\big], \qquad \sigma_i^l = \sqrt{\mathbb{E}_{x \sim P(x)}\big[(a_i^l - \mu_i^l)^2\big]}, \qquad \bar a_i^l = \frac{g_i^l}{\sigma_i^l}\big(a_i^l - \mu_i^l\big) \]

where \(\mu_i^l\) is the mean, \(\sigma_i^l\) the standard deviation, \(\bar a_i^l\) the normalized summed input, and \(g_i^l\) a learned gain, i.e. the scale parameter.

An open question: why is BN applied before the activation function rather than after it?

The figure above shows a single sample, but all samples share the same layer-to-layer parameters, and samples differ from one another. So the normalization is carried out along one feature dimension at a time, across the batch (each neuron in a layer can be regarded as one feature dimension).

batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.
BN is not used with RNNs because the sentences in a batch have different lengths. If we treat each time-step as one feature dimension and normalize over the batch at that time-step, as BN would, it breaks down: at the last time-step of the longest sequence in the batch, the "mean" is just that single value itself, which is exactly the degenerate case of running BN on one sample.

In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.
So in this paper the authors propose layer normalization, where the mean and variance are computed on a single training case. How exactly is that done?

Layer Normalization

Layer normalization does not compute the mean and variance across samples; it computes them across the hidden units of a layer:

\[ \mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l, \qquad \sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\big(a_i^l - \mu^l\big)^2} \]

where H is the number of hidden units in the layer. All hidden units of a layer share the same \(\mu^l\) and \(\sigma^l\), but every training case gets its own normalization terms.

Differences between BN and LN:

Layer normalization takes the mean and variance over a single sample, so it behaves identically at training and test time; no running statistics are needed.

Moreover, although the two methods compute the mean and variance differently, they apply the learned beta and gamma in the same way: along the channels, i.e. the hidden_size dimension.
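A minimal sketch of the axis difference, assuming a 2-D input of shape [batch, hidden_size] (TF 1.x; variable names are mine):

import tensorflow as tf

x = tf.random_normal(shape=[60, 64])          # [batch, hidden_size]

# BN: one statistic per hidden unit, computed across the batch axis
bn_mean, bn_var = tf.nn.moments(x, axes=[0])  # shape [64]

# LN: one statistic per sample, computed across the hidden axis
ln_mean, ln_var = tf.nn.moments(x, axes=[1])  # shape [60]

with tf.Session() as sess:
    print(sess.run([tf.shape(bn_mean), tf.shape(ln_mean)]))  # [64] vs [60]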

Layer normalized recurrent neural networks

It is common among NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.
This part explains, from the angle of test sequences being longer than any training sequence, why BN does not fit RNNs. Every time-step of an RNN shares the same weight parameters.

In an RNN the summed inputs at time-step t are computed from the current input \(x^t\) and the previous hidden state \(h^{t-1}\):

\(a^t = W_{hh}h^{t-1} + W_{xh}x^t\)

Layer normalization re-centers and re-scales \(a^t\) before the nonlinearity:

\[ h^t = f\left[\frac{g}{\sigma^t} \odot \big(a^t - \mu^t\big) + b\right], \qquad \mu^t = \frac{1}{H}\sum_{i=1}^{H} a_i^t, \qquad \sigma^t = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\big(a_i^t - \mu^t\big)^2} \]

where b and g are the learnable bias and gain parameters and \(\odot\) denotes element-wise multiplication.

Applying layer normalization inside an LSTM:
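The paper's supplementary material writes the layer-normalized LSTM roughly as follows (LN(z; α, β) denotes layer normalization of z with gain α and bias β):

\[ \begin{pmatrix} f_t \\ i_t \\ o_t \\ g_t \end{pmatrix} = LN(W_h h_{t-1}; \alpha_1, \beta_1) + LN(W_x x_t; \alpha_2, \beta_2) + b \]

\[ c_t = \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t), \qquad h_t = \sigma(o_t) \odot \tanh\big(LN(c_t; \alpha_3, \beta_3)\big) \]

The TensorFlow cell quoted further below normalizes the four gate pre-activations separately rather than the two matrix products, but the idea is the same.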

TensorFlow implementation

Batch Normalization

tf.reset_default_graph()
from tensorflow.python.training.moving_averages import assign_moving_average

### batch normalization
def batch_norm(inputs, decay=0.9, is_training=True, epsilon=1e-6):
    """
    :param inputs: [batch, length, width, channels]
    :param decay: decay rate of the moving averages
    :param is_training: use batch statistics (and update the moving averages)
        when True, use the stored population statistics when False
    :param epsilon: small constant for numerical stability
    :return: the normalized tensor, same shape as `inputs`
    """
    # Population statistics, one value per channel, updated via moving averages.
    pop_mean = tf.Variable(tf.zeros(inputs.shape[-1]), trainable=False, name="pop_mean")
    pop_var = tf.Variable(tf.ones(inputs.shape[-1]), trainable=False, name="pop_variance")

    def update_mean_and_var():
        # Reduce over every axis except the channel axis.
        axes = list(range(inputs.shape.ndims - 1))
        batch_mean, batch_var = tf.nn.moments(inputs, axes=axes)
        moving_average_mean = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
        # equivalently: assign_moving_average(pop_mean, batch_mean, decay)
        moving_average_var = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
        # equivalently: assign_moving_average(pop_var, batch_var, decay)
        with tf.control_dependencies([moving_average_mean, moving_average_var]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    mean, variance = tf.cond(tf.equal(is_training, True), update_mean_and_var,
                             lambda: (pop_mean, pop_var))
    beta = tf.Variable(initial_value=tf.zeros(inputs.get_shape()[-1]), name="shift")
    gamma = tf.Variable(initial_value=tf.ones(inputs.get_shape()[-1]), name="scale")
    return tf.nn.batch_normalization(inputs, mean, variance, beta, gamma, epsilon)
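A quick usage sketch (my own example, not from the original post), assuming the function above has just been defined in a fresh graph:

# hypothetical input: a batch of 8 RGB images of size 32x32
images = tf.random_normal(shape=[8, 32, 32, 3])
normalized = batch_norm(images, decay=0.9, is_training=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(normalized).shape)  # (8, 32, 32, 3)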

Layer Normalization

import tensorflow as tf

batch = 60
hidden_size = 64
whh = tf.random_normal(shape=[batch, hidden_size], mean=5.0, stddev=10.0)

whh_norm = tf.contrib.layers.layer_norm(inputs=whh, center=True, scale=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(whh)
    print(whh_norm)
    # per-row means of the raw input: far from zero
    print(sess.run([tf.reduce_mean(whh[0]), tf.reduce_mean(whh[1])]))
    # per-row (per-sample) means after layer norm: approximately zero
    print(sess.run([tf.reduce_mean(whh_norm[0]), tf.reduce_mean(whh_norm[5]), tf.reduce_mean(whh_norm[59])]))
    # per-column (per-hidden-unit) means after layer norm: not zero
    print(sess.run([tf.reduce_mean(whh_norm[:, 0]), tf.reduce_mean(whh_norm[:, 1]), tf.reduce_mean(whh_norm[:, 63])]))
    print("\n")
    for var in tf.trainable_variables():
        print(var)
        print(sess.run(var))
Tensor("random_normal:0", shape=(60, 64), dtype=float32)
Tensor("LayerNorm/batchnorm/add_1:0", shape=(60, 64), dtype=float32)
[5.3812757, 4.607581]
[-1.4901161e-08, -2.9802322e-08, -3.7252903e-09]
[-0.22264712, 0.14112064, -0.07268284]


<tf.Variable 'LayerNorm/beta:0' shape=(64,) dtype=float32_ref>
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<tf.Variable 'LayerNorm/gamma:0' shape=(64,) dtype=float32_ref>
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

At first sight this looks odd: layer norm computes the mean and variance per training sample, so why do beta and gamma have shape [hidden_size] rather than [batch]? Reading the source code answers the question.

Here is the source rewritten more compactly:

import tensorflow as tf

def layer_norm_mine(inputs, epsilon=1e-12, center=True, scale=True):
    """
    inputs: [batch, sequence_len, hidden_size] or [batch, hidden_size]
    """
    inputs_shape = inputs.shape
    inputs_rank = inputs_shape.ndims
    params_shape = inputs_shape[-1:]
    beta, gamma = None, None
    if center:
        beta = tf.get_variable(
            name="beta",
            shape=params_shape,
            initializer=tf.zeros_initializer(),
            trainable=True
        )
    if scale:
        gamma = tf.get_variable(
            name="gamma",
            shape=params_shape,
            initializer=tf.ones_initializer(),
            trainable=True
        )
    # Statistics are computed per training case, over all axes except the batch axis.
    norm_axes = list(range(1, inputs_rank))
    mean, variance = tf.nn.moments(inputs, norm_axes, keep_dims=True)  # one value per sample
    inv = tf.rsqrt(variance + epsilon)
    if gamma is not None:
        inv *= gamma
    return inputs * inv + (beta - mean * inv if beta is not None else -mean * inv)

batch = 60
hidden_size = 64
whh = tf.random_normal(shape=[batch, hidden_size], mean=5.0, stddev=10.0)

whh_norm = layer_norm_mine(whh)

layer_norm_mine gives the same result as the library implementation. Note that when computing the mean and variance, tf.nn.moments is called with axes = list(range(1, inputs_rank)), i.e. every axis except the batch axis (the axes argument of tf.nn.moments lists the dimensions to reduce over). So the statistics really are one mean and one variance per sample, of shape [batch]. Only the learned beta and gamma are applied along the last dimension. In other words, the shift-and-scale step is identical to batch normalization's; what differs is the statistics used for normalization (per sample instead of per feature across the batch), and whether that suits images or text better is a separate question.
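A quick sanity check (my own snippet, assuming the definitions above are still in the current graph) comparing the hand-rolled version against tf.contrib.layers.layer_norm on the same input:

whh_norm_tf = tf.contrib.layers.layer_norm(inputs=whh, center=True, scale=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    mine, lib = sess.run([whh_norm, whh_norm_tf])
    # The two implementations should agree up to floating-point precision.
    print(abs(mine - lib).max())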

LayerNormBasicLSTMCell

class LayerNormBasicLSTMCell(rnn_cell_impl.RNNCell):
  """LSTM unit with layer normalization and recurrent dropout.

  This class adds layer normalization and recurrent dropout to a
  basic LSTM unit. Layer normalization implementation is based on:

    https://arxiv.org/abs/1607.06450.

  "Layer Normalization"
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

  and is applied before the internal nonlinearities.
  Recurrent dropout is base on:

    https://arxiv.org/abs/1603.05118

  "Recurrent Dropout without Memory Loss"
  Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth.
  """

  def __init__(self,
               num_units,
               forget_bias=1.0,
               input_size=None,
               activation=math_ops.tanh,
               layer_norm=True,
               norm_gain=1.0,
               norm_shift=0.0,
               dropout_keep_prob=1.0,
               dropout_prob_seed=None,
               reuse=None):
    """Initializes the basic LSTM cell.

    Args:
      num_units: int, The number of units in the LSTM cell.
      forget_bias: float, The bias added to forget gates (see above).
      input_size: Deprecated and unused.
      activation: Activation function of the inner states.
      layer_norm: If `True`, layer normalization will be applied.
      norm_gain: float, The layer normalization gain initial value. If
        `layer_norm` has been set to `False`, this argument will be ignored.
      norm_shift: float, The layer normalization shift initial value. If
        `layer_norm` has been set to `False`, this argument will be ignored.
      dropout_keep_prob: unit Tensor or float between 0 and 1 representing the
        recurrent dropout probability value. If float and 1.0, no dropout will
        be applied.
      dropout_prob_seed: (optional) integer, the randomness seed.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
    """
    super(LayerNormBasicLSTMCell, self).__init__(_reuse=reuse)

    if input_size is not None:
      logging.warn("%s: The input_size parameter is deprecated.", self)

    self._num_units = num_units
    self._activation = activation
    self._forget_bias = forget_bias
    self._keep_prob = dropout_keep_prob
    self._seed = dropout_prob_seed
    self._layer_norm = layer_norm
    self._norm_gain = norm_gain
    self._norm_shift = norm_shift
    self._reuse = reuse

  @property
  def state_size(self):
    return rnn_cell_impl.LSTMStateTuple(self._num_units, self._num_units)

  @property
  def output_size(self):
    return self._num_units

  def _norm(self, inp, scope, dtype=dtypes.float32):
    shape = inp.get_shape()[-1:]
    gamma_init = init_ops.constant_initializer(self._norm_gain)
    beta_init = init_ops.constant_initializer(self._norm_shift)
    with vs.variable_scope(scope):
      # Initialize beta and gamma for use by layer_norm.
      vs.get_variable("gamma", shape=shape, initializer=gamma_init, dtype=dtype)
      vs.get_variable("beta", shape=shape, initializer=beta_init, dtype=dtype)
    normalized = layers.layer_norm(inp, reuse=True, scope=scope)
    return normalized

  def _linear(self, args):
    out_size = 4 * self._num_units
    proj_size = args.get_shape()[-1]
    dtype = args.dtype
    weights = vs.get_variable("kernel", [proj_size, out_size], dtype=dtype)
    out = math_ops.matmul(args, weights)
    if not self._layer_norm:
      bias = vs.get_variable("bias", [out_size], dtype=dtype)
      out = nn_ops.bias_add(out, bias)
    return out

  def call(self, inputs, state):
    """LSTM cell with layer normalization and recurrent dropout."""
    c, h = state
    args = array_ops.concat([inputs, h], 1)
    concat = self._linear(args)
    dtype = args.dtype

    i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
    if self._layer_norm:
      i = self._norm(i, "input", dtype=dtype)
      j = self._norm(j, "transform", dtype=dtype)
      f = self._norm(f, "forget", dtype=dtype)
      o = self._norm(o, "output", dtype=dtype)

    g = self._activation(j)
    if (not isinstance(self._keep_prob, float)) or self._keep_prob < 1:
      g = nn_ops.dropout(g, self._keep_prob, seed=self._seed)

    new_c = (
        c * math_ops.sigmoid(f + self._forget_bias) + math_ops.sigmoid(i) * g)
    if self._layer_norm:
      new_c = self._norm(new_c, "state", dtype=dtype)
    new_h = self._activation(new_c) * math_ops.sigmoid(o)

    new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
    return new_h, new_state
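A minimal usage sketch of the cell (my own example, assuming TF 1.x with tf.contrib available): it plugs into dynamic_rnn like any other RNNCell.

import tensorflow as tf

batch, max_len, input_dim, num_units = 32, 20, 128, 64
inputs = tf.random_normal(shape=[batch, max_len, input_dim])

cell = tf.contrib.rnn.LayerNormBasicLSTMCell(num_units, layer_norm=True,
                                             dropout_keep_prob=0.9)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(tf.shape(outputs)))  # [32 20 64]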