# Paper notes: batch, layer, and weight normalization

paper:

## Layer Normalization

### Motivation

batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.

In other words, batch normalization normalizes the linear (pre-activation) output of each neuron.
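The training-time transform described above can be sketched in NumPy (`batch_norm` is a hypothetical name; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Batch-normalize summed inputs a of shape [batch, hidden].

    Mean and variance are estimated over the batch axis,
    giving one pair of statistics per neuron.
    """
    mu = a.mean(axis=0)             # shape [hidden]
    var = a.var(axis=0)             # shape [hidden]
    a_hat = (a - mu) / np.sqrt(var + eps)
    return gamma * a_hat + beta     # learned per-neuron gain and bias
```

Each neuron's output is thus standardized across the mini-batch before the learned gain `gamma` and bias `beta` restore representational freedom.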

batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.

BN is not used with RNNs because the sentence lengths within a batch are inconsistent. If we view each time step as one feature dimension and normalize over the batch at that dimension, as BN does, it clearly breaks down for RNNs: at the last time step of the longest sequence in the batch, only that one sequence contributes, so the "batch mean" is just the value itself, which amounts to running BN on a single sample.

In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.

### Layer Normalization

Layer normalization does not compute the mean and variance over the samples in a batch; it computes them over the hidden units of each individual sample.

Differences between BN and LN:

Layer normalization computes the mean and variance on a single sample, so its behavior is identical in the training and test phases.
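The difference between the two comes down to which axis the statistics are taken over; a minimal NumPy comparison (shapes chosen for illustration):

```python
import numpy as np

a = np.random.randn(32, 64)   # [batch, hidden] summed inputs

# BN: one mean/variance per hidden unit, estimated across the batch
bn_mu, bn_var = a.mean(axis=0), a.var(axis=0)   # shapes [64]

# LN: one mean/variance per training case, over its hidden units
ln_mu, ln_var = a.mean(axis=1), a.var(axis=1)   # shapes [32]
```

Because LN's statistics come from a single case, nothing about them changes between training and testing, which is the consistency noted above.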

### Layer normalized recurrent neural networks

It is common in NLP tasks for different training cases to have different sentence lengths. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.

$a^t=W_{hh}h^{t-1}+W_{xh}x^t$
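At each step, the paper normalizes $a^t$ using statistics over its $H$ hidden units, with a shared gain $g$ and bias $b$:

$$\mu^t=\frac{1}{H}\sum_{i=1}^{H}a_i^t,\qquad \sigma^t=\sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^t-\mu^t\right)^2},\qquad h^t=f\!\left[\frac{g}{\sigma^t}\odot\left(a^t-\mu^t\right)+b\right]$$

where $f$ is the element-wise nonlinearity and $\odot$ denotes element-wise multiplication.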

Using layer normalization in an LSTM:
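A sketch of one layer-normalized LSTM step in NumPy: LN is applied separately to the input-to-hidden and hidden-to-hidden projections and to the cell state, following the paper's recipe. The function names, parameter layout, and gate ordering here are my own assumptions for illustration:

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    # Normalize over the hidden units of each single case.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return g * (x - mu) / (sigma + eps) + b

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ln_lstm_step(x_t, h_prev, c_prev, Wx, Wh, b, params):
    """One LSTM step with LN on both input streams and the cell state.

    params holds (gain, bias) pairs: "x" and "h" over the 4H gate
    pre-activations, "c" over the H-dimensional cell state.
    """
    a = (layer_norm(x_t @ Wx, *params["x"]) +
         layer_norm(h_prev @ Wh, *params["h"]) + b)   # [batch, 4H]
    f, i, o, g_ = np.split(a, 4, axis=-1)             # assumed gate order
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g_)
    h_t = sigmoid(o) * np.tanh(layer_norm(c_t, *params["c"]))
    return h_t, c_t
```

Note that, as in the paper, the bias `b` is added after normalization rather than being absorbed into the projections, since LN would cancel any constant shift.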

## tensorflow 实现

### layer normalization

My `layer_norm_mine` produces the same result as TensorFlow's source implementation. Notice that when computing the mean and variance, `tf.nn.moments` is called with `axes` covering every dimension except the batch axis (the `axes` argument of `tf.nn.moments` specifies the dimensions over which the mean and variance are taken), so the resulting statistics indeed have shape `[batch,]`. However, when rescaling to the beta and gamma distribution, the gain and bias are still applied along the last dimension only. Interestingly, this means the final effect should be consistent with batch normalization; whether it suits the characteristics of images or text is another matter.
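The behavior described above can be reproduced in plain NumPy. This is a sketch of my reading of `tf.contrib.layers.layer_norm` (statistics over all non-batch axes, gain/bias defined over the last axis only), not the actual TensorFlow source:

```python
import numpy as np

def layer_norm_mine(x, gamma, beta, eps=1e-12):
    """LN sketch: statistics over all non-batch axes,
    gamma/beta broadcast over the last axis only."""
    axes = tuple(range(1, x.ndim))            # everything except batch
    mu = x.mean(axis=axes, keepdims=True)     # effectively shape [batch]
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

For a rank-3 input of shape `[batch, time, hidden]`, each case is standardized over its whole `time × hidden` slab, while `gamma` and `beta` of shape `[hidden]` rescale along the last axis.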

Xie Pan

2018-10-01

2021-06-29