Deep Learning - Batch Normalization

Paper Reading

Motivation

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift.
During training, the network's parameters keep changing, so the input distribution of every subsequent layer changes as well. Since learning must keep adapting each layer to its new input distribution, we are forced to use lower learning rates and careful initialization, and networks with saturating nonlinear activations become notoriously hard to train. The authors call this change of distribution internal covariate shift.

stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.
In deep learning SGD works remarkably well; it is simple and effective, but it is sensitive to the hyper-parameters, especially the learning rate and the initial parameter values. Training is further complicated because each layer's inputs are affected by the parameters of all preceding layers, so small parameter changes amplify as the network gets deeper.

The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer.
Because each layer has to continuously adapt to the distribution of its inputs during learning, a changing input distribution causes problems. Here the authors connect this to two existing concepts: covariate shift and domain adaptation.

Therefore, the input distribution properties that aid the network generalization – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of x to remain fixed over time.
The input-distribution properties that help the network generalize, for example the training and test data sharing the same distribution, also apply to training a sub-network; it is therefore advantageous for the distribution of x to remain fixed over time.

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the subnetwork, as well.
A fixed input distribution to a sub-network also has positive consequences for the layers outside that sub-network.

To summarize why BN is used:
During training, changes in the parameters of earlier layers keep shifting the input distribution of later layers. This forces a lower learning rate and careful parameter initialization, and makes models with saturating nonlinearities very hard to train (a nonlinearity is saturating when its range is bounded, i.e. the function value stays finite as its argument goes to infinity). Deep networks are tricky precisely because each layer's output is affected by all the layers before it, so even a small parameter change can have a large effect on the network. The authors call this phenomenon internal covariate shift and propose normalizing the inputs of each layer to remove it. In the paper they show that with BN one can use much higher learning rates and initialize parameters much more casually; BN also acts as a regularizer and can, to some extent, replace dropout.

Consider a layer with a sigmoid activation:
\(z=g(Wu+b)\)
where u is the input and g is the sigmoid \(g(x)=\dfrac{1}{1+\exp(-x)}\). As |x| increases, \(g'(x)\) tends to 0, which means that for all dimensions of \(x=Wu+b\) except those with small absolute values, the gradient flowing back to the input u vanishes; the unit has entered the saturated regime of the nonlinearity, and training slows down.
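To see how quickly the gradient collapses, note that \(g'(x)=g(x)\bigl(1-g(x)\bigr)\), so for example
\[g'(0)=0.25, \qquad g'(5)\approx 6.6\times 10^{-3}, \qquad g'(10)\approx 4.5\times 10^{-5}.\]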

In practice there are already several strategies for dealing with nonlinear saturation:
- Use ReLU activations
- Careful initialization (e.g. Xavier initialization)
- Train with a small learning rate

If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
If the distribution of the nonlinearity inputs stays stable, the optimizer is less likely to get stuck in the saturated regime and training accelerates.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs.
The authors call this change in the distributions of a network's internal nodes internal covariate shift, and propose Batch Normalization, which takes a step towards removing it by fixing the means and variances of layer inputs.

Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
Besides addressing internal covariate shift, BN reduces the dependence of the gradients on the parameter scale and the initial values. This lets us use much larger learning rates, regularizes the model (reducing the need for dropout), and makes it safe again to use saturating nonlinear activations.

Towards Reducing Internal Covariate Shift

Whitening

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.
Whitening the inputs, i.e. linearly transforming them to zero mean and unit variance and decorrelating them, has long been known to make network training converge faster.

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.
However, if the whitening steps are interleaved with gradient-descent optimization, a gradient step may update the parameters in a way that the normalization then has to undo, which weakens or even cancels the effect of the gradient step.

The authors give the following example.
Consider a layer that adds a learnable bias b to its input u and then normalizes by subtracting the mean: \(\hat x=x-E[x]\), where x=u+b. The forward pass is:
\(x=u+b \rightarrow \hat x = x-E[x] \rightarrow loss\)
Backpropagating to the parameter b (ignoring the dependence of E[x] on b):
\(\dfrac{\partial l}{\partial b}=\dfrac{\partial l}{\partial \hat x}\dfrac{\partial \hat x}{\partial b} = \dfrac{\partial l}{\partial \hat x}\)
Then \(\Delta b \propto -\dfrac{\partial l}{\partial \hat x}\), and b is updated as \(b \leftarrow b + \Delta b\).
After the update and the normalization, the layer computes:
\(u+(b+\Delta b)-E[u+(b+\Delta b)]=u+b-E[u+b]\)
So the layer output is unchanged, and so is \(\dfrac{\partial l}{\partial \hat x}\); as training proceeds, b grows without bound while the loss stays the same.
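A tiny numpy sketch of this effect (my own illustration, not code from the paper): the gradient step keeps changing b, but because the mean subtraction removes b again, the layer output and the loss never move.

```python
import numpy as np

# x = u + b, normalized by subtracting the batch mean, with a gradient step that
# ignores the dependence of E[x] on b (as in the example above).
rng = np.random.default_rng(0)
u = rng.normal(size=8)                 # a fixed "mini-batch" of inputs
b, lr = 0.0, 0.1

for step in range(3):
    x = u + b
    x_hat = x - x.mean()               # normalization: x_hat = u - E[u], independent of b
    grad_x_hat = np.ones_like(x_hat)   # stand-in for dl/dx_hat
    grad_b = grad_x_hat.sum()          # dl/db = dl/dx_hat when E[x] is treated as constant
    b -= lr * grad_b                   # b <- b + Δb
    print(step, b, x_hat[:3])          # |b| keeps growing; x_hat (and the loss) never changes
```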

This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
If the normalization not only centers (subtracts the mean) but also scales the activations, the problem can get worse. The authors observed this empirically in initial experiments: the model blew up when the normalization parameters were computed outside the gradient-descent step.

Whitening while making the optimization aware of the normalization

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ.
The problem above arises because the gradient-descent optimization does not take the normalization into account (which is hard to do). To fix this, the authors want to ensure that, for any parameter values, the network always produces activations with the desired distribution; the gradient of the loss with respect to the model parameters can then account for the normalization and for its dependence on the parameters \(\Theta\).

Again let x be the input to a layer, treated as a vector, and let \(\chi\) be the whole training set; the normalization is \(\hat x = Norm(x, \chi)\).

The normalization now depends not only on the current input x but on the entire training set \(\chi\); when x is produced by another layer, \(\chi\) depends on the earlier layers' parameters \(\theta\), so backpropagation has to compute both \[\frac{\partial{Norm(x,\chi)}}{\partial{x}}\text{ and }\frac{\partial{Norm(x,\chi)}}{\partial{\chi}}\]

Ignoring the second term leads to the problem described above. But whitening directly within this framework is very expensive: it requires computing the covariance matrix (and its inverse square root) to normalize the activations, as well as the corresponding derivatives for backpropagation. This motivates looking for an alternative that achieves a similar effect without having to analyse the whole training set after every parameter update.

Normalization via Mini-Batch Statistics

Two simplifications compared with full whitening

Since the full whitening of each layer’s inputs is costly, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance.
Since full whitening is so costly, the authors make two necessary simplifications. First, instead of whitening the features jointly, each dimension of the input \(x=(x^{(1)},...,x^{(d)})\) is normalized independently to zero mean and unit variance:

\[\hat x^{(k)} = \dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt {Var[x^{(k)}]}}\]

where the expectation and variance are computed over the training data set.
where the expectation and variance are computed over the whole training set.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
However, simply normalizing each input of a layer may change what the layer can represent; for example, normalizing the inputs of a sigmoid constrains them to the linear regime of the nonlinearity.

To address this, the authors introduce a pair of parameters \(\gamma^{(k)}\), \(\beta^{(k)}\) that scale and shift the normalized value:
\[y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}\] \(\gamma^{(k)}\) and \(\beta^{(k)}\) are learned along with the other model parameters and restore the representation power lost by normalization. In particular, setting \(\gamma^{(k)}=\sqrt {Var[x^{(k)}]}\) and \(\beta^{(k)}=E[x^{(k)}]\) recovers the original activations, if that were the optimal thing to do.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.
Using the whole training set to compute the mean and variance at every step is impractical with stochastic optimization, so the second simplification is to estimate them from each mini-batch.

Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.
Note that mini-batches are usable because the variance is computed per dimension rather than as a joint covariance. In the joint case, regularization would be required: the mini-batch size is usually smaller than the number of activations being whitened, so the estimated covariance matrix (whose rank is at most the mini-batch size) would be singular.

The core BN transform

For a mini-batch of size m, consider one feature dimension \(x^{(k)}\) (k indexes the feature). Over the mini-batch this dimension takes m values \[B=\{x_{1,...,m}\}\] and the BN transform maps them to \[BN_{\gamma, \beta}:x_{1,..,m}\rightarrow y_{1,..,m}\]

  • For one dimension of the mini-batch input, compute the mean and variance
  • Normalize (an epsilon in the denominator avoids division by zero)
  • Scale and shift the result with the two learnable parameters

One might wonder why the scale-and-shift step is needed at all when the previous step has already normalized the values; the authors actually answered this earlier. \(\hat x\) is the normalized value, but if it were used directly as the output, the output would be pinned to a standard (zero-mean, unit-variance) distribution, which could limit what the original network can express; the sigmoid example above makes this concrete. Because \(\gamma\) and \(\beta\) are learnable, the network itself decides during training how much of the original distribution to restore.
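A minimal numpy sketch of this forward pass (my own illustration of Algorithm 1; gamma, beta and eps stand for \(\gamma\), \(\beta\) and the stabilizing \(\epsilon\)):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable scale and shift."""
    mu = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                    # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize; eps avoids division by zero
    y = gamma * x_hat + beta               # scale and shift
    return y, x_hat, mu, var
```

Setting gamma to \(\sqrt{Var[x]}\) and beta to \(E[x]\) would recover the original activations, which is exactly the identity-recovery property mentioned earlier.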

Using the chain rule, we can differentiate the loss with respect to the parameters \(\gamma\) and \(\beta\).
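As given in the paper, the gradients for the learnable parameters are
\[\frac{\partial l}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial l}{\partial y_i}\,\hat x_i, \qquad \frac{\partial l}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial l}{\partial y_i}.\]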

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.
The BN transform is differentiable, so the model remains trainable end to end; as training proceeds, layers keep learning on input distributions that exhibit less internal covariate shift, which accelerates training.

Training and Inference with Batch-Normalized Networks

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization
\(\hat x = \dfrac{x-E[x]}{\sqrt{Var[x]+\epsilon}}\)
using the population, rather than mini-batch, statistics.
Training and inference differ here. Inference simply means the test phase: at test time the normalization uses the population statistics rather than the mini-batch statistics.

Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
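A sketch of how this is commonly implemented (my own illustration, not the paper's pseudocode): exponential moving averages of the mini-batch statistics approximate the population statistics during training and are frozen at inference, where BN becomes a fixed linear transform per activation.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, d, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)          # learnable scale/shift
        self.running_mean, self.running_var = np.zeros(d), np.ones(d)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # moving averages track the population statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # inference: fixed population estimates, so BN is just a linear transform
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

bn = BatchNorm1D(d=4)
_ = bn(np.random.randn(32, 4), training=True)    # training step updates the running stats
y = bn(np.random.randn(8, 4), training=False)    # inference uses the frozen statistics
```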

Batch-Normalized Convolutional Networks

  • Steps 1-5 apply Algorithm 1 to each dimension, producing the batch-normalized training network \(N_{BN}^{tr}\)
  • Steps 6-7 optimize the trainable parameters \(\theta \cup \{\gamma^{(k)}, \beta^{(k)}\}\); at test time these parameters are fixed
  • Steps 8-12 convert the statistics gathered during training into statistics over the whole training set, because at inference the model uses the stored population statistics. This relies on unbiased estimation of the population mean and variance from samples: the sample mean is already an unbiased estimator of the population mean, but the mini-batch variance needs a correction factor to give an unbiased estimate of the population variance (see the formulas below; a detailed discussion is at https://www.zhihu.com/question/20099757)
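In the paper's Algorithm 2 this conversion is
\[E[x] \leftarrow E_B[\mu_B], \qquad Var[x] \leftarrow \frac{m}{m-1}\,E_B[\sigma_B^2],\]
where the factor \(\frac{m}{m-1}\) turns the average of the (biased) mini-batch variances into an unbiased estimate of the population variance.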

Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.
In traditional deep networks, too high a learning rate can make the gradients explode or vanish, and can also leave the model stuck in poor local minima.

By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network.
By normalizing activations throughout the network, BN prevents small changes in layer parameters from amplifying as data propagates through the deep network.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.
BN also makes training more resilient to the parameter scale. Normally, a large learning rate can increase the scale of the layer parameters, which amplifies the gradients during backpropagation and can blow the model up; with BN, backpropagation through a layer is unaffected by the scale of its parameters.
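The paper makes this concrete with a scalar scale \(a\): because \(BN(Wu) = BN((aW)u)\),
\[\frac{\partial\, BN((aW)u)}{\partial u} = \frac{\partial\, BN(Wu)}{\partial u}, \qquad \frac{\partial\, BN((aW)u)}{\partial (aW)} = \frac{1}{a}\cdot\frac{\partial\, BN(Wu)}{\partial W},\]
so scaling the weights up does not change the gradient with respect to the input, and it actually shrinks the weight gradient, stabilizing parameter growth.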

Regularization

Besides speeding up training, BN also regularizes the model. When training a batch-normalized network, a given example is "seen" together with the other examples in its mini-batch, so the network no longer produces a deterministic output for that example (the normalization depends on the rest of the batch). The authors show experimentally that this helps reduce overfitting.

Implementation

TensorFlow already provides a BN layer that can be called directly via tf.contrib.layers.batch_norm(). If you want to see how the function works under the hood and deepen your understanding of the BN layer, see the article Implementing Batch Normalization in Tensorflow.
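A minimal usage sketch, assuming the TensorFlow 1.x-era contrib API (tf.contrib was removed in TensorFlow 2.x, and the arguments shown here are the commonly used ones, not taken from the article above):

```python
import tensorflow as tf  # TensorFlow 1.x

x = tf.placeholder(tf.float32, [None, 64])
is_training = tf.placeholder(tf.bool)

# `is_training` switches between mini-batch statistics (training) and the stored
# moving averages (inference); `updates_collections=None` applies the moving-average
# updates in place instead of deferring them to a graph collection.
h = tf.contrib.layers.batch_norm(x, is_training=is_training,
                                 updates_collections=None)
```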

reference: