## Motivation

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift.

While stochastic gradient descent is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values. Training is further complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer.

Therefore, the input distribution properties that aid the network generalization – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such, it is advantageous for the distribution of $x$ to remain fixed over time.

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the subnetwork, as well.

Consider a layer with a saturating nonlinearity, $z=g(Wu+b)$, where $u$ is the layer input, $W$ and $b$ are learned layer parameters, and $g$ is, for example, the sigmoid. As the inputs to $g$ drift into its saturated regime, its gradient vanishes and training slows down. In practice this is usually addressed by:

- using ReLU, which does not saturate for positive inputs
- careful initialization, e.g. Xavier initialization
- training with a small learning rate

If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
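
As a quick numerical illustration of the saturated regime (my own sanity check, not from the paper): the sigmoid's derivative $g'(x)=g(x)(1-g(x))$ is nearly zero once $|x|$ is large, so units whose inputs drift there pass almost no gradient back.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 2.0, 5.0, 10.0])
grad = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid derivative
print(grad)  # ~[0.25, 0.105, 0.0066, 4.5e-05]: saturated units pass almost no gradient
```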

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs.

Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
In short: besides addressing internal covariate shift, BN reduces the dependence of gradients on the parameter scale and initialization. This allows larger learning rates, regularizes the model (reducing the need for Dropout), and lets the network use saturating nonlinearities safely.

## Towards Reducing Internal Covariate Shift

### Whitening

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

For example, consider a layer that adds a learned bias $b$ to its input $u$ and then centers the result over the training set:

$x=u+b, \qquad \hat x = x-E[x]$

If the gradient descent step ignores the dependence of $E[x]$ on $b$, the gradient with respect to the bias is

$\dfrac{\partial l}{\partial b}=\dfrac{\partial l}{\partial \hat x}\dfrac{\partial \hat x}{\partial b} = \dfrac{\partial l}{\partial \hat x}$

and after the update $b \leftarrow b+\Delta b$ the layer output is unchanged, because the subsequent re-centering cancels the update:

$u+(b+\Delta b)-E[u+(b+\Delta b)]=u+b-E[u+b]$

So the output and the loss stay the same while $b$ can grow without bound.

This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
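
A tiny numerical sketch of this failure mode (my own illustration, not the paper's code): the bias is updated on every step, but because the centering is recomputed outside the gradient step, the layer output and the loss never change while $b$ keeps growing.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])  # fixed layer inputs (the "training set")
b = 0.0                        # learned bias
lr = 0.1

for step in range(5):
    x = u + b
    x_hat = x - x.mean()                      # centering done outside the gradient step
    loss = 0.5 * np.sum((x_hat - 1.0) ** 2)   # some loss on the normalized output
    grad_b = np.sum(x_hat - 1.0)              # dl/db, ignoring that the mean depends on b
    b -= lr * grad_b
    print(step, round(b, 3), round(loss, 3))  # b keeps changing, loss stays at 2.5
```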

### Whitening while accounting for the normalization during optimization

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters $\Theta$.

## Normalization via Mini-Batch Statistics

### Two simplifications compared with full whitening

Since the full whitening of each layer’s inputs is costly, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance.

$\hat x^{(k)} = \dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt {Var[x^{(k)}]}}$

where the expectation and variance are computed over the training data set.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}$

where $\gamma^{(k)}$ and $\beta^{(k)}$ are learnable parameters that restore the representation power lost by normalization. In particular, setting $\gamma^{(k)}=\sqrt {Var[x^{(k)}]}$ and $\beta^{(k)}=E[x^{(k)}]$ would recover the original activations, if that were the optimal thing to do.
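
Substituting these values back into the normalization makes the recovery explicit:

$y^{(k)} = \sqrt{Var[x^{(k)}]}\cdot\dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}} + E[x^{(k)}] = x^{(k)}$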

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.

Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.
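
A quick NumPy check of why joint whitening breaks down on small mini-batches (my own illustration): with $m$ examples and $d>m$ activations, the mini-batch covariance matrix has rank at most $m-1$, so it is singular and cannot be inverted (or inverse-square-rooted) without regularization.

```python
import numpy as np

m, d = 4, 10                       # mini-batch smaller than the number of activations
X = np.random.randn(m, d)          # one mini-batch of activations
cov = np.cov(X, rowvar=False)      # d x d mini-batch covariance
print(np.linalg.matrix_rank(cov))  # at most m - 1 = 3, so cov is singular for d = 10
```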

### The core BN procedure

With a mini-batch of size m, consider one activation dimension $x^{(k)}$ (the $k$-th feature). The mini-batch contains m values of this feature, $B=\{x_{1,...,m}\}$, which the BN transform maps to $BN_{\gamma, \beta}:x_{1,..,m}\rightarrow y_{1,..,m}$:

- For the chosen dimension of the mini-batch, compute the mini-batch mean and variance.
- Normalize the values, adding a small $\epsilon$ to the variance to avoid division by zero.
- Scale and shift with the two learned parameters $\gamma$ and $\beta$ (see the sketch below).
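
A minimal NumPy sketch of this transform (Algorithm 1 in the paper), applied independently to each feature of a mini-batch; the function and variable names are my own:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learned scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance (biased, as in the paper)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize; eps guards against division by zero
    y = gamma * x_hat + beta                 # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # kept for the backward pass below
    return y, cache
```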

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.
Because BN is differentiable, the model stays trainable by backpropagation, and the layers keep learning on input distributions that exhibit less internal covariate shift, which speeds up training.
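
Since the transform is differentiable, the paper derives the gradients through it with the chain rule. A NumPy sketch of that backward pass, paired with the forward sketch above (again, names are mine):

```python
import numpy as np

def batch_norm_backward(dy, cache):
    """dy: (m, d) upstream gradient dl/dy; returns dl/dx, dl/dgamma, dl/dbeta."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dgamma = np.sum(dy * x_hat, axis=0)   # dl/dgamma
    dbeta = np.sum(dy, axis=0)            # dl/dbeta

    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * std_inv**3
    dmu = np.sum(-dx_hat * std_inv, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```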

### Training and Inference with Batch-Normalized Networks

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization
$\hat x = \dfrac{x-E[x]}{\sqrt{Var[x]+\epsilon}}$
using the population, rather than mini-batch, statistics.

Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
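
The paper's Algorithm 2 computes population statistics after training; keeping exponential moving averages during training is the common implementation convention hinted at here. A minimal layer sketch combining both modes (the `momentum` value is an assumption of mine):

```python
import numpy as np

class BatchNorm1D:
    """Sketch of a BN layer: mini-batch statistics in training, fixed statistics at inference."""

    def __init__(self, d, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)
        self.running_mean, self.running_var = np.zeros(d), np.ones(d)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # statistics are fixed, so this is just a per-feature linear transform
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta
```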

### Batch-Normalized Convolutional Networks

- Steps 1-5 follow Algorithm 1: each activation is normalized, giving the training network $N_{BN}^{tr}$.
- Steps 6-7 optimize the trainable parameters $\Theta \cup \{\gamma^{(k)}, \beta^{(k)}\}$; at test time these parameters are fixed.
- Steps 8-12 replace the mini-batch statistics with statistics of the whole training set, which are what the stored model uses at inference. This requires unbiased estimates of the population statistics: the sample mean is already an unbiased estimator of the population mean, but the biased sample variance needs the correction $Var[x] = \frac{m}{m-1}E_{B}[\sigma_{B}^2]$. See this Zhihu answer for details: https://www.zhihu.com/question/20099757
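
As for the convolutional networks this subsection's title refers to, the paper normalizes each feature map jointly over the mini-batch and all spatial locations, learning one $\gamma$, $\beta$ pair per channel rather than per activation. A minimal NumPy sketch assuming NCHW layout (names are mine):

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W) feature maps; gamma, beta: (C,) one pair per channel."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # statistics over batch and spatial dimensions
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
```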

### Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.

By normalizing activations throughout the network, Batch Normalization prevents small changes in layer parameters from amplifying as the data propagates through a deep network.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.
In other words, BN makes training more resilient to the scale of the parameters. Normally, an overly large learning rate can inflate the layer parameters, which amplifies the gradients during backpropagation and can make the model explode; with BN, backpropagation through a layer is unaffected by the scale of its parameters.
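
The paper makes this precise with the identity $BN(Wu)=BN((aW)u)$ for a scalar $a$: rescaling $W$ rescales the pre-activations, their mean, and their standard deviation by the same factor, so the normalized output is unchanged. A quick numerical check (my own illustration):

```python
import numpy as np

def bn(x, eps=1e-5):
    # plain per-unit normalization over the mini-batch (gamma=1, beta=0)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

U = np.random.randn(64, 8)   # mini-batch of layer inputs
W = np.random.randn(8, 4)    # layer weights
a = 10.0                     # arbitrary rescaling of the weights
print(np.allclose(bn(U @ W), bn(U @ (a * W)), atol=1e-4))  # True: BN(Wu) == BN((aW)u)
```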

## Code Implementation

TensorFlow already ships a BN layer that can be called directly via tf.contrib.layers.batch_norm(). If you want to see how the function is implemented under the hood and deepen your understanding of the BN layer, see the article Implementing Batch Normalization in Tensorflow.
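
A minimal usage sketch of that TF 1.x-era API (hedged: defaults vary across versions, and tf.contrib no longer exists in TF 2.x); the classic gotchas are the `is_training` flag and the moving-average update ops:

```python
import tensorflow as tf  # TF 1.x graph API

x = tf.placeholder(tf.float32, [None, 128])   # layer inputs
is_training = tf.placeholder(tf.bool, [])     # mini-batch statistics vs. moving statistics

y = tf.contrib.layers.batch_norm(x, is_training=is_training, scale=True)

# The moving-average updates are collected in tf.GraphKeys.UPDATE_OPS by default,
# so a real training loop must run them alongside the train op,
# e.g. inside tf.control_dependencies(update_ops).
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
```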