Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that
activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout.
Deep neural networks usually work well when they are over-parameterized and trained with plenty of noise and regularization, such as weight decay and dropout. But while dropout is a very effective regularization technique for fully connected networks, it does little for convolutional networks. This is probably because the activations in a convolutional network are spatially correlated: even if some units are dropped, the information still flows through to the next layer.
Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and is more robust to hyperparameter choices.
For convolutional networks the authors therefore propose a dedicated regularizer, DropBlock, which drops a contiguous spatial region at once. They find that applying DropBlock to ResNet effectively improves accuracy, and that gradually increasing the drop probability during training makes the result more robust to the hyperparameter choice.
In this paper, we introduce DropBlock, a structured form of dropout, that is particularly effective to regularize convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a feature map, are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data (see Figure 1).
The algorithm itself is simple; the main thing to pay attention to is the setting of two parameters: block_size and $\gamma$.
block_size is the size of the block to be dropped
$\gamma$ controls how many activation units to drop.
We experimented with a shared DropBlock mask across different feature channels versus a separate DropBlock mask for each feature channel. Algorithm 1 corresponds to the latter, which tends to work better in our experiments.
Similar to dropout we do not apply DropBlock during inference. This is interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks include a special subset of sub-networks covered by dropout where each network does not see contiguous parts of feature maps.
At inference time, DropBlock is handled in the same way as dropout.
block_size:
In our implementation, we set a constant block_size for all feature maps, regardless of the resolution of the feature map. DropBlock resembles dropout [1] when block_size = 1 and resembles SpatialDropout [20] when block_size covers the full feature map.
In practice, we do not explicitly set $\gamma$. As stated earlier, $\gamma$ controls the number of features to drop. Suppose that we want to keep every activation unit with probability keep_prob; in dropout [1] the binary mask would be sampled from a Bernoulli distribution with mean 1 − keep_prob. However, to account for the fact that every zero entry in the mask will be expanded by block_size$^2$ and that the blocks must be fully contained in the feature map, we need to adjust $\gamma$ accordingly when we sample the initial binary mask. In our implementation, $\gamma$ can be computed as
$$\gamma=\dfrac{1-\text{keep\_prob}}{\text{block\_size}^2}\cdot\dfrac{\text{feat\_size}^2}{(\text{feat\_size}-\text{block\_size}+1)^2}$$
where feat_size is the size of the feature map.
The main nuance of DropBlock is that some of the dropped blocks will overlap, so the above equation is only an approximation.
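As a rough illustration of how the mask could be sampled (a minimal single-channel NumPy sketch, not the paper's Algorithm 1 verbatim; here blocks are anchored at the sampled positions rather than centered on them):

```python
import numpy as np

def dropblock_mask(feat_size, block_size, keep_prob):
    # Approximate gamma so that roughly (1 - keep_prob) of the units end up dropped.
    gamma = (1 - keep_prob) / block_size**2 * feat_size**2 / (feat_size - block_size + 1)**2
    # Sample block anchors only where a full block fits inside the feature map.
    valid = feat_size - block_size + 1
    anchors = np.random.rand(valid, valid) < gamma
    mask = np.ones((feat_size, feat_size))
    for i, j in zip(*np.nonzero(anchors)):
        mask[i:i + block_size, j:j + block_size] = 0.0  # zero out the whole block
    # Rescale the kept activations, as in the paper: A * count(M) / count_ones(M).
    return mask * mask.size / np.maximum(mask.sum(), 1.0)
```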
Scheduled DropBlock:
We found that DropBlock with a fixed keep_prob during training does not work well. Applying a small value of keep_prob from the beginning hurts learning. Instead, gradually decreasing keep_prob over time from 1 to the target value is more robust and adds improvement for most values of keep_prob.
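A schedule of this kind might look like the following sketch (the function name and the linear interpolation are illustrative assumptions, not taken from the paper):

```python
def scheduled_keep_prob(step, total_steps, target_keep_prob=0.9):
    # Linearly decrease keep_prob from 1.0 at step 0 to the target value at the end of training.
    fraction = min(step / float(total_steps), 1.0)
    return 1.0 - fraction * (1.0 - target_keep_prob)
```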
In the following experiments, we study where to apply DropBlock in residual networks. We experimented with applying DropBlock only after convolution layers or applying DropBlock after both convolution layers and skip connections. To study the performance of DropBlock applying to different feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3 and 4.
for example in data:  # one parameter update per training example (SGD)
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn
online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.
Batch gradient descent is computationally redundant: before every parameter update it recomputes gradients for many similar samples. Stochastic gradient descent instead computes one sample per parameter update, so updates are much faster and online learning is possible. However, the gradients used for the updates have high variance, so the loss curve fluctuates heavily.
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately
complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost
certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples. This way, it reduces the variance of the parameter updates, which can lead to more stable convergence;
and it can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which makes computing the gradient of a mini-batch very efficient.
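In the same pseudocode style as the SGD snippet above, a mini-batch update loop might look like this (get_batches and the batch size of 50 are placeholders, not from the original text):

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):  # one update per mini-batch
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```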
Challenges
Choosing a proper learning rate.
Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics.
Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
For the optimization of non-convex loss functions, we need to avoid getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the real difficulty comes not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. For more on saddle points and how SGD escapes them, see the Zhihu post: 如何逃离鞍点.
Gradient descent optimization algorithms
Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.
Momentum
paper: [Neural networks: the official journal of the International Neural Network Society]()
The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.
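For reference, the momentum update in this notation (momentum term $\gamma$, learning rate $\eta$, objective $J(\theta)$) is:

$$v_t=\gamma v_{t-1}+\eta\nabla_\theta J(\theta)$$
$$\theta=\theta-v_t$$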
paper: [Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$.]()
We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.
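The corresponding update evaluates the gradient at the approximate future position $\theta-\gamma v_{t-1}$ instead of at $\theta$:

$$v_t=\gamma v_{t-1}+\eta\nabla_\theta J(\theta-\gamma v_{t-1})$$
$$\theta=\theta-v_t$$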
paper: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization]()
Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.
Adagrad adapts the gradient step to each parameter: low-frequency parameters or features receive larger updates, high-frequency ones receive smaller updates. For example, when training GloVe word vectors, a low-frequency word may not take part in the loss at a given iteration, so it is updated relatively rarely, and its updates therefore need to be made correspondingly larger.
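Its per-parameter update divides the learning rate by the root of the accumulated squared gradients (here $G_{t,ii}=\sum_{\tau\le t} g_{\tau,i}^2$ and $\epsilon$ is a smoothing term):

$$\theta_{t+1,i}=\theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\,g_{t,i}$$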
In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:
similar to momentum:
$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$$
similar to Adagrad/RMSprop:
$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$
$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:
$$\hat m_t=\dfrac{m_t}{1-\beta^t_1}$$
$$\hat v_t=\dfrac{v_t}{1-\beta^t_2}$$
They then use these to update the parameters, just as we have seen in Adadelta and RMSprop.
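For reference, the resulting Adam update rule (with a small constant $\epsilon$ for numerical stability) is:

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+\epsilon}\,\hat m_t$$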
Norms for large p values generally become numerically unstable, which is why $l_1$ and $l_2$ norms are most common in practice. However, $l_{\infty}$ also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that $v_t$ with $l_{\infty}$ converges to the following more stable value. To avoid confusion with Adam, we use $u_t$ to denote the infinity norm-constrained $v_t$:
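From the AdaMax derivation, the $\ell_\infty$-based accumulator and the corresponding parameter update are:

$$u_t=\max(\beta_2\, v_{t-1}, |g_t|)$$
$$\theta_{t+1}=\theta_t-\dfrac{\eta}{u_t}\hat m_t$$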
Note that as $u_t$ relies on the max operation, it is not as suggestible to bias towards zero as $m_t$ and $v_t$ in Adam, which is why we do not need to compute a bias correction for $u_t$. Good default values are again:
$$\eta = 0.002, \beta_1 = 0.9, \beta_2 = 0.999.$$
Visualization of algorithms
We see the paths the optimizers took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness by looking ahead and headed to the minimum.
If the objective is a function of the Beale type, the adaptive algorithms converge toward the minimum much more directly, whereas Momentum and NAG are led off track, like a ball rolling down a hill that cannot brake in time. NAG, however, has some foresight about where it is heading, so it can correct its course earlier.
shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope, while the other dimension has a negative slope, which poses a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.
This shows how the optimizers behave at a saddle point: SGD, Momentum, and NAG struggle to break the symmetry, while the adaptive algorithms Adagrad, RMSprop, and Adadelta quickly escape the saddle point.
batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.
About batch normalization.
Here is a figure taken from Ng's course; the fully connected case is easier to understand than the convolutional one, but the form is the same.
With m samples in a batch, the output of layer l after the activation function is the input to layer l+1; the value of its i-th neuron is:
Linear output: $z_i^l={w_i^l}^Th^l$.
Nonlinear output: $h_i^{l+1} = a_i^l=f(z_i^l+b_i^l)$
where f is the nonlinear activation function and $a_i^l$ is part of the summed inputs to the next layer. If the distribution of $a_i^l$ changes in a highly correlated way, the gradient of the next layer's weights $w^{l+1}$ changes drastically as well (in backpropagation, the gradient of $w^{l+1}$ involves $a_i^l$).
batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.
In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.
So in this paper the authors propose Layer Normalization, which computes the mean and variance for normalization on a single training case. How exactly is this done?
Layer Normalization
Layer normalization does not compute the mean and variance over the samples in a batch; it computes them over the hidden units of a layer.
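Concretely, the layer normalization statistics are computed over the $H$ hidden units of a layer on a single training case:

$$\mu^l=\dfrac{1}{H}\sum_{i=1}^H a_i^l,\qquad \sigma^l=\sqrt{\dfrac{1}{H}\sum_{i=1}^H \left(a_i^l-\mu^l\right)^2}$$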
In NLP tasks it is common to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.
This part also explains why BN does not suit RNNs, from the angle of test sequences being longer than the training sequences; an RNN shares the same weights at every time step.
"""LSTM unit with layer normalization and recurrent dropout. This class adds layer normalization and recurrent dropout to a basic LSTM unit. Layer normalization implementation is based on: https://arxiv.org/abs/1607.06450. "Layer Normalization" Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton and is applied before the internal nonlinearities. Recurrent dropout is base on: https://arxiv.org/abs/1603.05118 "Recurrent Dropout without Memory Loss" Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth. """
"""Initializes the basic LSTM cell. Args: num_units: int, The number of units in the LSTM cell. forget_bias: float, The bias added to forget gates (see above). input_size: Deprecated and unused. activation: Activation function of the inner states. layer_norm: If `True`, layer normalization will be applied. norm_gain: float, The layer normalization gain initial value. If `layer_norm` has been set to `False`, this argument will be ignored. norm_shift: float, The layer normalization shift initial value. If `layer_norm` has been set to `False`, this argument will be ignored. dropout_keep_prob: unit Tensor or float between 0 and 1 representing the recurrent dropout probability value. If float and 1.0, no dropout will be applied. dropout_prob_seed: (optional) integer, the randomness seed. reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not `True`, and the existing scope already has the given variables, an error is raised. """
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift.
stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.
The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system
as a whole, to apply to its parts, such as a sub-network or a layer.
Therefore, the input distribution properties that aid the network generalization – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of x to remain fixed over time.
Input distribution properties that aid network generalization, such as having the same distribution between training and test data, also apply to training a sub-network.
Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the subnetwork, as well.
Here u is the layer input and g is the sigmoid activation $g(x)=\dfrac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ approaches 0, which means that for all dimensions of $x=Wu+b$ except those with small absolute values, the gradient flowing back to the input u will vanish; the nonlinearity enters its saturated regime, and this slows down training.
In practice, several strategies already exist for dealing with saturating nonlinearities:
ReLU
Xavier initialization
using a small learning rate
If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
If the distribution of the nonlinearity inputs is kept stable, the optimizer will not get stuck in the saturated regime and training will accelerate.
We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs.
Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or
of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.
However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.
This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ.
Since the full whitening of each layer’s inputs is costly, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance.
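Concretely, each dimension $k$ of the layer input $x$ is normalized as

$$\hat x^{(k)}=\dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$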
where the expectation and variance are computed over the training data set.
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.
Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.
Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.
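A minimal NumPy sketch of the training-time BN transform over a mini-batch (the learnable scale $\gamma$ and shift $\beta$ restore representational power; this is an illustration, not the paper's reference implementation):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations with shape (batch_size, num_features)
    mu = x.mean(axis=0)                     # mini-batch mean, per feature
    var = x.var(axis=0)                     # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each feature independently
    return gamma * x_hat + beta             # learned scale and shift
```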
Training and Inference with Batch-Normalized Networks
The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want
the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization
$\hat x = \dfrac{x-E[x]}{\sqrt{Var[x]+\epsilon}}$
using the population, rather than mini-batch, statistics.
Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
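Since $E[x]$, $\mathrm{Var}[x]$, $\gamma$, and $\beta$ are all fixed at inference time, this can be folded into a single affine transform per activation:

$$y=\dfrac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}}\cdot x+\left(\beta-\dfrac{\gamma\,E[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\right)$$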
In traditional deep networks, too high a learning rate may result in gradients that explode or vanish, as well as getting stuck in poor local minima.
By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network.
Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.
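The paper makes this precise with the scale-invariance property: for a scalar $a$,

$$\mathrm{BN}(Wu)=\mathrm{BN}((aW)u),\qquad \dfrac{\partial\,\mathrm{BN}((aW)u)}{\partial u}=\dfrac{\partial\,\mathrm{BN}(Wu)}{\partial u},\qquad \dfrac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)}=\dfrac{1}{a}\cdot\dfrac{\partial\,\mathrm{BN}(Wu)}{\partial W}$$

so larger weights lead to smaller gradients, which stabilizes the parameter growth.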
Graves, Alex, et al. "A novel connectionist system for unconstrained handwriting recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 31.5 (2009): 855-868.
It may work better, especially in recurrent networks (Hinton)
Adding noise to the activities of the network
For example, during the forward pass, make the outputs of some neurons binary or random. This seemingly haphazard trick disturbs the training process and makes training slower, but according to Hinton it does significantly better on the test set ("But it does significantly better on the test set!").
Combining multiple models
In short, train several models and take the average of their outputs as the result.
The expected error of randomly picking one of the N models as the output is larger than the error of the averaged output of all the models:
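This is the standard squared-error decomposition (added here for completeness): with model outputs $y_i$, target $t$, and average output $\bar y=\frac{1}{N}\sum_i y_i$,

$$\dfrac{1}{N}\sum_{i=1}^N (t-y_i)^2=(t-\bar y)^2+\dfrac{1}{N}\sum_{i=1}^N (y_i-\bar y)^2\;\ge\;(t-\bar y)^2$$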
In order to avoid neurons becoming too correlated and ending up in poor local minima, it is often helpful to randomly initialize the parameters; the most common choice is Xavier initialization.
""" Args: shape: Tuple or 1-d array that species dimensions of requested tensor. Returns: out: tf.Tensor of specified shape sampled from Xavier distribution. """
epsilon = np.sqrt(6/np.sum(shape))
out = tf.Variable(tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon))
To avoid parameters from exploding or becoming highly correlated, it is helpful to augment our cost function with a Gaussian prior: this tends to push parameter weights closer to zero, without constraining their direction, and often leads to classifiers with better generalization ability.
If we maximize the log-likelihood (as with the cross-entropy loss above), then the Gaussian prior becomes a quadratic penalty term (L2 regularization):
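A common way to write the resulting penalized objective, with regularization strength $\lambda$, is:

$$J_{reg}(\theta)=J(\theta)+\lambda\sum_k \theta_k^2$$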