论文笔记-dropblock

paper:

dropblock 是关于 CNN 的,后两篇是关于 RNN 的正则化。

DropBlock

Motivation

Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that

activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout.

通常深度神经网络在过参数化、并且训练时加上大量噪声和正则化(比如权重衰减和 dropout)的情况下能工作得很好。虽然 dropout 对全连接层是非常有效的正则化技术,但它对卷积层的效果往往要差一些。这可能是因为卷积层的激活单元在空间上是相关的,即使 drop 掉部分 unit,信息仍然能传递到下一层网络中去。

Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing number of dropped units during training leads to better accuracy and more robust to hyperparameter choices.

作者为卷积神经网络提出了专门的正则化方式 DropBlock:一次性 drop 掉特征图上一块连续的空间区域。作者发现除了卷积层之外,在 skip connection 上也应用 DropBlock 能进一步提高准确率;同时在训练过程中逐渐增加被 drop 的单元数量(即逐渐降低 keep_prob),准确率更高,对超参数的选择也更鲁棒。

回顾了一下 skip/shortcut connection: 目的是避免梯度消失。可以直接看 GRU 的公式:参考笔记

dropblock

In this paper, we introduce DropBlock, a structured form of dropout, that is particularly effective to regularize convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a feature map, are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data (see Figure 1).

具体的算法很简单,主要关注两个参数的设置: block_size 和 $\gamma$.

  • block_size is the size of the block to be dropped

  • $\gamma$ controls how many activation units to drop.

We experimented with a shared DropBlock mask across different feature channels or each feature channel has its DropBlock mask. Algorithm 1 corresponds to the latter, which tends to work better in our experiments.

对于 channel 维度:每个 feature channel 使用各自独立的 DropBlock mask(对应 Algorithm 1),比所有 channel 共享同一个 mask 的效果要好。

Similar to dropout we do not apply DropBlock during inference. This is interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks include a special subset of sub-networks covered by dropout where each network does not see contiguous parts of feature maps.

关于 infer 时, dropblock 的处理和 dropout 类似。

block_size:

In our implementation, we set a constant block_size for all feature maps, regardless the resolution of feature map. DropBlock resembles dropout [1] when block_size = 1 and resembles SpatialDropout [20] when block_size covers the full feature map.

block_size 设置为 1 时, 类似于 dropout. 当 block_size 设置为整个 feature map 的 size 大小时,就类似于 SpatialDropout.

setting the value of $\gamma$:

In practice, we do not explicitly set $\gamma$. As stated earlier, $\gamma$ controls the number of features to drop. Suppose that we want to keep every activation unit with the probability of keep_prob, in dropout [1] the binary mask will be sampled with the Bernoulli distribution with mean 1 − keep_prob. However, to account for the fact that every zero entry in the mask will be expanded by block_size2 and the blocks will be fully contained in feature map, we need to adjust $\gamma$ accordingly when we sample the initial binary mask. In our implementation, $\gamma$ can be computed as

作者并没有显式地设置 $\gamma$。对于 dropout,二值 mask 按均值为 1 − keep_prob 的 Bernoulli 分布采样;而对于 dropblock,还需要考虑 block_size 的大小以及它与 feature map size 的比例,具体公式见下面的补充。

  • keep_prob 是传统的 dropout 的概率,通常设置为 0.75-0.9.

  • feat_size 是整个 feature map 的 size 大小。

  • (feat_size - block_size + 1) 是选择 dropblock 中心位置的有效区域。
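按原论文 Algorithm 1 补充,$\gamma$ 的计算公式大致如下(符号含义同上面的说明):

$$\gamma = \dfrac{1-\text{keep\_prob}}{\text{block\_size}^2}\cdot\dfrac{\text{feat\_size}^2}{(\text{feat\_size}-\text{block\_size}+1)^2}$$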

The main nuance of DropBlock is that there will be some overlapped in the dropped blocks, so the above equation is only an approximation.

最主要的问题是,被 drop 的 block 之间会出现重叠,所以上述公式只是一个近似。
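下面是一个极简的 numpy 草图(并非官方实现,仅示意单个 feature map 上 mask 的生成流程;feat_size、block_size、keep_prob 的含义同上):

```python
import numpy as np

def dropblock_mask(feat_size, block_size, keep_prob):
    """生成单个 feature map 的 DropBlock mask(1 表示保留,0 表示丢弃)。"""
    gamma = ((1.0 - keep_prob) / block_size ** 2) * \
            (feat_size ** 2 / (feat_size - block_size + 1) ** 2)
    mask = np.ones((feat_size, feat_size))
    valid = feat_size - block_size + 1            # block 中心只能落在这个有效区域内
    centers = np.random.rand(valid, valid) < gamma
    for i, j in zip(*np.nonzero(centers)):
        mask[i:i + block_size, j:j + block_size] = 0.0   # 把整个 block 置零
    return mask

# 使用时:features = features * mask * mask.size / mask.sum(),即按保留比例重新归一化
mask = dropblock_mask(feat_size=14, block_size=5, keep_prob=0.9)
```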

Scheduled DropBlock:

We found that DropBlock with a fixed keep_prob during training does not work well. Applying small value of keep_prob hurts learning at the beginning. Instead, gradually decreasing keep_prob over time from 1 to the target value is more robust and adds improvement for the most values of keep_prob.

按计划(scheduled)地设置 keep_prob:在训练初期就丢弃特征会损害学习效果(降低 performance),所以刚开始把 keep_prob 设置为 1,然后随训练逐渐减小到 target value。

所以是随着网络深度加深而变化,还是随着迭代步数变化,应该是后者吧,类似于 scheduled learning rate.

Experiments

In the following experiments, we study where to apply DropBlock in residual networks. We experimented with applying DropBlock only after convolution layers or applying DropBlock after both convolution layers and skip connections. To study the performance of DropBlock applying to different feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3 and 4.

实验主要在讨论在哪儿加 dropblock 以及 如何在 channels 中加 dropblock。

Variational Dropout

深度学习-优化算法

paper: An overview of gradient descent optimization algorithms

Gradient descent variants

Batch gradient descent

Computes the gradient of the cost function w.r.t. the parameters $\theta$ for the entire training dataset:

$$\theta= \theta - \eta \cdot \nabla_{\theta}J(\theta)$$

```python
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic gradient descent

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i) and label y(i):

$$\theta= \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})$$

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn

online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.

批梯度下降的计算过于冗余,它在每一次参数更新之前的计算过程中会计算很多相似的样本。随机梯度下降则是每一次参数更新计算一个样本,因此更新速度会很快,并且可以在线学习。但是用于更新的梯度的方差会很大,导致 loss 曲线波动很大。

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately

complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost

certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

批梯度下降会收敛到参数初始位置所在盆地(basin)的最小值(也就是说受权重初始化的影响很大)。而 SGD 由于 loss 波动很大,一方面更容易跳出当前的局部最优区域,从而可能找到更好的局部最优值;但另一方面这也使它难以精确收敛到最小值,因为会不断地越过(overshoot)。不过已有结果表明,只要缓慢地降低学习率,SGD 能表现出与批梯度下降相同的收敛行为:对非凸和凸优化分别几乎必然收敛到局部或全局最小值。

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

$$\theta= \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$$

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```

  • reduces the variance of the parameter updates, which can lead to more stable convergence;

减小参数更新的方差,使得收敛更稳定。

  • can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient mini-batch very efficient.

能非常好的利用矩阵优化的方式来加速计算,这在各种深度学习框架里面都很常见。

Challenges

  • Choosing a proper learning rate.

选择合适的学习率。

  • Learning rate schedules. try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.

学习率调度(schedule):在训练过程中按预先定义好的计划(比如退火)调整学习率,或者当相邻 epoch 之间目标函数(loss)的变化低于某个阈值时降低学习率。但这些计划和阈值都必须提前定义,无法自适应数据集本身的特性。

  • the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring

features.

对所有的参数使用相同的学习率。如果你的数据是稀疏的,并且不同的特征的频率有很大的不同,这个时候我们并不希望对所有的参数使用相同的学习率,而是对更罕见的特征执行更大的学习率。

  • Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

对于非凸损失函数的优化问题,需要避免陷入其众多的次优局部极小值。Dauphin et al. [5] 则认为,相比局部极小值,鞍点才是更难处理的问题。鞍点是指一个维度向上倾斜、另一个维度向下倾斜的点,它周围通常是一片误差几乎相同的平坦区域,各个维度的梯度都接近 0,SGD 很难从中逃离。详细的关于鞍点以及 SGD 如何逃离鞍点可参考:知乎:如何逃离鞍点 。

Gradient descent optimization algorithms

Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Figure 2b. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

Momentum

paper: [Neural networks :

the official journal of the International Neural Network Society]()

without Momentum:

$$\theta += -lr * \nabla_{\theta}J(\theta)$$

with Momentum:

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta)$$

$$\theta=\theta-v_t$$

动量梯度下降的理解:

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.

如上图中垂直方向的梯度方向是一致的,那么它的动量会累积,并在这个方向的速度越来越大。而在某个水平方向,其梯度方向总是变化,那么它的速度会减小,也就是在这个方向的波动幅度会得到抑制。

其实就是把梯度看做加速度,参数的更新量看做速度。速度表示一个step更新的大小。加速度总是朝着一个方向,速度必然越来越快。加速度方向总是变化,速度就会相对较小。

$\gamma$ 看做摩擦系数, 通常设置为 0.9。$\eta$ 是学习率。
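按上面两条公式,一个极简的 numpy 更新草图如下(假设 grad(theta) 能返回当前参数处的梯度,gamma、lr、num_steps 分别是动量系数、学习率和迭代步数,这些变量名仅为示意):

```python
import numpy as np

v = np.zeros_like(theta)                 # 初始速度为 0
for _ in range(num_steps):
    v = gamma * v + lr * grad(theta)     # v_t = γ v_{t-1} + η ∇J(θ)
    theta = theta - v                    # θ = θ - v_t
```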

Nesterov accelerate gradient(NAG)

paper: [Yurii Nesterov. A method for unconstrained convex minimization problem

with the rate of convergence o(1/k2).]()

We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.

如果采用 momentum,在接近目标函数最优值时,由于速度在垂直方向是一直增加的,所以速度会很大,这个时候就会越过最小值,然后还得绕回来,增加了训练时间。所以我们需要参数的更新具有先见之明,知道在接近最优解时,降低参数更新的速度大小。

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$$

$$\theta=\theta-v_t$$

在 momentum 中,我们用速度 $\gamma v_{t-1}$ 来更新参数。 事实上在接近局部最优解时,目标函数对于 $\theta$ 的梯度会越来越小,甚至接近于 0. 也就是说,尽管速度在增加,但是速度增加的程度越来越小。我们可以通过速度增加的程度来判断是否要接近局部最优解了。$\nabla_{\theta}J(\theta-\gamma v_{t-1})$ 就表示速度变化的程度,代替一直为正的 $\nabla_{\theta}J(\theta)$,在接近局部最优解时,这个值应该是负的,相应的参数更新的速度也会减小.

在代码实现时,对于 $J(\theta-\gamma v_{t-1})$ 的梯度计算不是很方便,可以令:

$$\phi = \theta-\gamma v_{t-1}$$

然后进行计算,具体可参考 tensorflow 或 pytorch 中代码。
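一个极简的 numpy 草图(同样假设 grad(x) 可以在任意点 x 处求梯度):先沿动量方向"预走"一步,再在预走的位置计算梯度。

```python
import numpy as np

v = np.zeros_like(theta)
for _ in range(num_steps):
    lookahead = theta - gamma * v            # φ = θ - γ v_{t-1}
    v = gamma * v + lr * grad(lookahead)     # 在预走位置计算梯度
    theta = theta - v
```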

Adagrad

paper: [Adaptive Subgradient Methods for Online Learning

and Stochastic Optimization]()

Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

对于不同的参数,自适应的调整对应的梯度大小。对低频参数或特征,使其更新的梯度较大,对高频的参数或特征,使其更新的梯度较小。比如在训练 Glove 词向量时,低频词在某一步迭代中可能并没有参与 loss 的计算,所以更新的会相对较慢,所以需要人为的增大它的梯度。

不同的时间步 t,不同的参数 i 对应的梯度:

$$g_{t,i}=\nabla_{\theta_t}J(\theta_t,i)$$

$$\theta_{t+1,i}=\theta_{t,i}-\eta \cdot g_{t,i}$$

$$\theta_{t+1,i}=\theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\, g_{t,i}$$

$G_t$ 是对角矩阵,对角元素 $G_{t,ii}$ 是参数 $\theta_i$ 到当前时刻为止所有历史梯度的平方和。

```python
cache += dx**2
x += -lr * dx / (np.sqrt(cache) + 1e-7)
```

RMSprop

Geoff Hinton Lecture 6e

Adagrad 中随着 cache 的不断累积,有效学习率最终会趋近于 0。RMSprop 在此基础上进行了改进,给 cache 加了一个衰减率,相当于只考虑最近一段时间的梯度,很早之前的梯度经过衰减后影响很小。

$$E[g^2]_ t=0.9E[g^2]_ {t-1}+0.1g^2_t$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{E[g^2]_ t+\epsilon}}g_t$$

```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -lr * dx / (np.sqrt(cache) + 1e-7)
```

使用指数衰减的形式来保存 cache 能有效的节省内存,只需要记录当前的梯度值即可,而不用保存所有的梯度值。

Adam(Adaptive Moment Estimation)

Adam: a Method for Stochastic Optimization.

In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:

similar to momentum:

$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$$

similar to Adagrad/RMSprop:

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

$$\hat m_t=\dfrac{m_t}{1-\beta^t_1}$$

$$\hat v_t=\dfrac{v_t}{1-\beta^t_2}$$

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which

yields the Adam update rule:

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+ \epsilon}\hat m_t$$

  • $m_t$ 是类似于 Momentum 中参数更新量,是梯度的函数. $\beta_1$ 是摩擦系数,一般设为 0.9.

  • $v_t$ 是类似于 RMSprop 中的 cache,用来自适应的改变不同参数的梯度大小。

  • $\beta_2$ 是 cache 的衰减系数,一般设为 0.999。完整的一步更新见下面的 numpy 草图。
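把上面的公式串起来,Adam 一步更新的 numpy 草图大致如下(假设 dx 是当前梯度,m、v 初始为 0,t 从 1 开始计数;变量名仅为示意):

```python
m = beta1 * m + (1 - beta1) * dx            # 一阶矩(动量)
v = beta2 * v + (1 - beta2) * dx**2         # 二阶矩(cache)
m_hat = m / (1 - beta1**t)                  # 偏差修正
v_hat = v / (1 - beta2**t)
x += -lr * m_hat / (np.sqrt(v_hat) + eps)   # 参数更新
```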

AdaMax

Adam: a Method for Stochastic Optimization.

在 Adam 中,用来归一化梯度的因子 $v_t$ 累积了过去梯度(包含在 $v_{t-1}$ 项中)以及当前梯度 $|g_t|^2$ 的 $\ell_2$ 范数信息,参数更新量与它成反比。

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

可以将其泛化到 $\ell_p$ 范数,同样地 $\beta_2$ 变为 $\beta_2^p$。

Norms for large p values generally become numerically unstable, which is why $l_1$ and $l_2$ norms are most common in practice. However, $l_{\infty}$ also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that $v_t$ with $l_{\infty}$ converges to the following more stable value. To avoid confusion with Adam, we use ut to denote the infinity norm-constrained $v_t$:

$$u_t=\beta_2^{\infty}v_{t-1}+(1-\beta_2^{\infty})|g_t|^{\infty}=\max(\beta_2\cdot v_{t-1},\ |g_t|)$$

然后用 $u_t$ 代替 Adam 中的 $\sqrt{\hat v_t}+\epsilon$:

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{u_t}\hat m_t$$

Note that as $u_t$ relies on the max operation, it is not as suggestible to bias towards zero as $m_t$ and $v_t$ in Adam, which is why we do not need to compute a bias correction for $u_t$. Good default values are again:

$$\eta = 0.002, \beta_1 = 0.9, \beta_2 = 0.999.$$
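对应的 AdaMax 一步更新草图(同样假设 dx 是当前梯度,u 初始为 0;u 不需要偏差修正,分母上的 eps 是为数值稳定额外加的):

```python
m = beta1 * m + (1 - beta1) * dx
m_hat = m / (1 - beta1**t)
u = np.maximum(beta2 * u, np.abs(dx))     # 无穷范数的递推形式
x += -lr * m_hat / (u + eps)
```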

Visualization of algorithms

The first visualization shows the path the optimizers took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness by looking ahead and headed to the minimum.

如果目标函数是 Beale 这种类型的函数,自适应优化算法能更直接的收敛到最小值。而 Momentum 和 NAG 则偏离了轨道,就像球从山上滚下一样,刹不住车。但是 NAG 因为对未来具有一定的预见性,所以能更早的纠正从而提高其响应能力。

The second visualization shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope, while the other dimension has a negative slope, which pose a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.

各种优化算法在鞍点处的表现:SGD、Momentum、NAG 很难打破对称性,后两者最终勉强逃离了鞍点;而自适应算法 Adagrad、RMSprop、Adadelta 能很快地沿负斜率方向逃离鞍点。

example

model

```python
import torch
import torch.nn as nn
import torch.optim as optim


class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.linear1 = nn.Linear(10, 5)
        self.linear2 = nn.Linear(5, 1)
        self.loss = nn.BCELoss()

    def forward(self, x, label):
        """
        x: [batch, 10]
        label: [batch]
        """
        # nn.BCELoss 要求输入是 (0, 1) 之间的概率,所以这里先过一层 sigmoid
        out = torch.sigmoid(self.linear2(self.linear1(x))).squeeze()
        loss = self.loss(out, label)
        return out, loss
```

```python
model = TestNet()
model
```

TestNet(

  (linear1): Linear(in_features=10, out_features=5, bias=True)

  (linear2): Linear(in_features=5, out_features=1, bias=True)

  (loss): BCELoss()

)
```python
list(model.named_parameters())
```

[('linear1.weight', Parameter containing:

  tensor([[ 0.2901, -0.0022, -0.1515, -0.1064, -0.0475, -0.0324,  0.0404,  0.0266,

           -0.2358, -0.0433],

          [-0.1588, -0.1917,  0.0995,  0.0651, -0.2948, -0.1830,  0.2356,  0.1060,

            0.2172, -0.0367],

          [-0.0173,  0.2129,  0.3123,  0.0663,  0.2633, -0.2838,  0.3019, -0.2087,

           -0.0886,  0.0515],

          [ 0.1641, -0.2123, -0.0759,  0.1198,  0.0408, -0.0212,  0.3117, -0.2534,

           -0.1196, -0.3154],

          [ 0.2187,  0.1547, -0.0653, -0.2246, -0.0137,  0.2676,  0.1777,  0.0536,

           -0.3124,  0.2147]], requires_grad=True)),

 ('linear1.bias', Parameter containing:

  tensor([ 0.1216,  0.2846, -0.2002, -0.1236,  0.2806], requires_grad=True)),

 ('linear2.weight', Parameter containing:

  tensor([[-0.1652,  0.3056,  0.0749, -0.3633,  0.0692]], requires_grad=True)),

 ('linear2.bias', Parameter containing:

  tensor([0.0450], requires_grad=True))]

add model parameters to optimizer

```python
import torch.optim as optim

# parameters = model.parameters()
parameters_filters = filter(lambda p: p.requires_grad, model.parameters())
```

```python
optimizer = optim.Adam(
    params=parameters_filters,
    lr=0.001,
    betas=(0.8, 0.999),
    eps=1e-8,
    weight_decay=3e-7)
```

```python
optimizer.state_dict
```

<bound method Optimizer.state_dict of Adam (

Parameter Group 0

    amsgrad: False

    betas: (0.8, 0.999)

    eps: 1e-08

    lr: 0.001

    weight_decay: 3e-07

)>

不同的模块设置不同的参数

```python
parameters = [{"params": model.linear1.parameters()},
              {"params": model.linear2.parameters(), "lr": 3e-4}]
```

```python
optimizer2 = optim.Adam(
    params=parameters,
    lr=0.001,
    betas=(0.8, 0.999),
    eps=1e-8,
    weight_decay=3e-7)
```

```python
optimizer2.state_dict
```

<bound method Optimizer.state_dict of Adam (

Parameter Group 0

    amsgrad: False

    betas: (0.8, 0.999)

    eps: 1e-08

    lr: 0.001

    weight_decay: 3e-07



Parameter Group 1

    amsgrad: False

    betas: (0.8, 0.999)

    eps: 1e-08

    lr: 0.0003

    weight_decay: 3e-07

)>

zero_grad

在进行反向传播之前,如果不需要梯度累加的话,必须要用zero_grad()清空梯度。具体的方法是遍历self.param_groups中全部参数,根据grad属性做清除。

```python
def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()
```

```python
group_parameters = [{"params": model.linear1.parameters()},
                    {"params": model.linear2.parameters(), "lr": 3e-4}]
```

```python
x = torch.randn(2, 10)
label = torch.Tensor([1, 0])
out, loss = model(x, label)
loss.backward()
```

```python
optimizer2.zero_grad()
```

```python
for group in group_parameters:
    for p in group["params"]:
        if p.grad is not None:
            p.grad.detach_()
            p.grad.zero_()
```

如果之前没有调用 backward(),参数的 grad 还是 None,这个循环不会做任何事;上面已经调用过 loss.backward(),所以这里会把各参数的梯度清零,效果与 optimizer2.zero_grad() 相同。

在反向传播 backward() 计算出梯度之后,就可以调用step()实现参数更新。不过在 Optimizer 类中,step()函数内部是空的,并且用raise NotImplementError 来作为提醒。后面会根据具体的优化器来分析step()的实现思路。

辅助类lr_scheduler

lr_scheduler用于在训练过程中根据轮次灵活调控学习率。调整学习率的方法有很多种,但是其使用方法是大致相同的:用一个Schedule把原始Optimizer装饰上,然后再输入一些相关参数,然后用这个Schedule做step()。

```python
# lambda1 = lambda epoch: epoch // 30
lambda1 = lambda epoch: 0.95 ** epoch
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)
```

```python
scheduler.step()
```

warm up scheduler

```python
import math

parameters = filter(lambda p: p.requires_grad, model.parameters())

lr_warm_up_num = 1000

optimizer = optim.Adam(
    params=parameters,
    lr=0.001,
    betas=(0.8, 0.999),
    eps=1e-8,
    weight_decay=3e-7)

cr = 1.0 / math.log(lr_warm_up_num)

scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda ee: cr * math.log(ee + 1)
    if ee < lr_warm_up_num else 1)
```

论文笔记-batch,layer,weights normalization

paper:

Batch Normalization

在之前的笔记已经详细看过了:深度学习-Batch Normalization

Layer Normalization

Motivation

batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.

关于 batch normalization:

从 Ng 的课上截来的一张图,全链接层相比卷积层更容易理解点,但形式上是一样的.

样本数量是 m,第 l 层经过激活函数输出是第 l+1 层的输入,其中第 i 个神经元的值:

线性输出: $z_i^l={w_i^l}^Th^l$.

非线性输出: $h_i^{l+1} = a_i^l=f(z_i^l+b_i^l)$

其中 f 是非线性激活函数,$a_i^l$ 是下一层的 summed inputs. 如果 $a_i^l$ 的分布变化较大(change in a highly correlated way),下一层的权重 $w^{l+1}$ 的梯度也会相应变化很大(反向传播中 $w^{l+1}$ 的梯度就是 $a_i^l$)。

Batch Normalization 就是将线性输出归一化。

其中 $u_i^l$ 是均值,$\sigma_i^l$ 是方差。 $\overline a_i^l$ 是归一化之后的输出。 $g_i^l$ 是需要学习的参数,也就是 scale.

有个疑问?为什么 BN 要在激活函数之前进行,而不是之后进行呢?

上图中是单个样本,而所有的样本其实是共享层与层之间的参数的。样本与样本之间也存在差异,所以在某一个特征维度上进行归一化,(每一层其中的一个神经元可以看作一个特征维度)。

batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.

BN 不适用于 RNN 的一个原因是 batch 中各条 sentence 的长度不一致。可以把每一个时间步看作一个维度的特征,如果像 BN 一样在这个维度上归一化,在 RNN 上显然行不通:比如这个 batch 中最长序列的最后一个时间步,它的均值就是它本身,岂不是出现了 BN 在单个样本上统计的情况。

In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.

所以作者在这篇 paper 中提出了 Layer Normalization. 在单个样本上计算均值和方差进行归一化。然而是怎么进行的呢?

Layer Normalization

layer normalization 并不是在样本上求平均值和方差,而是在 hidden units 上求平均值和方差。

其中 H 是 hidden units 的个数。
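对应的均值和方差(按 Layer Normalization 原论文补充):

$$\mu^l=\dfrac{1}{H}\sum_{i=1}^{H}a_i^l \qquad \sigma^l=\sqrt{\dfrac{1}{H}\sum_{i=1}^{H}(a_i^l-\mu^l)^2}$$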

BN 和 LN 的差异:

Layer normalization 在单个样本上取均值和方差,所以在训练和测试阶段的行为是一致的。

并且,尽管求均值和方差的方式不一样,但是在转换成 beta 和 gamma 的方式是一样的,都是在 channels 或者说 hidden_size 上进行的。

Layer normalized recurrent neural networks

It is common among the NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.

这一部分也解释了 BN 不适用于 RNN 的原因,从 test sequence longer 的角度。RNN 的每个时间步计算共享参数权重.

$a^t=W_{hh}h^{t-1}+W_{xh}x^t$

其中 b 和 g 是可学习的参数。
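按原论文补充,LN 在 RNN 中作用在 summed inputs $a^t$ 上,归一化之后再做仿射变换和非线性:

$$\mu^t=\dfrac{1}{H}\sum_{i=1}^{H}a_i^t \qquad \sigma^t=\sqrt{\dfrac{1}{H}\sum_{i=1}^{H}(a_i^t-\mu^t)^2} \qquad h^t=f\Big[\dfrac{g}{\sigma^t}\odot(a^t-\mu^t)+b\Big]$$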

layer normalize 在 LSTM 上的使用:

tensorflow 实现

batch Normalization

```python
import tensorflow as tf
from tensorflow.python.training.moving_averages import assign_moving_average

tf.reset_default_graph()

### batch normalization
def batch_norm(inputs, decay=0.9, is_training=True, epsilon=1e-6):
    """
    :param inputs: [batch, length, width, channels]
    :param is_training:
    :param epsilon:
    :return:
    """
    pop_mean = tf.Variable(tf.zeros(inputs.shape[-1]), trainable=False, name="pop_mean")
    pop_var = tf.Variable(tf.ones(inputs.shape[-1]), trainable=False, name="pop_variance")

    def update_mean_and_var():
        # 在除 channel 之外的所有维度上求均值和方差,得到形状为 [channels] 的统计量
        axes = list(range(inputs.shape.ndims - 1))
        batch_mean, batch_var = tf.nn.moments(inputs, axes=axes)
        moving_average_mean = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
        # 也可用 assign_moving_average(pop_mean, batch_mean, decay)
        moving_average_var = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
        # 也可用 assign_moving_average(pop_var, batch_var, decay)
        with tf.control_dependencies([moving_average_mean, moving_average_var]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    mean, variance = tf.cond(tf.equal(is_training, True), update_mean_and_var,
                             lambda: (pop_mean, pop_var))
    beta = tf.Variable(initial_value=tf.zeros(inputs.get_shape()[-1]), name="shift")
    gamma = tf.Variable(initial_value=tf.ones(inputs.get_shape()[-1]), name="scale")
    return tf.nn.batch_normalization(inputs, mean, variance, beta, gamma, epsilon)
```

layer normalization

```python
import tensorflow as tf

batch = 60
hidden_size = 64
whh = tf.random_normal(shape=[batch, hidden_size], mean=5.0, stddev=10.0)

whh_norm = tf.contrib.layers.layer_norm(inputs=whh, center=True, scale=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(whh)
    print(whh_norm)
    print(sess.run([tf.reduce_mean(whh[0]), tf.reduce_mean(whh[1])]))
    print(sess.run([tf.reduce_mean(whh_norm[0]), tf.reduce_mean(whh_norm[5]), tf.reduce_mean(whh_norm[59])]))
    print(sess.run([tf.reduce_mean(whh_norm[:, 0]), tf.reduce_mean(whh_norm[:, 1]), tf.reduce_mean(whh_norm[:, 63])]))
    print("\n")
    for var in tf.trainable_variables():
        print(var)
        print(sess.run(var))
```


Tensor("random_normal:0", shape=(60, 64), dtype=float32)

Tensor("LayerNorm/batchnorm/add_1:0", shape=(60, 64), dtype=float32)

[5.3812757, 4.607581]

[-1.4901161e-08, -2.9802322e-08, -3.7252903e-09]

[-0.22264712, 0.14112064, -0.07268284]





<tf.Variable 'LayerNorm/beta:0' shape=(64,) dtype=float32_ref>

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

<tf.Variable 'LayerNorm/gamma:0' shape=(64,) dtype=float32_ref>

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

发现一个很奇怪的问题:layer norm 是在每一个训练样本上求均值和方差,为啥 beta 和 gamma 的 shape 却是 [hidden_size]?按理说不应该是 [batch] 吗?带着疑问去看了源码,原来是这样的。

把源码用简洁的方式重写了一遍:

```python
import tensorflow as tf

def layer_norm_mine(inputs, epsilon=1e-12, center=True, scale=True):
    """
    inputs: [batch, sequence_len, hidden_size] or [batch, hidden_size]
    """
    inputs_shape = inputs.shape
    inputs_rank = inputs_shape.ndims
    params_shape = inputs_shape[-1:]
    beta, gamma = None, None
    if center:
        beta = tf.get_variable(
            name="beta",
            shape=params_shape,
            initializer=tf.zeros_initializer(),
            trainable=True
        )
    if scale:
        gamma = tf.get_variable(
            name="gamma",
            shape=params_shape,
            initializer=tf.ones_initializer(),
            trainable=True
        )
    norm_axes = list(range(1, inputs_rank))
    mean, variance = tf.nn.moments(inputs, norm_axes, keep_dims=True)  # [batch, 1]
    inv = tf.rsqrt(variance + epsilon)
    inv *= gamma
    return inputs * inv + ((beta - mean) * inv if beta is not None else -mean * inv)


batch = 60
hidden_size = 64
whh = tf.random_normal(shape=[batch, hidden_size], mean=5.0, stddev=10.0)

whh_norm = layer_norm_mine(whh)
```

layer_norm_mine 得到的结果与源码一致。可以发现计算均值和方差时,tf.nn.moments 的 axes 取的是除 batch 外的所有维度(tf.nn.moments 中 axes 的含义是在这些维度上求均值和方差),也就是说得到的均值和方差确实是每个样本一个(形状为 [batch, 1])。只是 beta 和 gamma 的仿射变换依旧是在最后一个维度(hidden_size)上进行的,这一点与 batch normalization 是一样的。只不过是否符合图像或文本的特性就另说了。

LayerNormBasicLSTMCell

```python
class LayerNormBasicLSTMCell(rnn_cell_impl.RNNCell):
  """LSTM unit with layer normalization and recurrent dropout.

  This class adds layer normalization and recurrent dropout to a
  basic LSTM unit. Layer normalization implementation is based on:

    https://arxiv.org/abs/1607.06450.

  "Layer Normalization"
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

  and is applied before the internal nonlinearities.
  Recurrent dropout is base on:

    https://arxiv.org/abs/1603.05118

  "Recurrent Dropout without Memory Loss"
  Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth.
  """

  def __init__(self,
               num_units,
               forget_bias=1.0,
               input_size=None,
               activation=math_ops.tanh,
               layer_norm=True,
               norm_gain=1.0,
               norm_shift=0.0,
               dropout_keep_prob=1.0,
               dropout_prob_seed=None,
               reuse=None):
    """Initializes the basic LSTM cell.

    Args:
      num_units: int, The number of units in the LSTM cell.
      forget_bias: float, The bias added to forget gates (see above).
      input_size: Deprecated and unused.
      activation: Activation function of the inner states.
      layer_norm: If `True`, layer normalization will be applied.
      norm_gain: float, The layer normalization gain initial value. If
        `layer_norm` has been set to `False`, this argument will be ignored.
      norm_shift: float, The layer normalization shift initial value. If
        `layer_norm` has been set to `False`, this argument will be ignored.
      dropout_keep_prob: unit Tensor or float between 0 and 1 representing the
        recurrent dropout probability value. If float and 1.0, no dropout will
        be applied.
      dropout_prob_seed: (optional) integer, the randomness seed.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
    """
    super(LayerNormBasicLSTMCell, self).__init__(_reuse=reuse)

    if input_size is not None:
      logging.warn("%s: The input_size parameter is deprecated.", self)

    self._num_units = num_units
    self._activation = activation
    self._forget_bias = forget_bias
    self._keep_prob = dropout_keep_prob
    self._seed = dropout_prob_seed
    self._layer_norm = layer_norm
    self._norm_gain = norm_gain
    self._norm_shift = norm_shift
    self._reuse = reuse

  @property
  def state_size(self):
    return rnn_cell_impl.LSTMStateTuple(self._num_units, self._num_units)

  @property
  def output_size(self):
    return self._num_units

  def _norm(self, inp, scope, dtype=dtypes.float32):
    shape = inp.get_shape()[-1:]
    gamma_init = init_ops.constant_initializer(self._norm_gain)
    beta_init = init_ops.constant_initializer(self._norm_shift)
    with vs.variable_scope(scope):
      # Initialize beta and gamma for use by layer_norm.
      vs.get_variable("gamma", shape=shape, initializer=gamma_init, dtype=dtype)
      vs.get_variable("beta", shape=shape, initializer=beta_init, dtype=dtype)
    normalized = layers.layer_norm(inp, reuse=True, scope=scope)
    return normalized

  def _linear(self, args):
    out_size = 4 * self._num_units
    proj_size = args.get_shape()[-1]
    dtype = args.dtype
    weights = vs.get_variable("kernel", [proj_size, out_size], dtype=dtype)
    out = math_ops.matmul(args, weights)
    if not self._layer_norm:
      bias = vs.get_variable("bias", [out_size], dtype=dtype)
      out = nn_ops.bias_add(out, bias)
    return out

  def call(self, inputs, state):
    """LSTM cell with layer normalization and recurrent dropout."""
    c, h = state
    args = array_ops.concat([inputs, h], 1)
    concat = self._linear(args)
    dtype = args.dtype

    i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
    if self._layer_norm:
      i = self._norm(i, "input", dtype=dtype)
      j = self._norm(j, "transform", dtype=dtype)
      f = self._norm(f, "forget", dtype=dtype)
      o = self._norm(o, "output", dtype=dtype)

    g = self._activation(j)
    if (not isinstance(self._keep_prob, float)) or self._keep_prob < 1:
      g = nn_ops.dropout(g, self._keep_prob, seed=self._seed)

    new_c = (
        c * math_ops.sigmoid(f + self._forget_bias) + math_ops.sigmoid(i) * g)
    if self._layer_norm:
      new_c = self._norm(new_c, "state", dtype=dtype)
    new_h = self._activation(new_c) * math_ops.sigmoid(o)

    new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
    return new_h, new_state
```

深度学习-Batch Normalization

Paper Reading

Motivation

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift.

神经网络训练过程中参数不断改变导致后续每一层输入的分布也发生变化,而学习的过程又要使每一层适应输入的分布,这使得不得不降低学习率、小心地初始化,并且使得那些具有易饱和非线性激活函数的网络训练臭名昭著。作者将分布发生变化称之为 internal covariate shift。

stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.

在深度学习中我们采用SGD取得了非常好的效果,SGD简单有效,但是它对超参数非常敏感,尤其是学习率和初始化参数。

The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system

as a whole, to apply to its parts, such as a sub-network or a layer.

因为学习的过程中每一层需要去连续的适应每一层输入的分布,所以输入分布发生变化时,会产生一些问题。这里作者引用了 covariate shiftdomain adaptation 这两个概念。

Therefore, the input distribution properties that aid the network generalization – such as having the same distribution between the training and test data – apply to training the sub-network as well.As such it is advantageous for the distribution of x to remain fixed over time.

有助于网络泛化的输入分布属性:例如在训练和测试数据之间具有相同的分布,也适用于训练子网络

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the subnetwork, as well.

固定输入分布对该子网络其他部分的网络的训练会产生积极的影响。

总结下为什么要使用 BN

在训练的过程中,因为前一层的参数改变,将会导致后一层的输入的分布不断地发生改变,这就需要降低学习速率同时要注意参数的初始化,也使具有饱和非线性(saturating nonlinearity)结构的模型非常难训练(所谓的饱和就是指函数的值域是个有限值,即当函数自变量趋向无穷时,函数值不趋向无穷)。深度神经网络之所以复杂是因为它每一层的输出都会受到之前层的影响,因此一个小小的参数改变都会对网络产生巨大的改变。作者将这种现象称为internal covariate shift,提出了对每个输入层进行规范化来解决。在文中,作者提到使用BN可以在训练的过程中使用较高的学习速率,可以比较随意的对参数进行初始化,同时BN也起到了一种正则化的作用,在某种程度上可以取代dropout的作用。

考虑一个以sigmoid为激活函数的神经层:

$z=g(Wu+b)$

其中 u 是输入,g 是 sigmoid 激活函数 $g(x)=\dfrac{1}{1+\exp(-x)}$。当 $|x|$ 增大时,$g'(x)$ 趋近于 0,这意味着对于 $x=Wu+b$ 的大多数维度(除了绝对值较小的那些),流向输入 u 的梯度都会消失,也就是进入非线性的饱和区域,这会降低模型的训练速度。

在实际应用中,对于非线性饱和的情况,已经有很多应对策略:

  • ReLU

  • 初始化 Xavier initialization.

  • 用一个较小的学习速率进行学习

If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

如果保证非线性输入的分布稳定,优化器也就不会陷于饱和区域了,训练也会加速。

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs.

作者把这种输入分布的变化叫做内部协方差偏移。并提出了 Batch Normalization,通过固定输入的均值和方差。

Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or

of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

BN 除了能解决 internal covariate shift 的问题,还能够降低梯度对学习率,初始化参数设置的依赖。这使得我们可以使用较大的学习率,正则化模型,降低对 dropout 的需求,最后还保证网络能够使用具有饱和性的非线性激活函数。

Towards Reducing Internal Covariate Shift

whitening 白化操作

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.

使用白化 whitening 有助于模型收敛,白化是线性变化,转化为均值为0,方差为1,并且去相关性。

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

如果将白化与基于梯度下降的优化混合在一起,那么在执行梯度下降的过程中会受到标准化的参数更新的影响,这样会减弱甚至抵消梯度下降的产生的影响。

作者举了这样一个例子:

考虑一个输入 u 和一个可学习的参数 b 相加作为一个 layer. 通过减去均值进行标准化 $\hat x=x-E[x]$, 其中 x=u+b. 则前向传播的过程:

$x=u+b \rightarrow \hat x = x-E[x] \rightarrow loss$

反向传播对参数 b 求导(不考虑 b 和 E[x] 的相关性):

$\dfrac{\partial l}{\partial b}=\dfrac{\partial l}{\partial \hat x}\dfrac{\partial \hat x}{\partial b} = \dfrac{\partial l}{\partial \hat x}$

那么 $\Delta b = -\dfrac{\partial l}{\partial \hat x}$, 则对于参数 b 的更新: $b \leftarrow \Delta b + b$.

那么经过了标准化、梯度下降更新参数之后:

$u+(b+\Delta b)-E[u+(b+\Delta b)]=u+b-E[u+b]$

这意味着这个 layer 的输出没有变化,损失 $\dfrac{\partial l}{\partial \hat x}$ 也没有变化,那么随着训练的进行,**b 会无限地增长???**,而 loss 不变。

This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.

如果规范化不仅中心处理(即减去均值),而且还对激活值进行缩放,问题会变得更严重。通过实验发现, 当归一化参数在梯度下降步骤之外进行,模型会爆炸。

进行白化操作,并且在优化时考虑标准化的问题

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution.Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ.

之所以会产生以上的问题,主要是梯度优化的过程中没有考虑到标准化操作的进行(不好实现)。为了解决这一问题,作者提出我们需要保证网络产生的激活总是有相同的分布。这样做允许损失值关于模型参数的梯度考虑到标准化。

再一次考虑 x 是一个 layer 的输入,看作一个向量,$\chi$ 是整个训练集,则标准化:

$\hat x = Norm(x, \chi)$

这时标准化的参数不仅取决于当前的输入x,还和整个训练集 $\chi$ 有关,当x来自其它层的输出时,那么上式就会和前面层的网络参数 $\theta$ 有关,反向传播时需要计算:

$$\frac{\partial{Norm(x,\chi)}}{\partial{x}}\text{ and }\frac{\partial{Norm(x,\chi)}}{\partial{\chi}}$$

如果忽略上面第二项,就会出现之前说到的问题。但是直接在这一架构下进行白化操作代价非常大,主要是需要计算协方差矩阵、进行归一化,反向传播时也需要进行相关的计算。因此需要寻找一种新的方法,既可以达到类似的效果,又不需要在每次参数更新后分析整个训练集。

Normalization via Mini-Batch Statistics

对比于白化的两个简化

Since the full whitening of each layer’s inputs is costly, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance.

既然白化操作这么费时费力,作者考虑两点必要的简化。第一点,对输入特征的每一维 $x=(x^{(1)},…,x^{(d)})$ 进行去均值和单位方差的处理。

$$\hat x^{(k)} = \dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt {Var[x^{(k)}]}}$$

where the expectation and variance are computed over the

training data set.

其中均值和方差是基于整个训练集计算得到的。

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

但是如果仅是简单地对每一层的输入进行标准化,可能会改变该层的表达能力。比如对 sigmoid 激活函数的输入做标准化,会把输入限制在其线性区域。

为了解决这一问题,作者提出了这样的改变,引入一对参数 $\gamma^{(k)}$, $\beta^{(k)}$ 来对归一化之后的值进行缩放和平移。

$$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}$$

$\gamma^{(k)}$, $\beta^{(k)}$ 是可学习的参数,用来恢复标准化之后网络的表达能力。极端情况下,如果学到 $\gamma^{(k)}=\sqrt {Var[x^{(k)}]}$、$\beta^{(k)}=E[x^{(k)}]$,就可以完全恢复出原始的激活值。

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.

在batch中使用整个训练集的均值和方差是不切实际的,因此,作者提出了 第二个简化,用 mini-batch 来估计均值和方差。

Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

注意到 mini-batch 上计算的是每一维各自的方差,而不是联合协方差。如果使用联合协方差,就需要额外的正则化:mini-batch 的大小往往小于需要白化的激活值数量,会得到奇异的协方差矩阵(singular covariance matrices)。

BN 核心流程

batch size m, 我们关注其中某一个维度 $x^{k}$, k 表示第k维特征。那么对于 batch 中该维特征的 m 个值:

$$B={x_{1,…,m}}$$

经过线性转换:

$$BN_{\gamma, \beta}:x_{1,..,m}\rightarrow y_{1,..,m}$$

  • 对于输入的 mini-batch 的一个维度,计算均值和方差

  • 标准化(注意 epsilon 避免0错误)

  • 使用两个可学习的参数 $\gamma, \beta$ 进行缩放和平移,对应的公式见下面的补充
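这几步对应的公式(按原论文 Algorithm 1 补充):

$$\mu_B=\dfrac{1}{m}\sum_{i=1}^{m}x_i \qquad \sigma_B^2=\dfrac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2$$

$$\hat x_i=\dfrac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}} \qquad y_i=\gamma\hat x_i+\beta$$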

这里有点疑惑:为什么在第三步已经完成标准化的情况下还要进行4操作,后来发现其实作者在前文已经说了。首先 $\hat x$ 是标准化后的输出,但是如果仅以此为输出,其输出就被限定为了标准正态分布,这样很可能会限制原始网络能表达的信息,前文已用sigmoid函数进行了举例说明。因为 $\gamma, \beta$ 这两个参数是可以学习的,所以的标准化后的”恢复”程度将在训练的过程中由网络自主决定。

利用链式法则,求损失函数对参数 $\gamma, \beta$ 求导:
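按原论文补充,对 $\gamma,\beta$ 的梯度为:

$$\dfrac{\partial l}{\partial \gamma}=\sum_{i=1}^{m}\dfrac{\partial l}{\partial y_i}\,\hat x_i \qquad \dfrac{\partial l}{\partial \beta}=\sum_{i=1}^{m}\dfrac{\partial l}{\partial y_i}$$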

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.

BN 是可微的,保证模型可训练,网络可以学习得到输入的分布,来减小 internal covarite shift, 从而加速训练。

Training and Inference with Batch-Normalized Networks

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want

the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

$\hat x = \dfrac{x-E[x]}{\sqrt{Var[x]+\epsilon}}$

using the population, rather than mini-batch, statistics.

在训练阶段和推理(inference)阶段不一样,这里的推理阶段指的就是测试阶段,在测试阶段使用总体的均值,而不是 mini-batch 的均值。

Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.

Batch-Normalized Convolutional Networks

  • 第1-5步是算法1的流程,对每一维标准化,得到 $N_{BN}^{tr}$

  • 6-7步优化训练参数 $\theta \cup \{\gamma^{(k)}, \beta^{(k)}\}$,在测试阶段参数是固定的

  • 8-12步是将训练阶段的统计信息转化为训练集整体的统计信息。因为完成训练后,在预测阶段我们使用的是模型保存的整体统计信息。这里涉及到用样本统计量对总体均值和方差做无偏估计:样本均值是总体均值的无偏估计,而样本方差需要乘以 $\frac{m}{m-1}$ 才是总体方差的无偏估计。具体可看知乎上的解答 https://www.zhihu.com/question/20099757 。推理阶段使用的线性变换见下面的补充公式。
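按原论文补充,推理阶段使用总体统计量后,BN 对每个激活就是一个固定的线性变换:

$$y=\dfrac{\gamma}{\sqrt{Var[x]+\epsilon}}\cdot x+\Big(\beta-\dfrac{\gamma\, E[x]}{\sqrt{Var[x]+\epsilon}}\Big)$$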

Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.

学习率过大容易发生梯度消失和梯度爆炸,从而陷入局部最小值。

By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network.

通过规范化整个网络中的激活,可以防止层参数的微小变化在数据通过深层网络传播时放大。

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.

BN 能让训练时的参数更有弹性。通常,学习率过大会增大网络参数,在反向传播中导致梯度过大而发生梯度爆炸。而 BN 使得网络不受参数的大小的影响。

正则化

除了可以更快地训练网络,BN层还有对模型起到正则化的作用。因为当训练一个BN网络的时候,对于一个给定的样本,它还可以”看到”一个batch中其他的情况,这样网络对于一个给定的样本输入每次就可以产生一个不确定的输出(因为标准化的过程和batch中其他的样本均有关联),作者通过实验证明这对减少模型的过拟合具有作用。

代码实现

tensorflow 已经封装好了 BN 层,可以直接通过 tf.contrib.layers.batch_norm() 调用,如果你想知道函数背后的具体实现方法,加深对BN层的理解,可以参考这篇文章Implementing Batch Normalization in Tensorflow

reference:

机器学习-过拟合

过拟合的原理以及解决方法。

Overfitting

过拟合(overfitting)是指在模型参数拟合过程中的问题,由于训练数据包含抽样误差,训练时,复杂的模型将抽样误差也考虑在内,将抽样误差也进行了很好的拟合。

具体表现就是最终模型在训练集上效果好;在测试集上效果差。模型泛化能力弱。

为什么要解决过拟合

为什么要解决过拟合现象?这是因为我们拟合的模型一般是用来预测未知的结果(不在训练集内),过拟合虽然在训练集上效果好,但是在实际使用时(测试集)效果差。同时,在很多问题上,我们无法穷尽所有状态,不可能将所有情况都包含在训练集上。所以,必须要解决过拟合问题。

为什么在机器学习中比较常见?这是因为机器学习算法为了满足尽可能复杂的任务,其模型的拟合能力一般远远高于问题复杂度,也就是说,机器学习算法有「拟合出正确规则的前提下,进一步拟合噪声」的能力。

而传统的函数拟合问题(如机器人系统辨识),一般都是通过经验、物理、数学等推导出一个含参模型,模型复杂度确定了,只需要调整个别参数即可。模型「无多余能力」拟合噪声。

解决方法

获取更多数据

这是解决过拟合最有效的方法,只要给足够多的数据,让模型「看见」尽可能多的「例外情况」,它就会不断修正自己,从而得到更好的结果:

如何获取更多数据,可以有以下几个方法:

  • 从数据源头获取更多数据:这个是容易想到的,例如物体分类,我就再多拍几张照片好了;但是,在很多情况下,大幅增加数据本身就不容易;另外,我们不清楚获取多少数据才算够;

  • 根据当前数据集估计数据分布参数,使用该分布产生更多数据:这个一般不用,因为估计分布参数的过程也会代入抽样误差。

  • 数据增强(Data Augmentation):通过一定规则扩充数据。如在物体分类问题里,物体在图像中的位置、姿态、尺度,整体图片明暗度等都不会影响分类结果。我们就可以通过图像平移、翻转、缩放、切割等手段将数据库成倍扩充;

使用合适的模型

前面说了,过拟合主要是有两个原因造成的:数据太少+模型太复杂。所以,我们可以通过使用合适复杂度的模型来防止过拟合问题,让其足够拟合真正的规则,同时又不至于拟合太多抽样误差。

(PS:如果能通过物理、数学建模,确定模型复杂度,这是最好的方法,这也就是为什么深度学习这么火的现在,我还坚持说初学者要学掌握传统的建模方法。)

对于神经网络而言,我们可以从以下四个方面来限制网络能力:

网络结构 Architecture

这个很好理解,减少网络的层数、神经元个数等均可以限制网络的拟合能力;

训练时间 Early stopping

对于每个神经元而言,其激活函数在不同区间的性能是不同的:

当网络权值较小时,神经元的激活函数工作在线性区,此时神经元的拟合能力较弱(类似线性神经元)。

有了上述共识之后,我们就可以解释为什么限制训练时间(early stopping)有用:因为我们在初始化网络的时候一般都是初始为较小的权值。训练时间越长,部分网络权值可能越大。如果我们在合适时间停止训练,就可以将网络的能力限制在一定范围内。

限制权值 Weight-decay,也叫正则化(regularization)

原理同上,但是这类方法直接将权值的大小加入到 Cost 里,在训练的时候限制权值变大。以 L2 regularization为例:

训练过程需要降低整体的 Cost,这时候,一方面能降低实际输出与样本之间的误差 ,也能降低权值大小。

增加噪声 Noise

给网络加噪声也有很多方法:

在输入中加噪声:

噪声会随着网络传播,按照权值的平方放大,并传播到输出层,对误差 Cost 产生影响。推导直接看 Hinton 的 PPT 吧:

在输入中加方差为 $\sigma_i^2$ 的高斯噪声,经过权重传播后,会在输出的期望平方误差中额外产生 $\sum_i \sigma_i^2 w_i^2$ 的干扰项。训练时为了减小误差,也会对这个干扰项进行惩罚,达到减小权值平方的目的,起到与 L2 regularization 类似的效果(对比公式)。

在权值上加噪声

在初始化网络的时候,用0均值的高斯分布作为初始化。Alex Graves 的手写识别 RNN 就是用了这个方法

Graves, Alex, et al. “A novel connectionist system for unconstrained handwriting recognition.” IEEE transactions on pattern analysis and machine intelligence 31.5 (2009): 855-868.

  • It may work better, especially in recurrent networks (Hinton)

对网络的响应加噪声

如在前向传播过程中,让某些神经元的输出变为 binary 或 random。显然,这种有点乱来的做法会打乱网络的训练过程,让训练更慢,但据 Hinton 说,在测试集上效果会有显著提升(But it does significantly better on the test set!)。

结合多种模型

简而言之,训练多个模型,以每个模型的平均输出作为结果。

从 N 个模型里随机选择一个作为输出,其期望误差会比所有模型平均输出的误差要大。

大概基于这个原理,就可以有很多方法了:

Bagging

简单理解,就是分段函数的概念:用不同的模型拟合训练集的不同部分。以随机森林(Random Forests)为例,就是训练了一堆互不关联的决策树。但由于训练神经网络本身就需要耗费较多时间和算力,所以一般不单独使用神经网络做 Bagging。

Boosting

既然训练复杂神经网络比较慢,那我们就可以只使用简单的神经网络(层数、神经元数限制等)。通过训练一系列简单的神经网络,加权平均其输出。

Dropout

在训练时,每次随机(如50%概率)忽略隐层的某些节点;这样,我们相当于随机从2^H个模型中采样选择模型;同时,由于每个网络只见过一个训练数据(每次都是随机的新网络),所以类似 bagging 的做法,这就是我为什么将它分类到「结合多种模型」中;

此外,而不同模型之间权值共享(共同使用这 H 个神经元的连接权值),相当于一种权值正则方法,实际效果比 L2 regularization 更好。

贝叶斯方法

总结

深度学习-权重初始化

  • 为什么要权重初始化

  • Xavier初始化的推导

权重初始化

In order to avoid neurons becoming too correlated and ending up in poor local minimize, it is often helpful to randomly initialize parameters. 为了避免神经元高度相关和局部最优化,常常需要采用随机初始化权重参数,最常用的就是Xavier initiazation.

为什么我们需要权重初始化?

如果权重参数很小的话,输入信号在前向传播过程中会不断减小(每经过一层都缩小一些),也就是每一层 layer 都会使得输入变小。同样的道理,如果权重参数过大的话,也会造成前向传播的数值越来越大。这样会带来什么样的后果呢?以激活函数 sigmoid 为例:

如果以sigmoid为激活函数,我们可以发现,在每一层layer输出 $W^Tx$ ,也就是激活函数的输入,其值越接近于0的时候,函数近似于线性的,因而就失去了非线性的性质。这种情况下,我们就失去了多层神经网络的优势了。

如果初始权重过大,在前向传播的过程中,输入数据的方差variance会增长很快。怎么理解这句话?

以one layer为例,假设输入是 $x\in R^{1000}$, 线性输出是 $y\in R^{100}$.

$$y_j=w_{j,1}x_1+w_{j,2}x_2+…+w_{j,1000}x_{1000}$$

x 可以看作是 1000 维的正态分布,每一维 $x_i\sim N(0,1)$。如果 $w_j$ 的值很大,比如 $w_j=[100,100,…,100]$,那么每一项 $w_{j,i}x_i$ 的方差就是 $100^2=10^4$,累加 1000 项之后输出神经元 $y_j$ 的方差高达 $10^7$,而均值仍然是 0。

那么激活函数的输入很有可能是一个远小于-1或远大于1的数,通过激活函数所得的值会非常接近于0或者1,也就是隐藏层神经元处于饱和状态(saturated),其梯度也就接近于0了。

所以权重初始化非常重要。那么应该如何初始化呢?核心是要保证经过每一层 layer 之后,线性输出的方差保持不变,这样就可以避免数值溢出或梯度消失。

Xavier Initialization

我们的目的是保持线性输出的方差不变。

以线性输出的一个神经元为例,也就是y的一个维度:

$$y_j=w_{j,1}x_1+w_{j,2}x_2+…+w_{j,N} x_N+b$$

其方差:

$$var(y_j) = var(w_{j,1}x_1+w_{j,2}x_2+…+w_{j,N} x_N+b)$$

其中每一项根据方差公式可得:

$$var(w_{j,i}x_i) = E(x_i)^2var(w_{j,i}) + E(w_{j,i})^2var(x_i) + var(w_{j,i})var(x_i)$$

来自维基百科: https://en.wikipedia.org/wiki/Variance

其中我们假设输入和权重都是来自于均值为0的正态分布。

$$var(w_{j,i}x_i)=var(w_{j,i})var(x_i)$$

其中b是常量,那么:

$$var(y_j) = var(w_{j,1})var(x_1) + … + var(w_{j,N})var(x_N)$$

因为 $x_1,x_2,..,x_N$ 都是相同的分布,$W_{j,i}$ 也是,那么就有:

$$var(y_j) = N \cdot var(w_{j,i}) \cdot var(x_i)$$

可以看到,如果输入神经元数目N很大,参数权重W的值也很大的话,会造成线性输出的值的方差很大。

我们需要保证 $y_j$ 的方差和 $x_j$ 的方差一样,所以:

$$N*var(W_{j,i})=1$$

$$var(W_{j,i})=1/N$$

There we go! 这样我们就得到了Xavier initialization的初始化公式,也就是说参数权重初始化为均值为0,方差为 1/N 的高斯分布,其中N表示当前层输入神经元的个数。在caffe中就是这样实现的。

更多初始化方式

Understanding the difficulty of training deep feedforward neural networks 在这篇paper中提出

$$var(w)=2/(N_{in}+N_{out})$$

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification 则针对 ReLU 提出了专门的初始化方式,使得 $var(w)=2.0/N$,在实际工程中通常使用这种方式。

```python
### 正态分布
w = np.random.randn(N) * np.sqrt(2.0 / N)

### 均匀分布
def _xavier_initializer(shape, **kwargs):
    """
    Args:
        shape: Tuple or 1-d array that specifies dimensions of requested tensor.
    Returns:
        out: tf.Tensor of specified shape sampled from Xavier distribution.
    """
    epsilon = np.sqrt(6 / np.sum(shape))
    out = tf.Variable(tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon))
    return out
```

均匀分布[a,b]的方差:$\dfrac{(b-a)^2}{12}$
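可以据此验证上面代码中均匀分布的取法与 $var(w)=2/(N_{in}+N_{out})$ 是一致的:取 $\epsilon=\sqrt{6/(N_{in}+N_{out})}$,则

$$var(w)=\dfrac{(2\epsilon)^2}{12}=\dfrac{\epsilon^2}{3}=\dfrac{2}{N_{in}+N_{out}}$$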

参考资料:

深度学习-Dropout

dropout的数学原理。

Dropout

随机失活(Dropout)

是一个简单又极其有效的正则化方法。该方法由Srivastava在论文Dropout: A Simple Way to Prevent Neural Networks from Overfitting中提出的,与L1正则化,L2正则化和最大范式约束等方法互为补充。在训练的时候,随机失活的实现方法是让神经元以超参数p的概率被激活或者被设置为0。

在训练过程中,随机失活可以被认为是对完整的神经网络抽样出一些子集,每次基于输入数据只更新子网络的参数(然而,数量巨大的子网络们并不是相互独立的,因为它们都共享参数)。在测试过程中不使用随机失活,可以理解为是对数量巨大的子网络们做了模型集成(model ensemble),以此来计算出一个平均的预测。

关于dropout的理解:知乎上的回答

python代码:

```python
""" 普通版随机失活: 不推荐实现 (看下面笔记) """

p = 0.5  # 激活神经元的概率. p值更高 = 随机失活更弱

def train_step(X):
    """ X中是输入数据 """

    # 3层neural network的前向传播
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # 第一个随机失活遮罩,rand() 产生 [0,1) 的随机数
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # 第二个随机失活遮罩
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    # 反向传播:计算梯度... (略)
    # 进行参数更新... (略)

def predict(X):
    # 前向传播时模型集成
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p  # 注意:激活数据要乘以p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # 注意:激活数据要乘以p
    out = np.dot(W3, H2) + b3
```

在上面的代码中,train_step函数在第一个隐层和第二个隐层上进行了两次随机失活。在输入层上面进行随机失活也是可以的,为此需要为输入数据X创建一个二值的遮罩。反向传播保持不变,但是肯定需要将遮罩U1和U2加入进去。

注意:在 predict 函数中不进行随机失活,但是对于两个隐层的输出都要乘以 p,调整其数值范围。这一点非常重要,因为在测试时所有的神经元都能看见它们的输入,因此我们想要神经元的输出与训练时的预期输出是一致的。以 p=0.5 为例,在测试时神经元必须把它们的输出减半,这是因为在训练的时候它们的输出只有一半。为了理解这点,先考虑某个神经元未失活时的输出 x,那么加入随机失活后,该神经元输出的期望就是 $px+(1-p)\cdot 0$,因为有 1-p 的概率神经元的输出为 0。在测试时神经元总是激活的,就必须把输出调整为 $x \to px$ 来保持同样的预期输出。在测试时会在所有可能的二值遮罩(也就是数量庞大的所有子网络)中迭代并计算它们的协作预测,进行这种减弱的操作也可以认为是与之相关的。

反向随机失活

它是在训练时就进行数值范围调整,从而让前向传播在测试时保持不变。这样做还有一个好处,无论你决定是否使用随机失活,预测方法的代码可以保持不变。反向随机失活的代码如下:

```python
"""
反向随机失活: 推荐实现方式.
在训练的时候drop和调整数值范围,测试时不做任何事.
"""

p = 0.5  # 激活神经元的概率. p值更高 = 随机失活更弱

def train_step(X):
    # 3层neural network的前向传播
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # 第一个随机失活遮罩. 注意/p!
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # 第二个随机失活遮罩. 注意/p!
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    # 反向传播:计算梯度... (略)
    # 进行参数更新... (略)

def predict(X):
    # 前向传播时模型集成
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # 不用数值范围调整了
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
```

在随机失活发布后,很快有大量研究为什么它的实践效果如此之好,以及它和其他正则化方法之间的关系。如果你感兴趣,可以看看这些文献:

Dropout paper by Srivastava et al. 2014.

Dropout Training as Adaptive Regularization:“我们认为:在使用费希尔信息矩阵(fisher information matrix)的对角逆矩阵的期望对特征进行数值范围调整后,再进行L2正则化这一操作,与随机失活正则化是一阶相等的。”

机器学习中的一些 tricks

L2正则化的数学原理

L2正则化:

To avoid parameters from exploding or becoming highly correlated, it is helpful to augment our cost function with a Gaussian prior: this tends to push parameter weights closer to zero, without constraining their direction, and often leads to classifiers with better generalization ability.

If we maximize log-likelihood (as with the cross-entropy loss, above), then the Gaussian prior becomes a quadratic term (L2 regularization):

$$J_{reg}(\theta)=\dfrac{\lambda}{2}\left[\sum_{i,j}(W_1)_{i,j}^2+\sum_{i',j'}(W_2)_{i',j'}^2\right]$$

可以证明: 

$$W_{ij} \sim N(0,\ 1/\lambda)$$
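简要的推导思路:对高斯先验取负对数,略去常数项,就得到上面的二次正则项:

$$-\log p(W)=-\sum_{i,j}\log N(W_{ij};\,0,\,1/\lambda)=\dfrac{\lambda}{2}\sum_{i,j}W_{ij}^2+\text{const}$$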

从两种角度理解正则化:知乎

RNN为什么容易出现梯度消失和梯度爆炸问题

relu为啥能有效的解决梯度消失的问题

很难理解为啥用relu能很好的解决梯度消失的问题,的确relu的梯度为1,但这也太简单了吧。。。所以得看看原论文 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units