深度学习-优化算法

paper: An overview of gradient descent optimization algorithms

Gradient descent variants

Batch gradient descent

Batch gradient descent computes the gradient of the cost function with respect to the parameters \(\theta\) for the entire training dataset:

\[\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta)\]

for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic gradient descent

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\):

\[\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})\]

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily, so the loss curve is very noisy.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
The minimum that batch gradient descent converges to depends heavily on where the parameters start (i.e. on the weight initialization). Because SGD's loss fluctuates, it can jump out of the current basin and reach a better local optimum, but the same fluctuation also makes it harder for SGD to settle exactly at a minimum. Experiments show that if the learning rate is decreased slowly, SGD and batch gradient descent reach equally good optima.
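As a hedged illustration of "slowly decreasing the learning rate", here is a minimal sketch of noisy gradient descent on a toy 1-D quadratic; the 1/(1 + decay*t) schedule and all constants are assumptions chosen only for illustration.

import numpy as np

# Toy example: noisy gradient steps on f(x) = x^2 with a decaying learning rate.
x = 5.0
lr0, decay = 0.1, 0.01
rng = np.random.default_rng(0)

for t in range(1000):
    grad = 2 * x + rng.normal(scale=0.5)   # noisy gradient, mimicking SGD's variance
    lr = lr0 / (1 + decay * t)             # slowly decrease the learning rate
    x = x - lr * grad

print(x)  # close to the minimum at 0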

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

\[\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})\]

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
This way, mini-batch gradient descent:
  • reduces the variance of the parameter updates, which can lead to more stable convergence;
  • can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries, which makes computing the gradient w.r.t. a mini-batch very efficient.
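The snippet above calls a get_batches helper that is not defined anywhere; the name and signature come from the snippet, but the body below is a minimal sketch and an assumption on my part.

def get_batches(data, batch_size=50):
    # yield consecutive slices of `data` of length `batch_size`
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]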

Challenges

  • Choosing a proper learning rate can be difficult: a learning rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence and make the loss fluctuate around the minimum or even diverge.

  • Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in the objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics (a step-decay sketch follows this list).

  • The same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all parameters to the same extent, but rather perform larger updates for parameters associated with rarely occurring features.

  • Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
    For more on saddle points and how SGD can escape them, see the Zhihu post 知乎:如何逃离鞍点.
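As referenced in the list above, here is a hedged sketch of a pre-defined (step-decay) learning rate schedule; the step size and decay factor are illustrative assumptions, not recommendations.

def step_decay(lr0, epoch, step_size=30, factor=0.1):
    # multiply the base learning rate by `factor` every `step_size` epochs
    return lr0 * (factor ** (epoch // step_size))

for epoch in [0, 29, 30, 60, 90]:
    print(epoch, step_decay(0.1, epoch))
# 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89, 1e-4 from 90 on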

Gradient descent optimization algorithms

Momentum

paper: Neural networks : the official journal of the International Neural Network Society

Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b. It does this by adding a fraction \(\gamma\) of the update vector of the past time step to the current update vector.

without Momentum: \[\theta = \theta - \eta \nabla_{\theta}J(\theta)\]

with Momentum: \[v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta)\] \[\theta=\theta-v_t\]

Intuition for momentum: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.
In the figure above, the gradients along one direction consistently point the same way, so momentum accumulates and the velocity in that direction keeps growing; along the other direction the gradient keeps changing sign, so the velocity stays small and the oscillations in that direction are damped.

In other words, think of the gradient as an acceleration and of the parameter update as a velocity (the size of one step). If the acceleration always points the same way, the velocity keeps increasing; if its direction keeps flipping, the velocity stays relatively small.

\(\gamma\) acts like a friction coefficient and is usually set to 0.9; \(\eta\) is the learning rate.
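A minimal sketch of the momentum update in plain Python/NumPy; grad_fn, the toy objective, and the constants are assumptions chosen only for illustration.

import numpy as np

def momentum_step(params, velocity, grad_fn, lr=0.01, gamma=0.9):
    grad = grad_fn(params)                    # the "acceleration"
    velocity = gamma * velocity + lr * grad   # v_t = gamma * v_{t-1} + eta * grad
    params = params - velocity                # theta = theta - v_t
    return params, velocity

# usage on a toy quadratic objective f(theta) = ||theta||^2 / 2, whose gradient is theta itself
params = np.array([3.0, -2.0])
velocity = np.zeros_like(params)
for _ in range(100):
    params, velocity = momentum_step(params, velocity, grad_fn=lambda p: p)
print(params)  # close to the minimum at [0, 0]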

Nesterov accelerated gradient (NAG)

paper: Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).

We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.
With plain momentum, the velocity along the descent direction keeps accumulating, so near the minimum it is still large; we overshoot the minimum and have to come back, which lengthens training. We would therefore like the update to have some foresight, so that it slows down as it approaches the optimum.

\[v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})\] \[\theta=\theta-v_t\]

In momentum, we update the parameters with the velocity \(\gamma v_{t-1}\) plus the current gradient. Near a local optimum, the gradient of the objective w.r.t. \(\theta\) becomes smaller and smaller, approaching 0: the velocity still grows, but by less and less each step. NAG instead uses the gradient at the look-ahead position, \(\nabla_{\theta}J(\theta-\gamma v_{t-1})\), in place of \(\nabla_{\theta}J(\theta)\); when the momentum term is about to carry us past a local optimum, this look-ahead gradient already points back, so the velocity, and hence the parameter update, shrinks accordingly.

In code, computing the gradient of \(J(\theta-\gamma v_{t-1})\) directly is inconvenient; one can substitute \(\phi = \theta-\gamma v_{t-1}\) and work in terms of \(\phi\). See the TensorFlow or PyTorch implementations for details.
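A minimal NAG sketch in the same style, evaluating the gradient at the look-ahead position \(\theta-\gamma v_{t-1}\); again grad_fn and the constants are illustrative assumptions.

import numpy as np

def nag_step(params, velocity, grad_fn, lr=0.01, gamma=0.9):
    lookahead = params - gamma * velocity      # where momentum alone would carry us
    grad = grad_fn(lookahead)                  # gradient evaluated at the look-ahead point
    velocity = gamma * velocity + lr * grad    # v_t = gamma * v_{t-1} + eta * grad(theta - gamma * v_{t-1})
    params = params - velocity                 # theta = theta - v_t
    return params, velocity

# usage on the same toy quadratic objective
params = np.array([3.0, -2.0])
velocity = np.zeros_like(params)
for _ in range(100):
    params, velocity = nag_step(params, velocity, grad_fn=lambda p: p)
print(params)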

Adagrad

paper: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.
The gradient step is scaled adaptively per parameter: low-frequency parameters or features receive larger updates, high-frequency ones receive smaller updates. For example, when training GloVe word vectors, a low-frequency word may not contribute to the loss at a given step and is therefore updated only rarely, so we want it to receive relatively larger updates when it does appear.

At time step \(t\), the gradient of the objective w.r.t. parameter \(\theta_i\) is \[g_{t,i}=\nabla_{\theta}J(\theta_{t,i})\] The plain SGD update for each parameter is \[\theta_{t+1,i}=\theta_{t,i}-\eta \cdot g_{t,i}\] which Adagrad modifies to \[\theta_{t+1,i}=\theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\, g_{t,i}\]

\(G_t\) is a diagonal matrix whose diagonal element \(G_{t,ii}\) is the sum of the squares of the gradients w.r.t. \(\theta_i\) up to time step \(t\).

# cache accumulates the sum of squared gradients (the diagonal of G_t)
cache += dx**2
x += -lr * dx / (np.sqrt(cache) + 1e-7)

RMSprop

Geoff Hinton Lecture 6e

In Adagrad the cache keeps growing, so the effective step size eventually decays towards zero. RMSprop fixes this by giving the cache a decay rate, which amounts to considering only recent gradients; gradients from long ago are decayed away and have little influence.

\[E[g^2]_t=0.9E[g^2]_{t-1}+0.1g^2_t\] \[\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t\]

# exponentially decaying average of squared gradients
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -lr * dx / (np.sqrt(cache) + 1e-7)

Storing the cache as an exponentially decaying average also saves memory: we only keep the current running average instead of all past gradients.

Adam(Adaptive Moment Estimation)

Adam: a Method for Stochastic Optimization.

In addition to storing an exponentially decaying average of past squared gradients \(v_t\) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients \(m_t\), similar to momentum:

similar to momentum:
\[m_t=\beta_1m_{t-1}+(1-\beta_1)g_t\] similar to Adagrad/RMSprop:
\[v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2\]

\(m_t\) and \(v_t\) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As \(m_t\) and \(v_t\) are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

\[\hat m_t=\dfrac{m_t}{1-\beta^t_1}\]

\[\hat v_t=\dfrac{v_t}{1-\beta^t_2}\]

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule: \[\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+ \epsilon}\hat m_t\]

  • \(m_t\) plays the role of the momentum update: an exponentially decaying average of the gradients. \(\beta_1\) is the friction coefficient, usually set to 0.9.
  • \(v_t\) plays the role of the cache in RMSprop, adaptively rescaling the step for each parameter.
  • \(\beta_2\) is the decay rate of the cache, usually set to 0.999 (see the sketch after this list).
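Putting the pieces above together, here is a minimal sketch of the full Adam update with bias correction; grad_fn, the toy objective, and the usage constants are illustrative assumptions.

import numpy as np

def adam_step(params, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(params)
    m = beta1 * m + (1 - beta1) * g            # exponentially decaying average of gradients
    v = beta2 * v + (1 - beta2) * g**2         # exponentially decaying average of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                 # bias-corrected second moment
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# usage on a toy quadratic objective (t starts at 1 so the bias correction is well defined)
params = np.array([3.0, -2.0])
m, v = np.zeros_like(params), np.zeros_like(params)
for t in range(1, 2001):
    params, m, v = adam_step(params, m, v, t, grad_fn=lambda p: p, lr=0.01)
print(params)  # close to [0, 0]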

AdaMax

Adam: a Method for Stochastic Optimization.

In Adam, the \(v_t\) factor that normalizes the gradient scales it inversely proportionally to the \(l_2\) norm of the past gradients (contained in \(v_{t-1}\)) and of the current gradient \(|g_t|^2\): \[v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2\]

This update can be generalized to the \(l_p\) norm; note that \(\beta_2\) is then also parameterized as \(\beta_2^p\).

Norms for large \(p\) values generally become numerically unstable, which is why \(l_1\) and \(l_2\) norms are most common in practice. However, \(l_{\infty}\) also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that \(v_t\) with \(l_{\infty}\) converges to the following more stable value. To avoid confusion with Adam, we use \(u_t\) to denote the infinity norm-constrained \(v_t\):

\[u_t=\beta_2^{\infty}v_{t-1}+(1-\beta_2^{\infty})|g_t|^{\infty}\]

\[=\max(\beta_2\cdot v_{t-1}, |g_t|)\]

Then \(u_t\) replaces \(\sqrt{\hat v_t}+\epsilon\) in the Adam update: \[\theta_{t+1}=\theta_t-\dfrac{\eta}{u_t}\hat m_t\]

Note that as \(u_t\) relies on the max operation, it is not as suggestible to bias towards zero as \(m_t\) and \(v_t\) in Adam, which is why we do not need to compute a bias correction for \(u_t\). Good default values are again: \[\eta = 0.002, \beta_1 = 0.9, \beta_2 = 0.999.\]
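A minimal AdaMax sketch following the equations above; grad_fn and the usage constants are illustrative assumptions, and the tiny constant in the denominator is only a guard against division by zero before \(u_t\) becomes positive.

import numpy as np

def adamax_step(params, m, u, t, grad_fn, lr=0.002, beta1=0.9, beta2=0.999):
    g = grad_fn(params)
    m = beta1 * m + (1 - beta1) * g          # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(g))     # infinity-norm constrained second moment, no bias correction needed
    m_hat = m / (1 - beta1**t)               # bias correction only for m
    params = params - lr * m_hat / (u + 1e-8)
    return params, m, u

# usage on a toy quadratic objective
params = np.array([3.0, -2.0])
m, u = np.zeros_like(params), np.zeros_like(params)
for t in range(1, 2001):
    params, m, u = adamax_step(params, m, u, t, grad_fn=lambda p: p)
print(params)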

Visualization of algorithms

> we see the path they took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness by looking ahead and headed to the minimum.
If the objective has a surface like the Beale function, the adaptive methods head almost straight for the minimum. Momentum and NAG are led off track, like a ball rolling downhill that cannot brake in time, but NAG's look-ahead gives it some foresight, so it corrects its course earlier.

Another visualization shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope while the other dimension has a negative slope, which, as mentioned before, poses a difficulty for SGD. Notice here that SGD, Momentum, and NAG find it difficult to break the symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.
In short: SGD, Momentum, and NAG struggle to break the symmetry at the saddle point, while the adaptive methods Adagrad, RMSprop, and Adadelta escape it quickly.

example

model

import torch
import torch.nn as nn
import torch.optim as optim

class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.linear1 = nn.Linear(10, 5)
        self.linear2 = nn.Linear(5, 1)
        self.loss = nn.BCELoss()

    def forward(self, x, label):
        """
        x: [batch, 10]
        label: [batch]
        """
        # BCELoss expects probabilities in [0, 1], so squash the logits with a sigmoid
        out = torch.sigmoid(self.linear2(self.linear1(x))).squeeze()
        loss = self.loss(out, label)
        return out, loss
model = TestNet()
model
TestNet(
  (linear1): Linear(in_features=10, out_features=5, bias=True)
  (linear2): Linear(in_features=5, out_features=1, bias=True)
  (loss): BCELoss()
)
list(model.named_parameters())
[('linear1.weight', Parameter containing:
  tensor([[ 0.2901, -0.0022, -0.1515, -0.1064, -0.0475, -0.0324,  0.0404,  0.0266,
           -0.2358, -0.0433],
          [-0.1588, -0.1917,  0.0995,  0.0651, -0.2948, -0.1830,  0.2356,  0.1060,
            0.2172, -0.0367],
          [-0.0173,  0.2129,  0.3123,  0.0663,  0.2633, -0.2838,  0.3019, -0.2087,
           -0.0886,  0.0515],
          [ 0.1641, -0.2123, -0.0759,  0.1198,  0.0408, -0.0212,  0.3117, -0.2534,
           -0.1196, -0.3154],
          [ 0.2187,  0.1547, -0.0653, -0.2246, -0.0137,  0.2676,  0.1777,  0.0536,
           -0.3124,  0.2147]], requires_grad=True)),
 ('linear1.bias', Parameter containing:
  tensor([ 0.1216,  0.2846, -0.2002, -0.1236,  0.2806], requires_grad=True)),
 ('linear2.weight', Parameter containing:
  tensor([[-0.1652,  0.3056,  0.0749, -0.3633,  0.0692]], requires_grad=True)),
 ('linear2.bias', Parameter containing:
  tensor([0.0450], requires_grad=True))]

add model parameters to optimizer

import torch.optim as optim

# parameters = model.parameters()
parameters_filters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = optim.Adam(
    params=parameters_filters,
    lr=0.001,
    betas=(0.8, 0.999),
    eps=1e-8,
    weight_decay=3e-7)
optimizer.state_dict
<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 3e-07
)>

Setting different hyperparameters for different modules

parameters = [{"params": model.linear1.parameters()},
              {"params": model.linear2.parameters(), "lr": 3e-4}]
optimizer2 = optim.Adam(
    params=parameters,
    lr=0.001,
    betas=(0.8, 0.999),
    eps=1e-8,
    weight_decay=3e-7)
optimizer2.state_dict
<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 3e-07

Parameter Group 1
    amsgrad: False
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.0003
    weight_decay: 3e-07
)>

zero_grad

Before backpropagation, unless you want to accumulate gradients, you must call zero_grad() to clear them. It simply iterates over all parameters in self.param_groups and clears each parameter's grad attribute.

def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()
group_parameters = [{"params": model.linear1.parameters()},
                    {"params": model.linear2.parameters(), "lr": 3e-4}]
x = torch.randn(2, 10)
label = torch.Tensor([1, 0])
out, loss = model(x, label)
loss.backward()
optimizer2.zero_grad()
# doing the same thing by hand: clear the gradients group by group
for group in group_parameters:
    for p in group["params"]:
        if p.grad is not None:
            p.grad.detach_()
            p.grad.zero_()

Note that if backward() has not been called yet, p.grad is None, so the loop above has nothing to clear.

After backward() has computed the gradients, step() performs the parameter update. In the base Optimizer class, however, step() is left unimplemented and raises NotImplementedError as a reminder; each concrete optimizer provides its own step(), whose implementation we analyze later.
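To make the idea concrete, here is a hedged sketch of what a concrete step() roughly does, using plain SGD without momentum; this is not PyTorch's actual implementation, just an illustration of the pattern.

import torch

class ToySGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01):
        super(ToySGD, self).__init__(params, defaults={"lr": lr})

    @torch.no_grad()
    def step(self):
        # iterate over parameter groups, exactly as zero_grad() does above
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"])  # p <- p - lr * grad

# usage: opt = ToySGD(model.parameters(), lr=0.1); loss.backward(); opt.step()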

The lr_scheduler helper class

lr_scheduler adjusts the learning rate flexibly over the course of training, typically per epoch. There are many ways to adjust the learning rate, but they are all used in much the same way: wrap the original optimizer in a scheduler, pass in the relevant parameters, and then call step() on the scheduler.

# lambda1 = lambda epoch: epoch // 30
lambda1 = lambda epoch: 0.95 ** epoch
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)

scheduler.step()
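A minimal sketch of where scheduler.step() sits relative to optimizer.step(), reusing the model, optimizer, and scheduler defined above; the toy batches are an assumption purely for illustration.

fake_batches = [(torch.randn(2, 10), torch.Tensor([1, 0])) for _ in range(5)]

for epoch in range(3):
    for x, label in fake_batches:
        optimizer.zero_grad()
        out, loss = model(x, label)
        loss.backward()
        optimizer.step()       # update parameters with the current learning rate
    scheduler.step()           # then decay the learning rate once per epoch
    print(epoch, optimizer.param_groups[0]["lr"])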

warm up scheduler

import math

parameters = filter(lambda p: p.requires_grad, model.parameters())

lr_warm_up_num = 1000

optimizer = optim.Adam(
    params=parameters,
    lr=0.001,
    betas=(0.8, 0.999),
    eps=1e-8,
    weight_decay=3e-7)

# during warm-up, scale the base lr by log(step + 1) / log(lr_warm_up_num);
# after lr_warm_up_num steps the factor becomes 1 and the base lr is used as-is
cr = 1.0 / math.log(lr_warm_up_num)

scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda ee: cr * math.log(ee + 1)
    if ee < lr_warm_up_num else 1)
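To see how the warm-up ramps the learning rate, one can step the scheduler in a bare loop and print the current value; step() is called here without any training purely to visualize the schedule (recent PyTorch versions may warn about calling scheduler.step() before optimizer.step(), which is harmless in this context).

for step in range(1, 1001):
    scheduler.step()
    if step % 250 == 0:
        print(step, optimizer.param_groups[0]["lr"])
# the learning rate grows roughly logarithmically from 0 towards the base lr = 0.001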