# 深度学习-优化算法

Batch gradient descent computes the gradient of the cost function w.r.t. the parameters $\theta$ for the entire training dataset:

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta)$$

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})$$

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. However, SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of $n$ training examples. This way, it:

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$$

• reduces the variance of the parameter updates, which can lead to more stable convergence;

• can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient.
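As an illustration, here is a minimal mini-batch gradient descent loop on a toy least-squares problem (the data, model, and hyperparameters below are invented for this sketch, not taken from the text):

```python
import numpy as np

# Toy least-squares problem: J(theta) = ||X @ theta - y||^2 / (2n)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
lr, batch = 0.1, 10
for epoch in range(50):
    idx = rng.permutation(len(X))          # reshuffle the data every epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]       # indices of one mini-batch
        grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
        theta -= lr * grad                 # theta = theta - eta * grad
```

With `batch = 1` this reduces to SGD; with `batch = len(X)` it is batch gradient descent.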

## Challenges

• Choosing a proper learning rate can be difficult: a learning rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge.

• Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics.

• Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

• Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

### Momentum

paper: [Neural networks : the official journal of the International Neural Network Society]()

Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

without Momentum:

$$\theta = \theta - \eta \nabla_{\theta}J(\theta)$$

with Momentum:

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta)$$

$$\theta=\theta-v_t$$

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.

$\gamma$ can be viewed as a friction coefficient and is usually set to 0.9; $\eta$ is the learning rate.
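A minimal NumPy sketch of these two update rules on a toy quadratic $J(\theta)=\frac{1}{2}\theta^2$ (the objective and hyperparameters are illustrative assumptions):

```python
import numpy as np

def grad(theta):
    return theta                       # gradient of J(theta) = 0.5 * theta**2

theta, v = np.array([5.0]), np.array([0.0])
gamma, eta = 0.9, 0.1                  # friction coefficient and learning rate
for _ in range(200):
    v = gamma * v + eta * grad(theta)  # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v                  # theta = theta - v_t
```

Because consecutive gradients here point in the same direction, $v$ accumulates and the iterate moves faster than plain SGD would with the same learning rate.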

### NAG

paper: [Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).]()

We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$$

$$\theta=\theta-v_t$$

The look-ahead point

$$\phi = \theta-\gamma v_{t-1}$$

is an approximation of the next position of the parameters, so the momentum term effectively evaluates the gradient w.r.t. the approximate future position of the parameters rather than the current one.
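A NumPy sketch on a toy quadratic $J(\theta)=\frac{1}{2}\theta^2$ (an illustrative assumption); note that the only change from plain momentum is where the gradient is evaluated:

```python
import numpy as np

def grad(theta):
    return theta                            # gradient of J = 0.5 * theta**2

theta, v = np.array([5.0]), np.array([0.0])
gamma, eta = 0.9, 0.1
for _ in range(200):
    lookahead = theta - gamma * v           # phi = theta - gamma * v_{t-1}
    v = gamma * v + eta * grad(lookahead)   # gradient at the future position
    theta = theta - v
```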

### Adagrad

paper: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization]()

Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

$$g_{t,i}=\nabla_{\theta_t}J(\theta_{t,i})$$

$$\theta_{t+1,i}=\theta_{t,i}-\eta \cdot g_{t,i}$$

$$\theta_{t+1,i}=\theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\, g_{t,i}$$

$G_t$ is a diagonal matrix in which each diagonal element $G_{t,ii}$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$.
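The same update in NumPy on a toy quadratic $J(\theta)=\frac{1}{2}\|\theta\|^2$ (objective and hyperparameters are illustrative assumptions). Note how each parameter gets its own effective learning rate:

```python
import numpy as np

def grad(theta):
    return theta                           # gradient of J = 0.5 * ||theta||^2

theta = np.array([5.0, -3.0])
G = np.zeros_like(theta)                   # diagonal of G_t: running sum of squared gradients
eta, eps = 0.5, 1e-8
for _ in range(500):
    g = grad(theta)
    G += g ** 2                            # accumulate G_{t,ii}
    theta -= eta / np.sqrt(G + eps) * g    # per-parameter effective learning rate
```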

### RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class.

$$E[g^2]_t=0.9E[g^2]_{t-1}+0.1g^2_t$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t$$
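A NumPy sketch of RMSprop on a toy quadratic $J(\theta)=\frac{1}{2}\|\theta\|^2$ (objective and hyperparameters are illustrative assumptions); unlike Adagrad, the squared-gradient accumulator decays, so updates do not shrink forever:

```python
import numpy as np

def grad(theta):
    return theta                           # gradient of J = 0.5 * ||theta||^2

theta = np.array([5.0, -3.0])
Eg2 = np.zeros_like(theta)                 # E[g^2]_t
eta, eps = 0.01, 1e-8
for _ in range(2000):
    g = grad(theta)
    Eg2 = 0.9 * Eg2 + 0.1 * g ** 2         # exponentially decaying average
    theta -= eta / np.sqrt(Eg2 + eps) * g
```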

### Adam

paper: [Adam: a Method for Stochastic Optimization]()

In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:


$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$$

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

$$\hat m_t=\dfrac{m_t}{1-\beta^t_1}$$

$$\hat v_t=\dfrac{v_t}{1-\beta^t_2}$$

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+ \epsilon}\hat m_t$$

• $m_t$ is analogous to the parameter update in Momentum: a function of the gradient. $\beta_1$ plays the role of the friction coefficient and is usually set to 0.9.

• $v_t$ is analogous to the cache in RMSprop, used to adaptively rescale the gradient for each parameter.

• $\beta_2$ is the decay rate of this cache, usually set to 0.999.
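Putting the pieces together, a NumPy sketch of Adam on a toy quadratic $J(\theta)=\frac{1}{2}\|\theta\|^2$ (objective and hyperparameters are illustrative assumptions):

```python
import numpy as np

def grad(theta):
    return theta                            # gradient of J = 0.5 * ||theta||^2

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)                    # first moment estimate
v = np.zeros_like(theta)                    # second moment estimate
eta, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 3001):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)               # bias-corrected second moment
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat
```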

### AdaMax

paper: [Adam: a Method for Stochastic Optimization]()

The $v_t$ factor in the Adam update rule scales the gradient inversely proportionally to the $\ell_2$ norm of the past and current gradients:

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

We can generalize this update to the $\ell_p$ norm.

Norms for large $p$ values generally become numerically unstable, which is why $\ell_1$ and $\ell_2$ norms are most common in practice. However, $\ell_{\infty}$ also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that $v_t$ with $\ell_{\infty}$ converges to the following more stable value. To avoid confusion with Adam, we use $u_t$ to denote the infinity norm-constrained $v_t$:

$$u_t=\beta_2^{\infty}v_{t-1}+(1-\beta_2^{\infty})|g_t|^{\infty}$$

$$=\max(\beta_2\cdot v_{t-1}, |g_t|)$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{u_t}\hat m_t$$

Note that as $u_t$ relies on the max operation, it is not as susceptible to bias towards zero as $m_t$ and $v_t$ in Adam, which is why we do not need to compute a bias correction for $u_t$. Good default values are again:

$$\eta = 0.002, \beta_1 = 0.9, \beta_2 = 0.999.$$
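A NumPy sketch of AdaMax with these defaults on a toy quadratic $J(\theta)=\frac{1}{2}\|\theta\|^2$ (the objective is an illustrative assumption); only $m_t$ is bias-corrected:

```python
import numpy as np

def grad(theta):
    return theta                            # gradient of J = 0.5 * ||theta||^2

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)                    # first moment estimate
u = np.zeros_like(theta)                    # infinity-norm based second moment
eta, b1, b2 = 0.002, 0.9, 0.999             # the default values above
for t in range(1, 5001):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g
    u = np.maximum(b2 * u, np.abs(g))       # u_t = max(beta2 * u_{t-1}, |g_t|)
    m_hat = m / (1 - b1 ** t)               # only m needs bias correction
    theta -= eta / u * m_hat
```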

## Visualization of algorithms

The first plot shows the paths the optimizers took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness by looking ahead, and headed to the minimum.

The second plot shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope while the other dimension has a negative slope, which poses a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.

## Example

### Model

```
TestNet(
  (linear1): Linear(in_features=10, out_features=5, bias=True)
  (linear2): Linear(in_features=5, out_features=1, bias=True)
  (loss): BCELoss()
)
[('linear1.weight', Parameter containing:
tensor([[ 0.2901, -0.0022, -0.1515, -0.1064, -0.0475, -0.0324,  0.0404,  0.0266,
         -0.2358, -0.0433],
        [-0.1588, -0.1917,  0.0995,  0.0651, -0.2948, -0.1830,  0.2356,  0.1060,
          0.2172, -0.0367],
        [-0.0173,  0.2129,  0.3123,  0.0663,  0.2633, -0.2838,  0.3019, -0.2087,
         -0.0886,  0.0515],
        [ 0.1641, -0.2123, -0.0759,  0.1198,  0.0408, -0.0212,  0.3117, -0.2534,
         -0.1196, -0.3154],
        [ 0.2187,  0.1547, -0.0653, -0.2246, -0.0137,  0.2676,  0.1777,  0.0536,
          ...]], requires_grad=True)),
 ('linear1.bias', Parameter containing:
tensor([ 0.1216,  0.2846, -0.2002, -0.1236,  0.2806], requires_grad=True)),
 ('linear2.weight', Parameter containing:
tensor([[-0.1652,  0.3056,  0.0749, -0.3633,  0.0692]], requires_grad=True)),
 ('linear2.bias', Parameter containing:
 ...)]
```

### add model parameters to optimizer

```
<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 3e-07
)>
```
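The printout above could be produced along these lines (a hypothetical reconstruction; only the Adam hyperparameters come from the output, the model layers mirror TestNet):

```python
import torch
import torch.nn as nn

# Register all model parameters in a single Adam parameter group
# with the hyperparameters shown in the printout.
model = nn.Sequential(nn.Linear(10, 5), nn.Linear(5, 1))
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.001, betas=(0.8, 0.999), weight_decay=3e-7
)
print(optimizer.state_dict)   # a bound method; call it to get the actual dict
```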


### Setting different parameters for different modules

```
<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 3e-07

Parameter Group 1
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.0003
    weight_decay: 3e-07
)>
```
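One way to obtain two parameter groups like the above is to pass the optimizer a list of dicts and override `lr` for one module (a hypothetical sketch matching the printout):

```python
import torch
import torch.nn as nn

linear1 = nn.Linear(10, 5)
linear2 = nn.Linear(5, 1)
optimizer = torch.optim.Adam(
    [
        {"params": linear1.parameters()},                # Group 0: uses the defaults
        {"params": linear2.parameters(), "lr": 0.0003},  # Group 1: smaller lr
    ],
    lr=0.001, betas=(0.8, 0.999), weight_decay=3e-7,
)
```

Any option not overridden in a group dict falls back to the defaults given as keyword arguments.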


### The lr_scheduler helper class

lr_scheduler is used to adjust the learning rate flexibly over the course of training, e.g. per epoch. There are many methods for adjusting the learning rate, but they are all used in roughly the same way: wrap the original Optimizer in a scheduler, pass in the relevant parameters, and then call step() on the scheduler.
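A minimal sketch of that pattern with PyTorch's StepLR (the parameter, `step_size`, and `gamma` are illustrative choices):

```python
import torch
from torch.optim.lr_scheduler import StepLR

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=0.001)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # lr *= 0.1 every 10 epochs

for epoch in range(30):
    # ... forward pass, loss.backward(), etc. would go here ...
    optimizer.step()       # placeholder training step
    scheduler.step()       # let the scheduler adjust the learning rate
```

After 30 epochs the learning rate has been decayed three times, from 0.001 to about 1e-6.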

Xie Pan

2018-11-27

2021-06-29