Batch gradient descent computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset:

$\theta = \theta - \eta \nabla_{\theta}J(\theta)$

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
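As a concrete sketch (the dataset, dimensions, and learning rate here are illustrative, not from the text), batch gradient descent on a linear-regression loss might look like:

```python
import numpy as np

# Illustrative data: linear regression with J(theta) = 1/(2N) * ||X @ theta - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
eta = 0.1  # learning rate
for _ in range(500):
    # The gradient is computed over the ENTIRE training set before each update.
    grad = X.T @ (X @ theta - y) / len(X)
    theta -= eta * grad  # theta = theta - eta * grad_theta J(theta)
```

Since the loss is convex, the iterates approach the global minimizer `true_theta`.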

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:

$\theta = \theta - \eta \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})$

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.
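A minimal SGD sketch (dataset and step size are illustrative): one update per training example, with the examples reshuffled every epoch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta  # noiseless labels, so SGD can converge exactly here

theta = np.zeros(3)
eta = 0.05
for epoch in range(20):
    for i in rng.permutation(len(X)):  # visit examples in random order
        # Gradient of the squared error for the single example (x_i, y_i).
        grad = (X[i] @ theta - y[i]) * X[i]
        theta -= eta * grad
```

Each inner step uses only one example, so the updates are cheap but noisy; here the noise averages out because the labels contain no noise.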

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

$\theta = \theta - \eta \nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$

This way, mini-batch gradient descent:

• reduces the variance of the parameter updates, which can lead to more stable convergence;
• can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries that make computing the gradient with respect to a mini-batch very efficient.
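A mini-batch sketch (data and batch size illustrative): each update uses $n$ examples, and the per-batch gradient is one matrix-matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
eta, n = 0.1, 32  # n = mini-batch size
for epoch in range(100):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), n):
        Xb, yb = X[perm[start:start + n]], y[perm[start:start + n]]
        # A single matrix product computes the gradient over the whole mini-batch.
        grad = Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta -= eta * grad
```

The inner loop is structurally identical to SGD, but averaging over 32 examples lowers the variance of each step while keeping updates cheap.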

Challenges

• Choosing a proper learning rate can be difficult: a rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence altogether.

• Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.
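Two toy annealing rules matching that description (the function names and constants are illustrative, not from any library):

```python
def step_decay(base_lr, epoch, drop=0.5, every=100):
    """Pre-defined schedule: multiply the learning rate by `drop` every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def anneal_on_plateau(lr, prev_loss, loss, threshold=1e-3, drop=0.5):
    """Reduce the learning rate when the epoch-to-epoch improvement falls below a threshold."""
    return lr * drop if prev_loss - loss < threshold else lr
```

Both rules are fixed before training starts, which is exactly the limitation noted above: they cannot adapt to the dataset.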

• The same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

• Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
For more detail on saddle points and how SGD can escape them, see the Zhihu post 如何逃离鞍点 (“How to escape saddle points”).

Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

Momentum

without Momentum: $\theta = \theta - \eta \nabla_{\theta}J(\theta)$

with Momentum: $v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta)$ $\theta=\theta-v_t$

$\gamma$ can be thought of as a friction coefficient and is usually set to 0.9; $\eta$ is the learning rate.
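A scalar sketch of the momentum update on the toy objective $J(\theta)=\theta^2/2$ (so $\nabla_{\theta}J = \theta$; all constants are illustrative):

```python
gamma, eta = 0.9, 0.1  # friction coefficient and learning rate
theta, v = 5.0, 0.0
for _ in range(500):
    grad = theta                # gradient of J(theta) = theta^2 / 2
    v = gamma * v + eta * grad  # v_t = gamma * v_{t-1} + eta * grad
    theta -= v                  # theta = theta - v_t
```

The velocity `v` accumulates past gradients, so steps grow in consistent directions and partially cancel when the gradient flips sign.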

We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.

$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$ $\theta=\theta-v_t$
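On the same toy objective $J(\theta)=\theta^2/2$, the only change from plain momentum is where the gradient is evaluated (constants illustrative):

```python
gamma, eta = 0.9, 0.1
theta, v = 5.0, 0.0
for _ in range(500):
    lookahead = theta - gamma * v  # approximate next position of the parameters
    grad = lookahead               # gradient of J(theta) = theta^2/2 at the look-ahead point
    v = gamma * v + eta * grad
    theta -= v
```

Evaluating the gradient at `theta - gamma * v` lets the update anticipate the upcoming slope, which is the "prescience" described above.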

Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

The Adagrad update scales the learning rate per parameter:

$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{G_t+\epsilon}}\odot g_t$

Here $G_t$ is a diagonal matrix where each diagonal element $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$.
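A vectorized sketch on the toy objective $J(\theta)=\|\theta\|^2/2$ (constants illustrative); the accumulator array plays the role of the diagonal of $G_t$:

```python
import numpy as np

eta, eps = 0.5, 1e-8
theta = np.array([5.0, 5.0])
G = np.zeros(2)                 # running sum of squared gradients (diagonal of G_t)
for _ in range(2000):
    grad = theta                # gradient of J(theta) = ||theta||^2 / 2
    G += grad ** 2
    theta -= eta / np.sqrt(G + eps) * grad  # larger G => smaller effective step
```

Because `G` only grows, frequently-updated parameters see their effective learning rate shrink monotonically, which is Adagrad's main weakness.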

RMSprop

RMSprop is an unpublished adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class.

$E[g^2]_t=0.9E[g^2]_{t-1}+0.1g^2_t$

$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t$
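The same toy problem with a decaying average in place of Adagrad's monotone sum (constants illustrative); since old gradients are forgotten, the effective step size does not shrink forever:

```python
import numpy as np

eta, eps = 0.01, 1e-8
theta = np.array([5.0, 5.0])
Eg2 = np.zeros(2)  # E[g^2]_t, the decaying average of squared gradients
for _ in range(5000):
    grad = theta   # gradient of J(theta) = ||theta||^2 / 2
    Eg2 = 0.9 * Eg2 + 0.1 * grad ** 2
    theta -= eta / np.sqrt(Eg2 + eps) * grad
```

With a fixed $\eta$ the iterates settle near the minimum rather than converging exactly, since the normalized step approaches $\eta \cdot \mathrm{sign}(g)$.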

Adam: a Method for Stochastic Optimization.

In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:

Similar to momentum:
$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$

Similar to Adagrad/RMSprop:
$v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

$\hat m_t=\dfrac{m_t}{1-\beta^t_1}$

$\hat v_t=\dfrac{v_t}{1-\beta^t_2}$

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule: $\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+\epsilon}\hat m_t$

• $m_t$ plays the role of the momentum update vector and is a function of the gradient; $\beta_1$ is the friction coefficient, typically set to 0.9.
• $v_t$ plays the role of the RMSprop cache, adaptively rescaling the step size of each parameter.
• $\beta_2$ is the decay rate of this cache, typically set to 0.999.
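Putting the pieces together on the scalar toy objective $J(\theta)=\theta^2/2$ (the betas are the usual defaults; $\eta$ and the objective are illustrative):

```python
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):          # time step t starts at 1 for the bias correction
    g = theta                     # gradient of J(theta) = theta^2 / 2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment estimate
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment estimate
    theta -= eta * m_hat / (v_hat ** 0.5 + eps)
```

Without the corrections, `m` and `v` would start near zero (they are initialized to 0), making the first updates far too small; dividing by $1-\beta^t$ undoes that bias.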

AdaMax

Norms for large $p$ values generally become numerically unstable, which is why $l_1$ and $l_2$ norms are most common in practice. However, the $l_{\infty}$ norm also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that $v_t$ with the $l_{\infty}$ norm converges to the following more stable value. To avoid confusion with Adam, we use $u_t$ to denote the infinity norm-constrained $v_t$:

$u_t=\beta_2^{\infty}v_{t-1}+(1-\beta_2^{\infty})|g_t|^{\infty}$

$=\max(\beta_2\cdot v_{t-1}, |g_t|)$

Note that as $u_t$ relies on the max operation, it is not as susceptible to bias towards zero as $m_t$ and $v_t$ in Adam, which is why we do not need to compute a bias correction for $u_t$. Good default values are again $\eta = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$.
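Relative to the Adam sketch, only the denominator changes (the defaults are those from the text; the toy objective $J(\theta)=\theta^2/2$ is illustrative):

```python
beta1, beta2, eta = 0.9, 0.999, 0.002  # defaults suggested in the text
theta, m, u = 5.0, 0.0, 0.0
for t in range(1, 10001):
    g = theta                        # gradient of J(theta) = theta^2 / 2
    m = beta1 * m + (1 - beta1) * g
    u = max(beta2 * u, abs(g))       # u_t = max(beta2 * u_{t-1}, |g_t|)
    m_hat = m / (1 - beta1 ** t)     # only the first moment needs bias correction
    theta -= eta * m_hat / u
```

Because `u` is a running max rather than a decaying average of squares, it never starts biased toward zero, which is why the second bias correction disappears.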

Visualization of algorithms

> we see the path they took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness by looking ahead and headed to the minimum.

Here we see the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope while the other dimension has a negative slope, which, as mentioned before, poses a difficulty for SGD. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.

example

model

TestNet(
(linear1): Linear(in_features=10, out_features=5, bias=True)
(linear2): Linear(in_features=5, out_features=1, bias=True)
(loss): BCELoss()
)
[('linear1.weight', Parameter containing:
tensor([[ 0.2901, -0.0022, -0.1515, -0.1064, -0.0475, -0.0324,  0.0404,  0.0266,
-0.2358, -0.0433],
[-0.1588, -0.1917,  0.0995,  0.0651, -0.2948, -0.1830,  0.2356,  0.1060,
0.2172, -0.0367],
[-0.0173,  0.2129,  0.3123,  0.0663,  0.2633, -0.2838,  0.3019, -0.2087,
-0.0886,  0.0515],
[ 0.1641, -0.2123, -0.0759,  0.1198,  0.0408, -0.0212,  0.3117, -0.2534,
-0.1196, -0.3154],
[ 0.2187,  0.1547, -0.0653, -0.2246, -0.0137,  0.2676,  0.1777,  0.0536,
('linear1.bias', Parameter containing:
tensor([ 0.1216,  0.2846, -0.2002, -0.1236,  0.2806], requires_grad=True)),
('linear2.weight', Parameter containing:
tensor([[-0.1652,  0.3056,  0.0749, -0.3633,  0.0692]], requires_grad=True)),
('linear2.bias', Parameter containing:
tensor([0.0450], requires_grad=True))]

<bound method Optimizer.state_dict of Adam (
Parameter Group 0
betas: (0.8, 0.999)
eps: 1e-08
lr: 0.001
weight_decay: 3e-07
)>

Setting different hyperparameters for different modules:

<bound method Optimizer.state_dict of Adam (
Parameter Group 0
betas: (0.8, 0.999)
eps: 1e-08
lr: 0.001
weight_decay: 3e-07

Parameter Group 1
betas: (0.8, 0.999)
eps: 1e-08
lr: 0.0003
weight_decay: 3e-07
)>
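The second printout above can be produced by passing parameter groups to the optimizer. A sketch assuming a two-layer model like the one in the printout (the module layout and hyperparameter values mirror the printed config):

```python
import torch
from torch import nn

# Two modules matching the Linear(10 -> 5) and Linear(5 -> 1) layers printed above.
linear1 = nn.Linear(10, 5)
linear2 = nn.Linear(5, 1)

# One dict per parameter group; per-group keys override the shared defaults.
optimizer = torch.optim.Adam(
    [
        {"params": linear1.parameters()},             # uses the default lr = 1e-3
        {"params": linear2.parameters(), "lr": 3e-4}, # overrides lr for this module
    ],
    lr=1e-3,
    betas=(0.8, 0.999),
    weight_decay=3e-7,
)
print(optimizer)  # one "Parameter Group" section per dict, as in the printout
```

Any key omitted from a group dict falls back to the keyword defaults, so only the differing `lr` needs to be spelled out per group.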