# Paper Notes: DropBlock

paper:

DropBlock is about CNNs; the latter two papers cover regularization for RNNs.

# DropBlock

## Motivation

Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated, so information can still flow through convolutional networks despite dropout.

Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and makes the model more robust to hyperparameter choices.

## DropBlock

In this paper, we introduce DropBlock, a structured form of dropout, that is particularly effective to regularize convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a feature map, are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data (see Figure 1).

• block_size is the size of the block to be dropped

• $\gamma$ controls how many activation units to drop.

We experimented with a shared DropBlock mask across different feature channels and with each feature channel having its own DropBlock mask. Algorithm 1 corresponds to the latter, which tends to work better in our experiments.

Similar to dropout we do not apply DropBlock during inference. This is interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks include a special subset of sub-networks covered by dropout where each network does not see contiguous parts of feature maps.

block_size:

In our implementation, we set a constant block_size for all feature maps, regardless of the resolution of the feature map. DropBlock resembles dropout [1] when block_size = 1 and resembles SpatialDropout [20] when block_size covers the full feature map.

When block_size is set to 1, DropBlock behaves like dropout; when block_size covers the entire feature map, it behaves like SpatialDropout.

setting the value of $\gamma$:

In practice, we do not explicitly set $\gamma$. As stated earlier, $\gamma$ controls the number of features to drop. Suppose that we want to keep every activation unit with the probability of keep_prob; in dropout [1] the binary mask would be sampled from a Bernoulli distribution with mean 1 − keep_prob. However, to account for the fact that every zero entry in the mask will be expanded by block_size² and that the blocks must be fully contained in the feature map, we need to adjust $\gamma$ accordingly when we sample the initial binary mask. In our implementation, $\gamma$ can be computed as

$$\gamma = \dfrac{1-\text{keep\_prob}}{\text{block\_size}^2}\cdot\dfrac{\text{feat\_size}^2}{(\text{feat\_size}-\text{block\_size}+1)^2}$$

• keep_prob is the keep probability from conventional dropout, usually set to 0.75–0.9.

• feat_size is the size of the whole feature map.

• (feat_size - block_size + 1) is the size of the valid region for choosing the center of a dropped block.

The main nuance of DropBlock is that there will be some overlap among the dropped blocks, so the above equation is only an approximation.
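A minimal PyTorch sketch of this procedure, reconstructed from the description above rather than taken from the authors' code (the helper name drop_block and the max-pooling trick for expanding block centers are my own choices; block_size is assumed odd):

```python
import torch
import torch.nn.functional as F

def drop_block(x, keep_prob=0.9, block_size=7, training=True):
    """DropBlock sketch: x has shape (N, C, H, W); block_size assumed odd."""
    if not training or keep_prob >= 1.0:
        return x  # like dropout, DropBlock is the identity at inference
    _, _, h, w = x.shape
    # gamma from the paper's approximation
    gamma = ((1.0 - keep_prob) / block_size ** 2) \
            * (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    half = block_size // 2
    # sample block centers only where a block fits entirely inside the map;
    # one mask per channel (the variant Algorithm 1 describes)
    centers = (torch.rand_like(x) < gamma).float()
    valid = torch.zeros_like(x)
    valid[:, :, half:h - half, half:w - half] = 1.0
    centers = centers * valid
    # expand every sampled center into a block_size x block_size square of zeros
    block_mask = 1.0 - F.max_pool2d(centers, block_size, stride=1, padding=half)
    # rescale so the expected magnitude of the activations is preserved
    return x * block_mask * block_mask.numel() / block_mask.sum()
```

The max-pool with stride 1 is just a convenient way to dilate each center into a full block; overlapping blocks merge, which is exactly why the $\gamma$ equation is only approximate.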

Scheduled DropBlock:

We found that DropBlock with a fixed keep_prob during training does not work well. Applying a small value of keep_prob hurts learning at the beginning. Instead, gradually decreasing keep_prob over time from 1 to the target value is more robust and adds improvement for most values of keep_prob.
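A sketch of such a schedule (scheduled_keep_prob is a hypothetical helper; the paper uses a linear scheme over training):

```python
def scheduled_keep_prob(step, total_steps, target=0.9):
    # linearly decrease keep_prob from 1.0 to the target value over training
    frac = min(step / float(total_steps), 1.0)
    return 1.0 - frac * (1.0 - target)
```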

## Experiments

In the following experiments, we study where to apply DropBlock in residual networks. We experimented with applying DropBlock only after convolution layers or applying DropBlock after both convolution layers and skip connections. To study the performance of DropBlock applying to different feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3 and 4.

# Deep Learning: Optimization Algorithms

Batch gradient descent computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset:

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta)$$

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i) and label y(i):

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})$$

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$$

• reduces the variance of the parameter updates, which can lead to more stable convergence;

• can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries that make computing the gradient with respect to a mini-batch very efficient (see the sketch below).
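As a concrete sketch (with a hypothetical grad callback that computes $\nabla_{\theta}J$ on the given examples), the three variants differ only in batch_size:

```python
import numpy as np

def train(theta, X, y, grad, lr=0.01, batch_size=50, epochs=10):
    # batch_size=len(X) gives batch GD, batch_size=1 gives SGD,
    # anything in between gives mini-batch gradient descent
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)  # shuffle every epoch
        for i in range(0, n, batch_size):
            idx = order[i:i + batch_size]
            theta = theta - lr * grad(theta, X[idx], y[idx])
    return theta
```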

## Challenges

• Choosing a proper learning rate.

• Learning rate schedules try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics.

• the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

• Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

### Momentum

paper: [Neural networks : the official journal of the International Neural Network Society]()

Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

without Momentum:

$$\theta += -lr * \nabla_{\theta}J(\theta)$$

with Momentum:

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta)$$

$$\theta=\theta-v_t$$

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.

$\gamma$ can be seen as a friction coefficient and is usually set to 0.9; $\eta$ is the learning rate.
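In update-rule form, a one-step NumPy-style sketch (grad is a hypothetical gradient function):

```python
def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # v is the "velocity" carried over from the previous step
    v = gamma * v + lr * grad(theta)
    return theta - v, v
```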

### Nesterov accelerated gradient

paper: [Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).]()

We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$$

$$\theta=\theta-v_t$$

where $\phi = \theta-\gamma v_{t-1}$ is the look-ahead position at which the gradient is evaluated.
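The same sketch with the look-ahead gradient, which is the only change from plain momentum:

```python
def nag_step(theta, v, grad, lr=0.01, gamma=0.9):
    # evaluate the gradient at the look-ahead point phi = theta - gamma * v
    v = gamma * v + lr * grad(theta - gamma * v)
    return theta - v, v
```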

### Adagrad

paper: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization]()

Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

$$g_{t,i}=\nabla_{\theta_t}J(\theta_{t,i})$$

$$\theta_{t+1,i}=\theta_{t,i}-\eta \cdot g_{t,i}$$

$$\theta_{t+1,i}=\theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\, g_{t,i}$$

$G_t$ is a diagonal matrix whose diagonal element $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$.
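A per-parameter sketch, with G kept as a vector of the diagonal entries:

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    g = grad(theta)
    G = G + g ** 2  # accumulate squared gradients per parameter
    return theta - lr * g / np.sqrt(G + eps), G
```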

### RMSprop

Geoff Hinton Lecture 6e

$$E[g^2]_t=0.9E[g^2]_{t-1}+0.1g^2_t$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t$$
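The corresponding sketch; the only change from Adagrad is that the squared-gradient accumulator decays:

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, lr=0.001, eps=1e-8):
    g = grad(theta)
    Eg2 = 0.9 * Eg2 + 0.1 * g ** 2  # exponential moving average of g^2
    return theta - lr * g / np.sqrt(Eg2 + eps), Eg2
```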

### Adam

paper: [Adam: A Method for Stochastic Optimization]()

In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:


$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$$

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

$$\hat m_t=\dfrac{m_t}{1-\beta^t_1}$$

$$\hat v_t=\dfrac{v_t}{1-\beta^t_2}$$

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+ \epsilon}\hat m_t$$

• $m_t$ is analogous to the parameter update quantity in Momentum and is a function of the gradient; $\beta_1$ acts as the friction coefficient and is usually set to 0.9.

• $v_t$ is analogous to the cache in RMSprop, used to adaptively rescale the gradients of the different parameters.

• $\beta_2$ is the decay rate of the cache and is usually set to 0.999 (see the sketch below).
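Putting the pieces together in a one-step sketch (t is the 1-based step counter used for bias correction):

```python
import numpy as np

def adam_step(theta, m, v, t, grad, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g          # momentum-like first moment
    v = b2 * v + (1 - b2) * g ** 2     # RMSprop-like second moment
    m_hat = m / (1 - b1 ** t)          # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```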

### AdaMax

paper: [Adam: A Method for Stochastic Optimization]()

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

Norms for large $p$ values generally become numerically unstable, which is why $l_1$ and $l_2$ norms are most common in practice. However, $l_{\infty}$ also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that $v_t$ with $l_{\infty}$ converges to the following more stable value. To avoid confusion with Adam, we use $u_t$ to denote the infinity norm-constrained $v_t$:

$$u_t=\beta_2^{\infty}v_{t-1}+(1-\beta_2^{\infty})|g_t|^{\infty}$$

$$=\max(\beta_2\cdot v_{t-1}, |g_t|)$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{u_t}\hat m_t$$

Note that as $u_t$ relies on the max operation, it is not as suggestible to bias towards zero as $m_t$ and $v_t$ in Adam, which is why we do not need to compute a bias correction for $u_t$. Good default values are again:

$$\eta = 0.002, \beta_1 = 0.9, \beta_2 = 0.999.$$
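A sketch of the AdaMax step under the same conventions (the eps guard is my addition for numerical safety):

```python
import numpy as np

def adamax_step(theta, m, u, t, grad, lr=0.002, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g
    u = np.maximum(b2 * u, np.abs(g))  # infinity-norm cache; no bias correction needed
    m_hat = m / (1 - b1 ** t)
    return theta - lr * m_hat / (u + eps), m, u
```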

## Visualization of algorithms

We see the paths the different algorithms took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness from looking ahead, and headed to the minimum.

Another visualization shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope while the other dimension has a negative slope, which poses a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.

## example

### model

```
TestNet(
  (linear1): Linear(in_features=10, out_features=5, bias=True)
  (linear2): Linear(in_features=5, out_features=1, bias=True)
  (loss): BCELoss()
)
```

```
[('linear1.weight', Parameter containing:
tensor([[ 0.2901, -0.0022, -0.1515, -0.1064, -0.0475, -0.0324,  0.0404,  0.0266, -0.2358, -0.0433],
        [-0.1588, -0.1917,  0.0995,  0.0651, -0.2948, -0.1830,  0.2356,  0.1060,  0.2172, -0.0367],
        [-0.0173,  0.2129,  0.3123,  0.0663,  0.2633, -0.2838,  0.3019, -0.2087, -0.0886,  0.0515],
        [ 0.1641, -0.2123, -0.0759,  0.1198,  0.0408, -0.0212,  0.3117, -0.2534, -0.1196, -0.3154],
        [ 0.2187,  0.1547, -0.0653, -0.2246, -0.0137,  0.2676,  0.1777,  0.0536,
 ('linear1.bias', Parameter containing:
tensor([ 0.1216,  0.2846, -0.2002, -0.1236,  0.2806], requires_grad=True)),
 ('linear2.weight', Parameter containing:
tensor([[-0.1652,  0.3056,  0.0749, -0.3633,  0.0692]], requires_grad=True)),
 ('linear2.bias', Parameter containing:
```
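A sketch of a model definition that would produce the repr and parameter list above; the original code is not shown, so the forward pass (activations included) is my assumption:

```python
import torch
import torch.nn as nn

class TestNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 5)
        self.linear2 = nn.Linear(5, 1)
        self.loss = nn.BCELoss()

    def forward(self, x, y):
        h = torch.relu(self.linear1(x))     # assumed activation
        p = torch.sigmoid(self.linear2(h))  # BCELoss expects probabilities
        return self.loss(p, y)

model = TestNet()
print(model)
print(list(model.named_parameters()))
```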

### add model parameters to optimizer

```
<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 3e-07
)>
```
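The printout above could come from a construction like the following (hyperparameter values match the printout; model is the TestNet from the previous section):

```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.001,
                       betas=(0.8, 0.999), weight_decay=3e-07)
print(optimizer.state_dict)  # printing the bound method gives the repr above
```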


### Different parameter settings for different modules

```
<bound method Optimizer.state_dict of Adam (
Parameter Group 0
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 3e-07

Parameter Group 1
    betas: (0.8, 0.999)
    eps: 1e-08
    lr: 0.0003
    weight_decay: 3e-07
)>
```
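Two parameter groups like these can be set up by passing a list of dicts; per-group keys override the defaults. The split below is an assumption consistent with the printout:

```python
import torch.optim as optim

optimizer = optim.Adam(
    [{'params': model.linear1.parameters()},               # group 0: default lr
     {'params': model.linear2.parameters(), 'lr': 3e-4}],  # group 1: its own lr
    lr=0.001, betas=(0.8, 0.999), weight_decay=3e-07)
```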


### The lr_scheduler helper class

lr_scheduler is used to adjust the learning rate flexibly during training according to the epoch. There are many ways to adjust the learning rate, but they are all used in roughly the same way: wrap the original Optimizer with a scheduler, pass in the relevant parameters, and then call step() on the scheduler.
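For example, with StepLR (the epoch count and decay factor here are arbitrary):

```python
from torch.optim import lr_scheduler

scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    # ... run one epoch of training, calling optimizer.step() on each batch ...
    scheduler.step()  # multiply the learning rate by 0.1 every 30 epochs
```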

# Paper Notes: Batch, Layer, and Weight Normalization

paper:

## Layer Normalization

### Motivation

batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.

Batch Normalization normalizes the linear (pre-activation) outputs.

batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.

BN is not used for RNNs because the sentence lengths within a batch differ. We can view each time step as extracting features along one dimension; normalizing over that dimension as BN does clearly breaks down for RNNs. For example, at the last time step of the longest sequence in the batch, the mean would be computed from that single value alone, which amounts to running BN on a single sample.

In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.

### Layer Normalization

Layer normalization computes the mean and variance over the hidden units rather than over the samples in a batch.

Differences between BN and LN:

Layer normalization computes the mean and variance over a single sample, so its behaviour is identical at training and test time.

### Layer normalized recurrent neural networks

It is common in NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.

$a^t=W_{hh}h^{t-1}+W_{xh}x^t$

Using layer normalization in an LSTM:

## TensorFlow implementation

### layer normalization

The result of layer_norm_mine matches the library implementation. Notice that when computing the mean and variance, tf.nn.moments is called with axes=[1:-1] (in tf.nn.moments, axes specifies the dimensions over which the mean and variance are computed), so the resulting mean and variance indeed have shape [batch,]. Only the beta and gamma transformation is applied along the last dimension. Interestingly, the final effect should therefore be consistent with batch normalization; whether it suits the characteristics of images or text is another matter.
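A NumPy sketch of what a layer_norm_mine along these lines computes; this is my reconstruction (normalize over all non-batch axes, then broadcast gamma and beta over the last axis), not the original code:

```python
import numpy as np

def layer_norm_mine(x, gamma, beta, eps=1e-12):
    # x: (batch, ..., features); statistics over everything except the batch axis
    axes = tuple(range(1, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)  # shape (batch, 1, ..., 1)
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta              # gamma/beta have shape (features,)
```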

# Deep Learning: Batch Normalization

## Motivation

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift.

While stochastic gradient descent is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer.

Therefore, the input distribution properties that aid the network generalization – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such, it is advantageous for the distribution of x to remain fixed over time.

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the subnetwork, as well.

$z=g(Wu+b)$

• ReLU

• Careful initialization, e.g. Xavier initialization

• Training with a small learning rate

If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs.

Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

Besides addressing internal covariate shift, BN reduces the dependence of the gradients on the learning rate and on the initial parameter values. This lets us use larger learning rates, regularizes the model, reduces the need for dropout, and allows the network to use saturating nonlinear activation functions.

## Towards Reducing Internal Covariate Shift

### Whitening

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

$x=u+b \rightarrow \hat x = x-E[x] \rightarrow loss$

$\dfrac{\partial l}{\partial b}=\dfrac{\partial l}{\partial \hat x}\dfrac{\partial \hat x}{\partial b} = \dfrac{\partial l}{\partial \hat x}$

$u+(b+\Delta b)-E[u+(b+\Delta b)]=u+b-E[u+b]$

That is, the gradient step updates $b$, but the subsequent mean subtraction cancels the update: the normalized output and the loss are unchanged while $b$ grows without bound.

This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.

### Whitening while accounting for normalization during optimization

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ.

$\hat x = Norm(x, \chi)$

$$\frac{\partial{Norm(x,\chi)}}{\partial{x}}\text{ and }\frac{\partial{Norm(x,\chi)}}{\partial{\chi}}$$

## Normalization via Mini-Batch Statistics

### Two simplifications relative to whitening

Since the full whitening of each layer’s inputs is costly, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance.

$$\hat x^{(k)} = \dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt {Var[x^{(k)}]}}$$

where the expectation and variance are computed over the training data set.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

$$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}$$

$\gamma^{(k)}$ and $\beta^{(k)}$ are learnable parameters that restore the representational power of the network after normalization. In particular, if $\gamma^{(k)}=\sqrt {Var[x^{(k)}]}$ and $\beta^{(k)}=E[x^{(k)}]$, the transform recovers the original activations.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.

Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

### The core BN procedure

For a mini-batch of size m, consider a single dimension $x^{(k)}$, where k indexes the feature. The m values of this feature in the batch are

$$\mathcal{B}=\{x_{1,..,m}\}$$

$$BN_{\gamma, \beta}:x_{1,..,m}\rightarrow y_{1,..,m}$$

• For each dimension of the input mini-batch, compute the mean and variance

• Normalize (with $\epsilon$ in the denominator to avoid division by zero)

• Scale and shift using the two learnable parameters

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.

The BN transform is differentiable, so the model remains trainable end to end; layers can keep learning on input distributions that exhibit less internal covariate shift, which accelerates training.

### Training and Inference with Batch-Normalized Networks

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

$\hat x = \dfrac{x-E[x]}{\sqrt{Var[x]+\epsilon}}$

using the population, rather than mini-batch, statistics.

Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
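A NumPy sketch tying the two phases together (the moving-average momentum value is an assumption):

```python
import numpy as np

def batch_norm(x, gamma, beta, run_mean, run_var, training, momentum=0.9, eps=1e-5):
    # x: (batch, features); statistics are per feature
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)          # mini-batch statistics
        run_mean = momentum * run_mean + (1 - momentum) * mean
        run_var = momentum * run_var + (1 - momentum) * var
    else:
        mean, var = run_mean, run_var                      # population estimates
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, run_mean, run_var
```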

### Batch-Normalized Convolutional Networks

• Steps 1–5 follow Algorithm 1: normalize each dimension, yielding $N_{BN}^{tr}$

• Steps 6–7 optimize the training parameters $\theta \cup \{\gamma^{(k)}, \beta^{(k)}\}$; at test time these parameters are frozen

• Steps 8–12 convert the per-batch statistics gathered during training into statistics of the whole training set, because at inference time the model uses the stored population statistics. This involves unbiased estimation of the population mean and variance from sample statistics: the sample mean is an unbiased estimate of the population mean, but the sample variance is not an unbiased estimate of the population variance (hence the $\frac{m}{m-1}$ correction). See the discussion at https://www.zhihu.com/question/20099757

### Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.

By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.

BN makes training more resilient to the parameter scale. Normally, a large learning rate can increase the scale of layer parameters, which then amplifies the gradients during backpropagation and can cause gradient explosion. With BN, backpropagation through a layer is unaffected by the scale of its parameters.

## Implementation

TensorFlow already provides a BN layer, which can be called directly via tf.contrib.layers.batch_norm(). If you want to know how the function is implemented under the hood and deepen your understanding of the BN layer, see the article Implementing Batch Normalization in Tensorflow.
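A minimal TF 1.x usage sketch (the argument values here are illustrative, not a recommendation):

```python
import tensorflow as tf  # TensorFlow 1.x

x = tf.placeholder(tf.float32, [None, 64])
h = tf.contrib.layers.batch_norm(x, decay=0.999, center=True, scale=True,
                                 is_training=True)  # set False at inference
```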

# Machine Learning: Overfitting

## Remedies

### Get more data

• Collect more data at the source: this is the obvious option. For object classification, for example, we can simply take more photos. In many cases, however, substantially increasing the amount of data is not easy, and it is unclear how much data would be enough.

• Estimate the parameters of the data distribution from the current dataset and sample more data from that distribution: this is rarely done, because the estimation of the distribution parameters itself carries sampling error.

• Data augmentation: expand the data according to fixed rules. In object classification, for example, the object's position, pose, and scale in the image, as well as the overall brightness, do not affect the class label, so we can multiply the dataset several times over by translation, flipping, scaling, cropping, and so on.

### Use an appropriate model

(PS: If model complexity can be determined through physical or mathematical modeling, that is the best approach, which is why, even now that deep learning is so popular, I still insist that beginners should master traditional modeling methods.)

#### Add noise to the weights

Graves, Alex, et al. “A novel connectionist system for unconstrained handwriting recognition.” IEEE transactions on pattern analysis and machine intelligence 31.5 (2009): 855-868.

• It may work better, especially in recurrent networks (Hinton)

# Deep Learning: Weight Initialization

• Why weight initialization is needed

• Derivation of Xavier initialization

### Weight initialization

In order to avoid neurons becoming too correlated and ending up in poor local minima, it is often helpful to randomly initialize parameters. The most commonly used scheme is Xavier initialization.

#### Why do we need weight initialization?

$$y_j=w_{j,1}x_1+w_{j,2}x_2+\ldots+w_{j,1000}x_{1000}$$

x can be viewed as a 1000-dimensional variable with each $x_i\sim N(0,1)$. If the entries of $w_j$ are large, say $w_j=[100,100,\ldots,100]$, then each term $w_{j,i}x_i$ has variance $100^2=10^4$, so the variance of the output neuron $y_j$ is enormous ($1000 \times 10^4$ here), while its mean is still 0.

#### Xavier Initialization

$$y_j=w_{j,1}x_1+w_{j,2}x_2+\ldots+w_{j,N} x_N+b$$

$$var(y_j) = var(w_{j,1}x_1+w_{j,2}x_2+\ldots+w_{j,N} x_N+b)$$

$$var(w_{j,i}x_i) = E(x_i)^2var(w_{j,i}) + E(w_{j,i})^2var(x_i) + var(w_{j,i})var(x_i)$$

Assuming $E(x_i)=E(w_{j,i})=0$, the first two terms vanish:

$$var(w_{j,i}x_i)=var(w_{j,i})var(x_i)$$

$$var(y_j) = var(w_{j,1})var(x_1) + \ldots + var(w_{j,N})var(x_N)$$

If the $x_i$ and $w_{j,i}$ are i.i.d., this becomes

$$var(y_j) = N \cdot var(w_{j,i}) \cdot var(x_i)$$

To keep the output variance equal to the input variance, we need

$$N \cdot var(w_{j,i})=1$$

$$var(w_{j,i})=1/N$$

There we go! This gives the Xavier initialization formula: initialize the weights from a Gaussian with mean 0 and variance 1/N, where N is the number of input neurons of the current layer. This is how it is implemented in Caffe.

#### More initialization schemes

Glorot and Bengio suggest balancing the variances of the forward and backward passes, which leads to

$$var(w)=2/(N_{in}+N_{out})$$

The paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification derives an initialization specifically for ReLU networks, with $var(w)=2.0/N$; this is what is typically used in practice.
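A NumPy sketch of the three variance targets discussed above:

```python
import numpy as np

n_in, n_out = 1000, 500
# Xavier: var(w) = 1/N, with N the number of inputs (fan-in)
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)
# Glorot & Bengio's compromise: var(w) = 2/(N_in + N_out)
W_glorot = np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))
# He et al. for ReLU networks: var(w) = 2/N
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```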

# Deep Learning: Dropout

The mathematics behind dropout.

### Dropout

#### Inverted dropout

Python code:
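Since the original snippet is not included here, this is the standard inverted-dropout sketch in the cs231n style (p is the keep probability):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(x):
    # drop units and scale by 1/p now, so inference needs no rescaling
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

def predict(x):
    return x  # identity: expected activations already match training
```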

Dropout paper by Srivastava et al. 2014.

Dropout Training as Adaptive Regularization: "We argue that performing L2 regularization after rescaling the features by the expectation of the diagonal inverse of the Fisher information matrix is first-order equivalent to dropout regularization."

# Some Tricks in Machine Learning

The mathematics of L2 regularization

### L2 regularization

To avoid parameters from exploding or becoming highly correlated, it is helpful to augment our cost function with a Gaussian prior: this tends to push parameter weights closer to zero, without constraining their direction, and often leads to classifiers with better generalization ability.

If we maximize log-likelihood (as with the cross-entropy loss, above), then the Gaussian prior becomes a quadratic term (L2 regularization):

$$J_{reg}(\theta)=\dfrac{\lambda}{2}\left[\sum_{i,j}(W_1)_{i,j}^2+\sum_{i',j'}(W_2)_{i',j'}^2\right]$$

$$W_{i,j} \sim N(0,\, 1/\lambda)$$
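As a sketch of how the quadratic term enters training (the weight tensors are hypothetical and the data loss is a stand-in):

```python
import torch

W1 = torch.randn(10, 5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)
lam = 1e-4
data_loss = torch.tensor(0.0)  # stand-in for the cross-entropy term
reg = (lam / 2) * (W1.pow(2).sum() + W2.pow(2).sum())
loss = data_loss + reg         # the Gaussian prior shows up as this L2 penalty
loss.backward()
```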