Variational Dropout: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
DropBlock targets CNNs; the latter two papers are about regularization for RNNs.
Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that
activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout.
Deep neural networks usually work well when they are over-parameterized and trained with plenty of noise and regularization, such as weight decay and dropout. However, while dropout is a very effective regularizer for fully connected layers, it does little for convolutional layers. This is probably because activations in convolutional layers are spatially correlated: even if some units are dropped, their information still reaches the next layer.
Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and more robustness to hyperparameter choices.
The authors propose DropBlock, a regularization method designed specifically for convolutional networks, which drops a contiguous spatial region at once. They found that applying DropBlock to ResNet effectively improves accuracy, and that gradually increasing the drop rate during training makes the result more robust to hyperparameter choices.
Quick review of skip/shortcut connections: their purpose is to avoid vanishing gradients. The GRU equations make this concrete (see my earlier notes).
In this paper, we introduce DropBlock, a structured form of dropout, that is particularly effective to regularize convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a feature map, are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data (see Figure 1).
The algorithm itself is simple; the key points are how to set its two parameters, block_size and $\gamma$.
block_size is the size of the block to be dropped
$\gamma$ controls how many activation units to drop.
We experimented with sharing a DropBlock mask across different feature channels versus giving each feature channel its own DropBlock mask. Algorithm 1 corresponds to the latter, which tends to work better in our experiments.
On the channel dimension: giving each feature map its own DropBlock mask works better than sharing one mask across all channels.
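A per-channel mask sampler along the lines described above can be sketched with numpy; the function name and the valid-region arithmetic are my own sketch, not the paper's reference code, and the training-time rescaling of surviving features is omitted:

```python
import numpy as np

def dropblock_mask(feat_size, block_size, gamma, rng):
    """Sample one per-channel DropBlock mask (1 = keep, 0 = drop).

    Block centers are sampled Bernoulli(gamma), restricted to the region
    where a block_size x block_size block fits fully inside the feature
    map; each sampled center is then expanded to a zeroed block.
    """
    mask = np.ones((feat_size, feat_size))
    half = block_size // 2
    # valid center region: (feat_size - block_size + 1) positions per axis
    lo = half
    hi = feat_size - (block_size - 1 - half)
    centers = rng.random((feat_size, feat_size)) < gamma
    centers[:lo, :] = centers[hi:, :] = False
    centers[:, :lo] = centers[:, hi:] = False
    for i, j in zip(*np.nonzero(centers)):
        mask[i - half:i - half + block_size,
             j - half:j - half + block_size] = 0.0
    return mask
```

In the full method each channel of each example gets an independently sampled mask, and the kept activations are rescaled by count(M)/count_ones(M) so that the expected activation magnitude is unchanged.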
Similar to dropout we do not apply DropBlock during inference. This is interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks include a special subset of sub-networks covered by dropout where each network does not see contiguous parts of feature maps.
At inference time, DropBlock is handled the same way as dropout: it is simply turned off.
In our implementation, we set a constant block_size for all feature maps, regardless of the resolution of the feature map. DropBlock resembles dropout when block_size = 1 and resembles SpatialDropout when block_size covers the full feature map.
When block_size is set to 1, DropBlock reduces to dropout; when block_size covers the entire feature map, it resembles SpatialDropout.
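To see the second limiting case concretely: when block_size equals feat_size there is only one valid block center, so a channel is either kept whole or zeroed whole, which is exactly SpatialDropout's behaviour. A toy numpy check (all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
feat_size = 6
x = rng.random((4, feat_size, feat_size))  # (channels, H, W)

# With block_size == feat_size, the single possible block covers the whole
# map, so sampling its center with probability gamma zeroes the channel.
gamma = 0.5
dropped = rng.random(x.shape[0]) < gamma
y = x * (~dropped)[:, None, None]

for c in range(x.shape[0]):
    # each channel is either fully zeroed or left untouched
    assert y[c].sum() == 0.0 or np.allclose(y[c], x[c])
```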
Setting the value of $\gamma$:
In practice, we do not explicitly set $\gamma$. As stated earlier, $\gamma$ controls the number of features to drop. Suppose that we want to keep every activation unit with probability keep_prob; in dropout the binary mask would be sampled from a Bernoulli distribution with mean 1 − keep_prob. However, to account for the fact that every zero entry in the mask will be expanded by block_size$^2$ and that the blocks must be fully contained in the feature map, we need to adjust $\gamma$ accordingly when we sample the initial binary mask. In our implementation, $\gamma$ can be computed as

$$\gamma = \frac{1-\text{keep\_prob}}{\text{block\_size}^2}\cdot\frac{\text{feat\_size}^2}{(\text{feat\_size}-\text{block\_size}+1)^2}$$
The authors do not set $\gamma$ explicitly. In dropout, each unit follows a Bernoulli distribution with keep probability keep_prob; for DropBlock, $\gamma$ must additionally account for block_size and its ratio to the feature map size.
keep_prob is the conventional dropout keep probability, usually set to 0.75-0.9.
feat_size is the size of the feature map.
(feat_size - block_size + 1) is the valid region for choosing the center of a block.
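The $\gamma$ computation described above fits in a small helper; a minimal sketch, with the function name dropblock_gamma being mine:

```python
def dropblock_gamma(keep_prob, block_size, feat_size):
    """Approximate Bernoulli rate for block centers so that the expected
    fraction of dropped units matches 1 - keep_prob.

    Each sampled center zeroes block_size**2 units, and centers can only
    land in the (feat_size - block_size + 1)**2 valid region, hence the
    two correction factors.
    """
    return ((1.0 - keep_prob) / block_size**2
            * feat_size**2 / (feat_size - block_size + 1)**2)
```

With block_size = 1 both correction factors vanish and $\gamma$ is just 1 − keep_prob, matching the plain-dropout limiting case.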
The main nuance of DropBlock is that there will be some overlap among the dropped blocks, so the above equation is only an approximation.
The main caveat is that dropped blocks may overlap, so the equation above is only an approximation.
We found that DropBlock with a fixed keep_prob during training does not work well. Applying a small value of keep_prob hurts learning at the beginning. Instead, gradually decreasing keep_prob over time from 1 to the target value is more robust and adds improvement for most values of keep_prob.
keep_prob is scheduled rather than fixed: dropping features early in training hurts performance, so it starts at 1 and is gradually decreased to the target value.
Does it vary with network depth or with the training iteration? It should be the latter, analogous to a scheduled learning rate.
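Under that reading, the schedule is a function of the training step, like a learning-rate schedule. A minimal sketch assuming a linear decay (the helper name is mine):

```python
def keep_prob_schedule(step, total_steps, target=0.9):
    """Linearly decay keep_prob from 1.0 at step 0 to `target` at the end,
    then hold it at `target` for any remaining steps."""
    frac = min(step / total_steps, 1.0)
    return 1.0 - frac * (1.0 - target)
```

In training, keep_prob_schedule(step, total_steps) would be recomputed each iteration and fed into the $\gamma$ computation before sampling the masks.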
In the following experiments, we study where to apply DropBlock in residual networks. We experimented with applying DropBlock only after convolution layers or applying DropBlock after both convolution layers and skip connections. To study the performance of DropBlock applying to different feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3 and 4.
The experiments mainly investigate where in the network to add DropBlock and how to apply it across channels.