DropBlock is about CNNs; the latter two papers are about regularization for RNNs.

# DropBlock

## Motivation

Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout.

Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and makes the model more robust to hyperparameter choices.

## DropBlock

In this paper, we introduce DropBlock, a structured form of dropout, that is particularly effective to regularize convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a feature map, are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data (see Figure 1).

- block_size is the size of the block to be dropped
- $$\gamma$$ controls how many activation units to drop.

We experimented with a shared DropBlock mask across feature channels versus giving each feature channel its own DropBlock mask. Algorithm 1 corresponds to the latter, which tends to work better in our experiments.

Similar to dropout we do not apply DropBlock during inference. This is interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks include a special subset of sub-networks covered by dropout where each network does not see contiguous parts of feature maps.
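Below is a minimal sketch of the per-channel variant (Algorithm 1) as a PyTorch-style module. The class name `DropBlock2d`, its argument names, and the use of `max_pool2d` to expand sampled seed points into full blocks are my own illustrative choices (assuming an odd `block_size`), not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DropBlock2d(nn.Module):
    """Sketch of DropBlock: drop contiguous block_size x block_size regions."""

    def __init__(self, keep_prob=0.9, block_size=7):
        super().__init__()
        self.keep_prob = keep_prob
        self.block_size = block_size  # assumed odd so padding preserves spatial size

    def forward(self, x):
        # Like dropout, DropBlock is disabled at inference time.
        if not self.training or self.keep_prob >= 1.0:
            return x

        n, c, h, w = x.shape
        # gamma: seed-sampling rate derived from keep_prob (see the formula below).
        gamma = ((1.0 - self.keep_prob) / self.block_size ** 2) * \
                (h * w / ((h - self.block_size + 1) * (w - self.block_size + 1)))

        # Sample block centers only where a full block fits inside the feature map.
        pad = self.block_size // 2
        valid = torch.zeros_like(x)
        valid[:, :, pad:h - pad, pad:w - pad] = 1.0
        seeds = (torch.rand_like(x) < gamma).float() * valid

        # Expand each seed into a block_size x block_size square of dropped units.
        block_mask = F.max_pool2d(seeds, kernel_size=self.block_size,
                                  stride=1, padding=pad)
        keep_mask = 1.0 - block_mask

        # Normalize so the expected activation magnitude stays unchanged.
        return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```

Because `torch.rand_like(x)` draws an independent sample per channel, each feature channel gets its own mask, which is the per-channel version the paper found to work better; sharing one mask across channels would instead broadcast a single `(n, 1, h, w)` sample.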

block_size:
> In our implementation, we set a constant block_size for all feature maps, regardless the resolution of feature map. DropBlock resembles dropout [1] when block_size = 1 and resembles SpatialDropout [20] when block_size covers the full feature map.
When block_size is set to 1, DropBlock reduces to dropout; when block_size covers the entire feature map, it behaves like SpatialDropout.

setting the value of $$\gamma$$:
> In practice, we do not explicitly set $$\gamma$$. As stated earlier, $$\gamma$$ controls the number of features to drop. Suppose that we want to keep every activation unit with the probability of keep_prob, in dropout [1] the binary mask will be sampled with the Bernoulli distribution with mean 1 − keep_prob. However, to account for the fact that every zero entry in the mask will be expanded by block_size² and the blocks will be fully contained in the feature map, we need to adjust $$\gamma$$ accordingly when we sample the initial binary mask. In our implementation, $$\gamma$$ can be computed as

$$\gamma = \frac{1 - \text{keep\_prob}}{\text{block\_size}^2} \cdot \frac{\text{feat\_size}^2}{(\text{feat\_size} - \text{block\_size} + 1)^2}$$

- keep_prob is the keep probability as in conventional dropout, typically set to 0.75–0.9.
- feat_size is the size of the feature map.
- (feat_size - block_size + 1) is the valid region from which the center of a dropped block can be sampled.

The main nuance of DropBlock is that there will be some overlap among the dropped blocks, so the above equation is only an approximation.
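As a quick numeric check of the formula, assuming a 28×28 feature map with block_size = 7 and keep_prob = 0.9 (both values are only illustrative):

```python
keep_prob, block_size, feat_size = 0.9, 7, 28

# gamma = (1 - keep_prob) / block_size^2 * feat_size^2 / (feat_size - block_size + 1)^2
gamma = (1 - keep_prob) / block_size ** 2 \
        * feat_size ** 2 / (feat_size - block_size + 1) ** 2

print(gamma)  # ~0.0033: only a small fraction of units seed a block,
              # since each seed removes block_size^2 units at once.
```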

Scheduled DropBlock:
> We found that DropBlock with a fixed keep_prob during training does not work well. Applying a small value of keep_prob hurts learning at the beginning. Instead, gradually decreasing keep_prob over time from 1 to the target value is more robust and adds improvement for most values of keep_prob.
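A minimal sketch of such a schedule, assuming a simple linear ramp over training steps; the helper name and the step-based interface here are my own, not the paper's implementation:

```python
def scheduled_keep_prob(step, total_steps, target_keep_prob=0.9):
    """Linearly decrease keep_prob from 1.0 to the target over training."""
    frac = min(step / total_steps, 1.0)
    return 1.0 - frac * (1.0 - target_keep_prob)

# Example: at step 0 keep_prob is 1.0 (nothing dropped), halfway it is 0.95,
# and by the end of training it reaches the target 0.9.
```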

## Experiments

In the following experiments, we study where to apply DropBlock in residual networks. We experimented with applying DropBlock only after convolution layers or applying DropBlock after both convolution layers and skip connections. To study the performance of applying DropBlock to different feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3 and 4.
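A sketch of these placement choices in a simplified residual block, reusing the imports and the `DropBlock2d` module sketched above; the block structure here is a generic illustration rather than the exact ResNet-50 blocks used in the paper:

```python
class ResidualBlockWithDropBlock(nn.Module):
    """Toy residual block with DropBlock on both branches."""

    def __init__(self, channels, keep_prob=0.9, block_size=7):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.dropblock = DropBlock2d(keep_prob, block_size)

    def forward(self, x):
        # DropBlock after the convolution layers ...
        out = self.dropblock(F.relu(self.bn1(self.conv1(x))))
        out = self.dropblock(self.bn2(self.conv2(out)))
        # ... and also on the skip connection, the variant that adds accuracy.
        identity = self.dropblock(x)
        return F.relu(out + identity)
```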