# 论文笔记-video transformer

paper list:

• Training data-efficient image transformers & distillation through attention.
• An image is worth 16x16 words: Transformers for image recognition at scale.
• ViViT: A Video Vision Transformer.
• Is Space-Time Attention All You Need for Video Understanding?
• Video transformer network.
• Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
• CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
• What Makes for Hierarchical Vision Transformer?
• WideNet: Go Wider Instead of Deeper
• CoAtNet: Marrying Convolution and Attention for All Data Sizes

## ViViT: A Video Vision Transformer

• tokenization strategies
• model architecture
• regularisation methods

Drawback of transformer-based architectures: compared with convolutional networks, transformers lack some of the inductive biases built into convolutions, such as translation equivariance. As a result, they need large amounts of data to learn.
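As a concrete example of one of ViViT's tokenization strategies, tubelet embedding splits the video into non-overlapping spatio-temporal tubelets and flattens each one into a token. A minimal sketch (the tubelet sizes are illustrative, and the learned linear projection that follows in the paper is omitted):

```python
import numpy as np

def tubelet_embed(video, t=2, h=16, w=16):
    """Split a video of shape (T, H, W, C) into non-overlapping
    spatio-temporal tubelets of size (t, h, w) and flatten each
    tubelet into a single token vector."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    # carve out the tubelet grid: (T//t, t, H//h, h, W//w, w, C)
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # group the grid axes first, tubelet axes last
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # flatten each tubelet into one token of dimension t*h*w*C
    return x.reshape(-1, t * h * w * C)

tokens = tubelet_embed(np.zeros((8, 32, 32, 3)))
print(tokens.shape)  # (16, 1536): 4x2x2 tubelets, each 2*16*16*3 values
```

The alternative discussed in the paper is uniform frame sampling, i.e. 2-D patch embedding per frame; tubelet embedding fuses temporal information already at tokenization time.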

### CoAtNet

• convolutional layers tend to have better generalization with faster converging speed thanks to their strong prior of inductive bias
• attention layers have higher model capacity that can benefit from larger datasets.

#### Motivation

• how to effectively combine convolution and attention to achieve better trade-offs between accuracy and efficiency.

#### Contribution

In this paper, we investigate two key insights:

• First, we observe that the commonly used depthwise convolution can be effectively merged into attention layers with simple relative attention;
• Second, simply stacking convolutional and attention layers, in a proper way, could be surprisingly effective to achieve better generalization and capacity.
• SOTA performances under comparable resource constraints across different data sizes

#### Model Architecture

• How to combine the convolution and self-attention within one basic computational block?
• How to vertically stack different types of computational blocks together to form a complete network?

##### Merging convolution and self-attention

• convolution relies on a fixed kernel to gather information from a local receptive field

• self-attention allows the receptive field to be the entire spatial locations and computes the weights based on the re-normalized pairwise similarity between the pair $(x_i, x_j)$

• the good properties of conv and attn:

• the conv kernel is input-independent, whereas the attention weights dynamically depend on the input; this makes it much easier for self-attention to capture complicated relational interactions between different spatial positions.
• conv is translation-equivariant, a property that attention lacks.
• the receptive field sizes differ: local for conv, global for attention.
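The two operations contrasted in the bullets above can be written side by side (notation reconstructed from the paper's discussion: $\mathcal{L}(i)$ is a local neighbourhood of position $i$, $\mathcal{G}$ is the set of all positions):

$$
y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j}\, x_j \qquad \text{(convolution: static kernel, local field)}
$$

$$
y_i = \sum_{j \in \mathcal{G}} \frac{\exp\!\left(x_i^{\top} x_j\right)}{\sum_{k \in \mathcal{G}} \exp\!\left(x_i^{\top} x_k\right)}\, x_j \qquad \text{(self-attention: dynamic weights, global field)}
$$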

##### Combine the desirable properties

$y_i^{pre}$ corresponds to a particular variant of relative self-attention. (So the pre-normalization version is simply a variant of relative-position attention; that is a surprisingly neat way to interpret it!)

#### Vertical Layout Design

To overcome the quadratic complexity of self-attention, there are three options:

• (A) Down-sample to reduce the spatial size, and employ global relative attention after the feature map reaches a manageable level.
• (B) Enforce local attention
• (C) Replace the quadratic Softmax attention with certain linear attention variant which only has a linear complexity w.r.t. the spatial size

The authors experimented with (C) without getting reasonably good results, and (B) requires many shape-formatting operations that are slow on TPUs. Thus they select option (A).
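The intuition behind option (A) can be sketched by counting the per-stage cost of each block type as the feature map shrinks (the input size and strides below are illustrative, not the paper's exact configuration):

```python
def stage_costs(layout, input_size=224, stem_stride=2, stage_stride=2):
    """For a stage layout string like "CCTT", return (block, feature size,
    rough cost) per stage: conv cost ~ number of tokens, attention cost
    ~ number of tokens squared (quadratic in spatial size)."""
    size = input_size // stem_stride
    costs = []
    for block in layout:
        size //= stage_stride          # each stage downsamples by 2
        tokens = size * size
        if block == "T":
            costs.append(("T", size, tokens ** 2))  # quadratic attention
        else:
            costs.append(("C", size, tokens))       # ~linear conv cost
    return costs

for block, size, cost in stage_costs("CCTT"):
    print(block, size, cost)
```

Placing the "T" stages late means global attention only ever runs on a 14x14 or 7x7 map, which keeps its quadratic term affordable.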

Five variants: ViT$_{rel}$, CTTT, CCTT, CCCT, CCCC.

• From the ImageNet-1K results, in terms of generalization capability: CCCC ≈ CCCT ≥ CCTT > CTTT ≫ ViT$_{rel}$.
• As for model capacity, from the JFT comparison: CCTT ≈ CTTT > ViT$_{rel}$ > CCCT > CCCC.

• To decide between CCTT and CTTT, they conduct another transferability test.

Xie Pan

2021-04-29

2021-09-29