Paper Notes: Video Transformers

paper list:

  • Training data-efficient image transformers & distillation through attention.
  • An image is worth 16x16 words: Transformers for image recognition at scale.
  • ViViT: A Video Vision Transformer.
  • Is space-time attention all you need for video understanding?
  • Video transformer network.
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
  • What Makes for Hierarchical Vision Transformer?
  • Go Wider Instead of Deeper (WideNet).
  • CoAtNet: Marrying Convolution and Attention for All Data Sizes

DeiT

DeiT uses the attention mechanism to build a special knowledge-distillation scheme (an extra distillation token that interacts with the other tokens through attention and is supervised by a convnet teacher), which significantly speeds up convergence and also improves accuracy to some extent.
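
A minimal sketch of that objective, assuming a PyTorch setting (the function name and tensor shapes are illustrative, not taken from the official code): the class token is supervised by the ground-truth label, while the distillation token is supervised by the teacher's hard predictions.

```python
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation in the spirit of DeiT (sketch).

    cls_logits:     classifier output on the class token,        (B, num_classes)
    dist_logits:    classifier output on the distillation token, (B, num_classes)
    teacher_logits: output of the frozen convnet teacher,        (B, num_classes)
    labels:         ground-truth labels,                         (B,)
    """
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's hard decisions
    loss_cls = F.cross_entropy(cls_logits, labels)            # class token -> ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation token -> teacher
    return 0.5 * loss_cls + 0.5 * loss_dist
```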

ViViT: A Video Vision Transformer

The authors study video vision transformers through extensive experiments, exploring the most suitable and efficient structures and how to make them work on small datasets. The exploration mainly covers the following three aspects (a tubelet-tokenization sketch follows the list):

  • tokenization strategies
  • model architecture
  • regularisation methods.
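
To illustrate the tokenization aspect, here is a minimal sketch of ViViT-style tubelet embedding, assuming PyTorch; the tubelet size, embedding dimension, and class name are illustrative choices rather than the official implementation. Each non-overlapping spatio-temporal tube is linearly projected into one token by a 3D convolution whose kernel equals its stride.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Tubelet tokenization in the spirit of ViViT (sketch):
    a Conv3d with kernel_size == stride maps each tube to a single token."""

    def __init__(self, in_channels=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                     # video: (B, C, T, H, W)
        tokens = self.proj(video)                 # (B, D, T/t, H/h, W/w)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_tokens, D)

# e.g. a 32-frame 224x224 clip -> (32/2) * (224/16) * (224/16) = 3136 tokens
x = torch.randn(1, 3, 32, 224, 224)
print(TubeletEmbedding()(x).shape)  # torch.Size([1, 3136, 768])
```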

A drawback of transformer-based architectures: compared with convolutional networks, transformers lack some of the inductive biases (e.g. translation equivariance) and therefore need large amounts of data to learn.


WideNet

CoAtNet

The authors analyse the differences between convolution (conv) and attention (attn) from two angles, generalization and model capacity:

  • convolutional layers tend to generalize better and to converge faster, thanks to their strong inductive-bias prior;
  • attention layers have higher model capacity and can benefit more from larger datasets.

Motivation

  • How to effectively combine convolution and attention to achieve better trade-offs between accuracy and efficiency?

Contribution

The paper builds on two key insights:

  • First, the commonly used depthwise convolution can be effectively merged into attention layers with simple relative attention;
  • Second, simply stacking convolutional and attention layers, in a proper way, can be surprisingly effective for achieving better generalization and capacity.

With these, CoAtNet reaches SOTA performance under comparable resource constraints across different data sizes.

Model Architecture

  • How to combine the convolution and self-attention within one basic computational block?
  • How to vertically stack different types of computational blocks together to form a complete network?

Merging convolution and self-attention
  • convolution relies on a fixed kernel to gather information from a local receptive field

  • self-attention allows the receptive field to be the entire spatial locations and computes the weights based on the re-normalized pairwise similarity between the pair $(x_i, x_j)$

  • the desirable properties of conv and attn (both operations are written out below):

    • the conv kernel is input-independent, whereas the attention weights dynamically depend on the input; this makes it much easier for self-attention to capture complicated relational interactions between different spatial positions.
    • the translation equivariance of convolution is absent in attention.
    • the receptive field size differs: local for conv, global (the entire spatial extent) for attn.
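
Written out (following the paper's notation, with $\mathcal{L}(i)$ the local neighbourhood of position $i$ and $\mathcal{G}$ the whole spatial grid), the two operations are roughly:

$$
\text{conv (depthwise):}\quad y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j ,
\qquad
\text{self-attention:}\quad y_i = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^{\top} x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k\right)}\, x_j .
$$
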
Combine the desirable properties

$y_i^{pre}$ corresponds to a particular variant of relative self-attention (written out below). (So the pre-normalization version is simply a variant of relative-position attention; a surprisingly neat way to look at it!)
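
Spelling this out (reproducing the paper's two ways of merging the static kernel $w_{i-j}$ into attention, so treat the exact form as a sketch): the kernel can be added either after or before the softmax normalization,

$$
y_i^{post} = \sum_{j \in \mathcal{G}} \left( \frac{\exp\left(x_i^{\top} x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k\right)} + w_{i-j} \right) x_j ,
\qquad
y_i^{pre} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^{\top} x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k + w_{i-k}\right)}\, x_j .
$$

In the pre-normalization version the attention weight is determined jointly by the input-adaptive term $x_i^{\top} x_j$ and the translation-equivariant term $w_{i-j}$, which is exactly a relative-attention bias; CoAtNet adopts this variant.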

Vertical Layout Design

To overcome the quadratic complexity of self-attention, there are three options:

  • (A) Down-sample to reduce the spatial size and employ global relative attention once the feature map reaches a manageable level.
  • (B) Enforce local attention.
  • (C) Replace the quadratic softmax attention with a linear attention variant that has only linear complexity w.r.t. the spatial size.

The authors experimented with (C) without getting reasonably good results, and (B) involves many shape-formatting operations that are slow on TPUs, so they select option (A).
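
A rough back-of-the-envelope estimate of why (A) helps (numbers are illustrative, e.g. a 224×224 input that is down-sampled by a total stride of 16 before global attention is applied):

$$
\text{attention cost} \;\propto\; (HW)^2 \cdot d , \qquad (224 \cdot 224)^2 \approx 2.5 \times 10^{9} \quad \text{vs.} \quad (14 \cdot 14)^2 \approx 3.8 \times 10^{4} ,
$$

i.e. shrinking each spatial dimension by $16\times$ cuts the number of pairwise interactions by $16^4 = 65536$.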

Five variants: ViT$_{rel}$, CTTT, CCTT, CCCT, CCCC.

  • From the ImageNet-1K results, in terms of generalization capability: CCCC ≈ CCCT ≥ CCTT > CTTT ≫ ViT$_{rel}$, i.e. more convolution in the early stages generalizes better.
  • As for model capacity, from the JFT comparison: CCTT ≈ CTTT > ViT$_{rel}$ > CCCT > CCCC, i.e. the attention-heavier hybrids benefit more from the large dataset.

  • To decide between CCTT and CTTT, they conduct a further transferability test (fine-tuning the JFT-pre-trained checkpoints on ImageNet-1K); CCTT transfers better and is chosen as the final CoAtNet layout (a rough sketch of this layout follows the list).
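
To make the chosen layout concrete, here is a very rough PyTorch sketch of the CCTT idea: a conv stem, two convolutional stages, then two attention stages operating on the down-sampled feature map. The block internals are simplified placeholders (plain convolutions instead of MBConv, standard instead of relative attention), and every channel size is an illustrative assumption, not the paper's configuration.

```python
import torch
import torch.nn as nn

def conv_stage(c_in, c_out):
    """Placeholder for a convolutional stage: downsample by 2 and change channels."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.GELU())

class AttnStage(nn.Module):
    """Placeholder for a transformer stage: downsample, flatten to tokens,
    then apply (standard) multi-head self-attention with a residual connection."""
    def __init__(self, c_in, dim, heads=8):
        super().__init__()
        self.down = nn.Conv2d(c_in, dim, 3, stride=2, padding=1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        x = self.down(x)                                   # (B, D, H, W)
        b, d, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                   # (B, H*W, D) tokens
        q = self.norm(t)
        t = t + self.attn(q, q, q, need_weights=False)[0]  # global self-attention
        return t.transpose(1, 2).reshape(b, d, h, w)

class CCTTSketch(nn.Module):
    """CCTT layout: conv stem, two conv stages, two attention stages, then a classifier."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = conv_stage(3, 64)      # S0 (stem):      224 -> 112
        self.s1 = conv_stage(64, 96)       # S1 (conv):      112 -> 56
        self.s2 = conv_stage(96, 192)      # S2 (conv):      56  -> 28
        self.s3 = AttnStage(192, 384)      # S3 (attention): 28  -> 14
        self.s4 = AttnStage(384, 768)      # S4 (attention): 14  -> 7
        self.head = nn.Linear(768, num_classes)

    def forward(self, x):
        x = self.s4(self.s3(self.s2(self.s1(self.stem(x)))))
        return self.head(x.mean(dim=(2, 3)))  # global average pool + linear head

print(CCTTSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```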
