Paper Notes: Video Transformers

paper list:

  • Training data-efficient image transformers & distillation through attention.
  • An image is worth 16x16 words: Transformers for image recognition at scale.
  • ViViT: A Video Vision Transformer.
  • Is space-time attention all you need for video understanding?
  • Video transformer network.
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

DeiT

DeiT designs a special knowledge-distillation scheme built on the attention mechanism: a learnable distillation token attends together with the class and patch tokens and is supervised by a teacher network. It works well in practice.
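To make the distillation-token idea concrete, here is a minimal PyTorch sketch. The encoder, dimensions, and loss weighting are illustrative assumptions, not the official DeiT implementation; the point is that the distillation token's output head is trained against the teacher's hard predictions while the class token is trained on the ground-truth labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeiTStyleModel(nn.Module):
    """Sketch: class token + distillation token attend with patch tokens."""

    def __init__(self, embed_dim=768, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Any transformer encoder over (B, N, D) sequences works here.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(embed_dim, num_classes)       # class-token head
        self.head_dist = nn.Linear(embed_dim, num_classes)  # distillation-token head

    def forward(self, patch_tokens):
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1),
                       self.dist_token.expand(b, -1, -1),
                       patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0]), self.head_dist(x[:, 1])


def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Class token learns from ground-truth labels; distillation token
    # learns from the teacher's hard predictions (argmax).
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_dist = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=1))
    return 0.5 * loss_cls + 0.5 * loss_dist
```

At inference time the two heads' predictions can be averaged, so the distillation token costs almost nothing extra.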

ViViT: A Video Vision Transformer

The authors run extensive experiments on video vision transformers and look for architectures that are both effective and efficient, so that they also work on smaller datasets. The exploration covers the following three aspects (a tokenisation sketch follows the list):

  • tokenization strategies
  • model architecture
  • regularisation methods.
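As one example of the tokenisation strategies above, ViViT's tubelet embedding maps non-overlapping spatio-temporal tubes to tokens. A short PyTorch sketch is below; the tubelet size and embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Sketch of tubelet tokenisation: a 3D convolution with kernel = stride
    cuts the video into non-overlapping tubes and projects each to a token."""

    def __init__(self, embed_dim=768, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence


# Example: a 32-frame 224x224 clip with 2x16x16 tubelets gives
# 16 * 14 * 14 = 3136 tokens.
tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 768])
```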

A drawback of transformer-based architectures: compared with convolutional networks, transformers lack some of the built-in inductive biases, such as translation invariance, and therefore need large amounts of data to learn.

Author: Xie Pan · Published 2021-04-29 · Updated 2021-07-06
