Paper Notes: Video Transformers

paper list:

  • Training data-efficient image transformers & distillation through attention.
  • An image is worth 16x16 words: Transformers for image recognition at scale.
  • ViViT: A Video Vision Transformer.
  • Is Space-Time Attention All You Need for Video Understanding?
  • Video transformer network.
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
  • What Makes for Hierarchical Vision Transformer?
  • Go Wider Instead of Deeper (WideNet)

DeiT

DeiT designs a special knowledge-distillation scheme built on the attention mechanism, which significantly speeds up convergence and improves accuracy to some extent.
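The core of DeiT's scheme is a dedicated distillation token whose output is supervised by a teacher network while the class token is supervised by the ground-truth label. Below is a minimal NumPy sketch of the hard-label variant of that loss; the function names and the equal 0.5/0.5 weighting are illustrative assumptions, not DeiT's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the target class
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def deit_hard_distill_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard distillation: the class token's head matches the ground-truth
    label, the distillation token's head matches the teacher's argmax."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return 0.5 * cross_entropy(cls_logits, labels) \
         + 0.5 * cross_entropy(dist_logits, teacher_labels)
```

When student and teacher already agree with the labels, both terms are near zero, so the loss only pushes the two heads toward their respective targets.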

ViViT: A Video Vision Transformer

Through extensive experiments, the authors study the vision transformer and explore the most suitable and efficient structures for small datasets. The exploration covers three main aspects:

  • tokenization strategies
  • model architecture
  • regularisation methods.

A drawback of transformer-based architectures: compared with convolutional networks, transformers lack certain inductive biases, such as translation invariance, and therefore require large amounts of data to learn.
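One of ViViT's tokenization strategies is tubelet embedding: a video is split into non-overlapping spatio-temporal patches ("tubelets"), each flattened and linearly projected to a token. A minimal NumPy sketch, assuming a random projection matrix in place of a learned one:

```python
import numpy as np

def tubelet_embed(video, t=2, h=16, w=16, dim=8, rng=None):
    """Tubelet embedding (ViViT-style): split a (T, H, W, C) video into
    non-overlapping t*h*w tubelets and linearly project each to `dim`."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    rng = np.random.default_rng(0) if rng is None else rng
    # stand-in for the learned linear projection
    proj = rng.standard_normal((t * h * w * C, dim)) * 0.02
    # carve out tubelets: (T/t, t, H/h, h, W/w, w, C)
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # group the tubelet dims together, then flatten each tubelet
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w * C)
    return x @ proj  # (num_tokens, dim)
```

For a 4-frame 32x32 RGB clip with 2x16x16 tubelets this yields 2x2x2 = 8 tokens, versus 4x2x2 = 16 if each frame were patched independently, which is why tubelets also reduce sequence length.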


WideNet


Author: Xie Pan

Published: 2021-04-29

Updated: 2021-07-29