Paper notes: video transformers

paper list:
- Training data-efficient image transformers & distillation through attention.
- An image is worth 16x16 words: Transformers for image recognition at scale.
- ViViT: A Video Vision Transformer.
- Is space-time attention all you need for video understanding?
- Video transformer network.

ViViT: A Video Vision Transformer

The authors conduct extensive experiments to study vision transformers for video, searching for the most suitable and efficient architecture, including one that can be trained effectively on small datasets. The exploration covers three aspects:
- tokenisation strategies
- model architecture
- regularisation methods.

Drawback of transformer-based architectures: compared with convolutional networks, transformers lack certain inductive biases, such as translation invariance, and therefore need large amounts of data to learn them.

Two approaches for extracting tokens from video
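In the ViViT paper, the two tokenisation strategies are uniform frame sampling (embed 2D patches from each frame independently) and tubelet embedding (extract non-overlapping spatio-temporal tubes). Below is a minimal sketch of tubelet embedding in PyTorch; the class name, dimensions, and tubelet size are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Tubelet embedding sketch: a 3D conv with kernel_size == stride
    partitions the clip into non-overlapping spatio-temporal tubes and
    linearly projects each tube to a token (dims are illustrative)."""
    def __init__(self, embed_dim=96, tubelet=(2, 16, 16), in_ch=3):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, x):  # x: (B, C, T, H, W)
        x = self.proj(x)                     # (B, D, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, D)

# Toy clip: 8 frames of 64x64 RGB -> (8/2)*(64/16)*(64/16) = 64 tokens
video = torch.randn(1, 3, 8, 64, 64)
tokens = TubeletEmbedding()(video)
print(tokens.shape)  # torch.Size([1, 64, 96])
```

Uniform frame sampling corresponds to the same idea with a 2D patch projection applied per frame; tubelet embedding instead fuses spatio-temporal information already at tokenisation time.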