- Training data-efficient image transformers & distillation through attention
- An image is worth 16x16 words: Transformers for image recognition at scale
- ViViT: A Video Vision Transformer
- Is space-time attention all you need for video understanding?
- Video transformer network
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
- What Makes for Hierarchical Vision Transformer?
- WiderNet: Go Wider Instead of Deeper
- CoAtNet: Marrying Convolution and Attention for All Data Sizes