- What Makes for Good Views for Contrastive Learning?
- BYOL Works Even Without Batch Statistics
- Understanding Self-Supervised Learning Dynamics without Contrastive Pairs
- Big Self-Supervised Models are Strong Semi-Supervised Learners
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
- Training Data-Efficient Image Transformers & Distillation through Attention
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- ViViT: A Video Vision Transformer
- Is Space-Time Attention All You Need for Video Understanding?
- Video Transformer Network
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
- What Makes for Hierarchical Vision Transformer?
- WideNet: Go Wider Instead of Deeper
- CoAtNet: Marrying Convolution and Attention for All Data Sizes
- CARAFE: Content-Aware ReAssembly of FEatures
- Involution: Inverting the Inherence of Convolution for Visual Recognition
- Pay Less Attention with Lightweight and Dynamic Convolutions
- ConvBERT: Improving BERT with Span-based Dynamic Convolution
- Dynamic Region-Aware Convolution
- Neural Text Generation with Unlikelihood Training
- Implicit Unlikelihood Training: Improving Neural Text Generation with Reinforcement Learning
- DETR: End-to-End Object Detection with Transformers
- Deformable DETR: Deformable Transformers for End-to-End Object Detection