- DALL-E: Zero-Shot Text-to-Image Generation
- BEiT: BERT Pre-Training of Image Transformers (see the masked-prediction sketch after this list)
- Discrete Representations Strengthen Vision Transformer Robustness
- iBOT: Image BERT Pre-Training with Online Tokenizer
- Masked Autoencoders Are Scalable Vision Learners
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
- SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition
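The papers above share a BERT-style recipe: an image (or video) is turned into a grid of patch tokens, a subset of the tokens is masked, and a transformer is trained to predict what was hidden (BEiT and VIMPAC predict discrete tokenizer ids; MAE regresses raw pixels instead). Below is a minimal PyTorch sketch of the discrete-token variant, assuming a frozen tokenizer has already produced the ids; the class name and all sizes are illustrative, not taken from any of these papers' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTokenPredictor(nn.Module):
    """Toy BERT-style masked prediction over image patch tokens."""

    def __init__(self, vocab_size=8192, dim=256, depth=4, num_patches=196):
        super().__init__()
        self.mask_id = vocab_size                       # reserve one extra id for [MASK]
        self.embed = nn.Embedding(vocab_size + 1, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, mask):
        # tokens: (B, N) ids from a frozen image tokenizer; mask: (B, N) bool.
        corrupted = torch.where(mask, torch.full_like(tokens, self.mask_id), tokens)
        h = self.encoder(self.embed(corrupted) + self.pos)
        logits = self.head(h)                           # (B, N, vocab_size)
        # As in BERT, the loss is computed only at the masked positions.
        return F.cross_entropy(logits[mask], tokens[mask])

# Example: mask ~40% of 196 patch tokens and compute the prediction loss.
model = MaskedTokenPredictor()
tokens = torch.randint(0, 8192, (2, 196))
mask = torch.rand(2, 196) < 0.4
loss = model(tokens, mask)
```

iBOT replaces the frozen tokenizer with an online one, and MAE swaps the cross-entropy over token ids for a pixel regression loss, but the masked-prediction skeleton is the same.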
- GenHFi: Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling (ICLR2019; see the masked-convolution sketch after this list)
- Scaling Autoregressive Video Models (ICLR2020)
- Video Pixel Networks (CoRR2016)
- Parallel Multiscale Autoregressive Density Estimation
- VQGAN: Taming Transformers for High-Resolution Image Synthesis
- TeCoGAN: Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation
- ImaGINator: Conditional Spatio-Temporal GAN for Video Generation
- Temporal Shift GAN for Large Scale Video Generation
- MoCoGAN: Decomposing Motion and Content for Video Generation
- Playable Video Generation (CVPR2021)
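Several entries in this list (GenHFi, Video Pixel Networks, Parallel Multiscale) are autoregressive pixel models, whose core mechanism is a convolution masked so that each pixel is predicted only from pixels above it and to its left; the GAN entries instead train a generator adversarially. Below is a minimal PyTorch sketch of that masked convolution; the class name and mask-type convention follow the common PixelCNN formulation rather than any one of these papers.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Causally masked convolution, the building block of PixelCNN-style
    autoregressive models. Type 'A' also hides the centre pixel (first
    layer); type 'B' lets later layers see it."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)             # (out_ch, in_ch, kh, kw)
        mask[:, :, kh // 2, kw // 2 + int(mask_type == "B"):] = 0  # right of centre
        mask[:, :, kh // 2 + 1:, :] = 0                             # rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                   # zero out "future" pixels
        return super().forward(x)

# First layer of a toy PixelCNN: 3 input channels, 7x7 receptive field.
layer = MaskedConv2d("A", 3, 64, kernel_size=7, padding=3)
out = layer(torch.randn(1, 3, 32, 32))                  # (1, 64, 32, 32)
```

The video models extend the same causal masking over the time axis, so each frame conditions only on earlier frames.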
- VQ-VAE: Neural Discrete Representation Learning (NIPS2017; see the quantizer sketch after this list)
- VQ-VAE-2: Generating Diverse High-Resolution Images with VQ-VAE-2
- DALL-E: Zero-Shot Text-to-Image Generation
- VideoGPT: Video Generation using VQ-VAE and Transformers
- LVT: Latent Video Transformer
- Feature Quantization Improves GAN Training (ICML2020)
- DVT-NAT: Fast Decoding in Sequence Models Using Discrete Latent Variables (ICML2018)
- NWT: Towards natural audio-to-video generation with representation learning
- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
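Every paper in this list routes generation through the discrete bottleneck introduced by VQ-VAE: encoder features are snapped to their nearest codebook vector, the non-differentiable lookup is bridged with a straight-through gradient, and a second model is then trained over the resulting token grid. A minimal PyTorch sketch of that quantizer follows; hyperparameter names and defaults are illustrative, not drawn from any of these papers' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient,
    as in VQ-VAE (van den Oord et al., 2017)."""

    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta                                  # commitment weight

    def forward(self, z_e):
        # z_e: encoder output of shape (..., code_dim).
        flat = z_e.reshape(-1, z_e.shape[-1])             # (N, D)
        # Squared L2 distance from each vector to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))     # (N, K)
        idx = dist.argmin(dim=1)                          # discrete token ids
        z_q = self.codebook(idx).view_as(z_e)
        # Codebook term pulls codes toward encoder outputs; the commitment
        # term keeps the encoder close to its chosen code.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through: gradients of z_q flow back into z_e unchanged.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss
```

A prior model, such as the autoregressive transformer in DALL-E or VideoGPT, is then trained over the returned token ids.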