# Paper Notes: Video Generation

• GenHFi: Generating high fidelity images with subscale pixel networks and multidimensional upscaling. (ICLR2019)
• paper2: Scaling autoregressive video models (ICLR2020)
• Video pixel networks. (CoRR2016)
• Parallel: Parallel multiscale autoregressive density estimation
• VQGAN: Taming Transformers for High-Resolution Image Synthesis
• TeCoGAN: Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation
• ImaGINator: Conditional Spatio-Temporal GAN for Video Generation
• Temporal Shift GAN for Large Scale Video Generation
• MoCoGAN: Decomposing Motion and Content for Video Generation
• Playable Video Generation (CVPR2021)

## GenHFi

title: Generating high fidelity images with subscale pixel networks and multidimensional upscaling

### Abstract

• Subscale Pixel Network (SPN): a conditional decoder architecture that generates an image as a sequence of sub-images of equal size
• Multidimensional Upscaling: grow an image in both size and depth via intermediate stages utilising distinct SPNs
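
A toy illustration of the depth half of multidimensional upscaling. It assumes 8-bit channel values and a low-depth stage that keeps the 3 most significant bits. Both numbers are illustrative assumptions, and `reduce_depth` / `restore_depth` are hypothetical helper names, not taken from the paper's code. In the actual model the fine bits would be generated by a separate SPN conditioned on the coarse image; here they are read off the original only to show that the two parts recombine losslessly.

```python
def reduce_depth(pixels, bits_kept=3, full_bits=8):
    """Keep only the `bits_kept` most significant bits of each channel value."""
    shift = full_bits - bits_kept
    return [[value >> shift for value in row] for row in pixels]


def restore_depth(coarse, fine, bits_kept=3, full_bits=8):
    """Recombine coarse (high) bits with fine (low) bits into full-depth values."""
    shift = full_bits - bits_kept
    return [
        [(c << shift) | f for c, f in zip(coarse_row, fine_row)]
        for coarse_row, fine_row in zip(coarse, fine)
    ]


image = [[0, 37, 200], [255, 128, 64]]                 # toy single-channel 8-bit image
coarse = reduce_depth(image)                           # what a low-depth stage would model
fine = [[v & 0b11111 for v in row] for row in image]   # the remaining 5 low bits
assert restore_depth(coarse, fine) == image            # coarse + fine bits give back the image
```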

### Introduction

• The multi-faceted relationship between MLE scores and the fidelity of samples
• MLE is a well-defined measure as improvements in held-out scores generally produce improvements in the visual fidelity of the samples.
• MLE forces the model to support the entire empirical distribution. This guarantees the model’s ability to generalize at the cost of allotting capacity to parts of the distribution that are irrelevant to fidelity.
• A 256 × 256 × 3 image has a total of 196,608 positions that need to be architecturally connected in order to learn dependencies among them.

### Contribution

• Multidimensional Upscaling

• small size, low depth → large size, low depth → large size, high depth
• Subscale Pixel Network (SPN) architecture

• divides an image of size $N\times N$ into sub-images of size $\dfrac{N}{S}\times \dfrac{N}{S}$ sliced out at interleaving positions

• SPN consists of two networks, a conditioning network that embeds previous slices and a decoder proper that predicts a single target slice given the context embedding.
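
The interleaved slicing can be sketched in plain Python. This is a minimal sketch that assumes a square single-channel image stored as a list of rows; `subscale_slices` and `merge_slices` are hypothetical names used only for illustration. Slice $(r, c)$ collects the pixels at rows $r, r+S, r+2S, \dots$ and columns $c, c+S, c+2S, \dots$, so every slice is a subsampled view of the whole image at size $\dfrac{N}{S}\times \dfrac{N}{S}$.

```python
def subscale_slices(image, s):
    """Split an N x N image (list of rows) into s*s interleaved sub-images.

    Slice (r, c) holds the pixels image[r + i*s][c + j*s].
    """
    n = len(image)
    if n % s != 0 or any(len(row) != n for row in image):
        raise ValueError("expected a square image whose side is divisible by s")
    return {
        (r, c): [row[c::s] for row in image[r::s]]
        for r in range(s)
        for c in range(s)
    }


def merge_slices(slices, s):
    """Inverse of subscale_slices: interleave the sub-images back into one image."""
    m = len(slices[(0, 0)])              # side length of each slice
    n = m * s
    image = [[None] * n for _ in range(n)]
    for (r, c), sub in slices.items():
        for i, sub_row in enumerate(sub):
            for j, value in enumerate(sub_row):
                image[r + i * s][c + j * s] = value
    return image


image = [[4 * y + x for x in range(4)] for y in range(4)]  # 4 x 4 toy image, values 0..15
slices = subscale_slices(image, s=2)
print(slices[(0, 0)])  # [[0, 2], [8, 10]]
print(slices[(1, 1)])  # [[5, 7], [13, 15]]
```

Generating the slices one after another, each conditioned on the ones already produced, is what turns the image into "a sequence of sub-images of equal size".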

## VQ-GAN

#### Contribution

• combine convolutions with a transformer:

use a convolutional approach to efficiently learn a codebook of context-rich visual parts and, subsequently, learn a model of their global compositions. The long-range interactions within these compositions require an expressive transformer architecture to model distributions over their constituent visual parts.

• utilize an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure to alleviate the need for modeling low-level statistics with the transformer architecture.

#### Approach

• Learning an Effective Codebook of Image Constituents

• VQ-VAE: $$\hat x = G(z_q) = G(q(E(x)))$$

• Learning a Perceptually Rich Codebook using GAN and perception loss

• Learning the Composition of Images with Transformers

• latent transformer

• Conditioned Synthesis
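
The quantization step $q(\cdot)$ in $\hat x = G(q(E(x)))$ replaces each encoder output vector with its nearest codebook entry. Below is a minimal sketch of that lookup only, assuming squared L2 distance and a hand-written toy codebook; the encoder $E$ and decoder $G$ are not implemented, `latents` is a stand-in for $E(x)$, and the function names are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector` in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]


def quantize_grid(latents, codebook):
    """Quantize an h x w grid of encoder outputs; returns the index grid and z_q."""
    indices, z_q = [], []
    for row in latents:
        pairs = [quantize(v, codebook) for v in row]
        indices.append([k for k, _ in pairs])
        z_q.append([entry for _, entry in pairs])
    return indices, z_q


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]     # toy codebook: 3 entries of dimension 2
latents = [[[0.1, -0.2], [0.9, 1.2]],                 # stand-in for E(x), a 2 x 2 grid
           [[-0.8, 1.7], [0.4, 0.4]]]
indices, z_q = quantize_grid(latents, codebook)
print(indices)  # [[0, 1], [2, 0]]
```

The grid of indices, flattened into a sequence, is what the latent transformer models; the decoder only ever sees the codebook entries in `z_q`.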

## TeCoGAN

paper: Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation

#### Contribution

• propose a novel Ping-Pong loss to improve the long-term temporal consistency.
• It effectively prevents recurrent networks from accumulating artifacts temporally without depressing detailed features
• propose a first set of metrics to quantitatively evaluate the accuracy as well as the perceptual quality of the temporal evolution.
#### Related Work

• Sequential generation tasks: [Kim et al. 2019; Xie et al. 2018].

• Conditional video generation tasks [Jamriška et al. 2019; Sitzmann et al. 2018; Wronski et al. 2019; Zhang et al. 2019]

• motion estimation [Dosovitskiy et al. 2015; Liu et al. 2019]

• explicitly using variants of optical flow networks [Caballero et al. 2017; Sajjadi et al. 2018; Shi et al. 2016]

• GANs:

• Zhu et al. [2017] focuses on images without temporal constraints
• RecycleGAN [Bansal et al. 2018] proposes to use a prediction network in addition to a generator
• a concurrent work [Chen et al. 2019] chose to learn motion translation in addition to the spatial content translation.
• Metrics:

• perceptual metrics [Prashnani et al. 2018; Zhang et al. 2018] are proposed to reliably consider semantic features instead of pixel-wise errors.

• tempoGAN for fluid flow [Xie et al. 2018]
• vid2vid for video translation [Wang et al. 2018a]
• 3D discriminator

• DeepFovea [Kaplanyan et al. 2019]
• Bashkirova et al. [2018]
• For tracking and optical flow estimation, L2-based time-cycle losses [Wang et al. 2019b]

#### Method

Notation:

• $a$ input domain, $b$ target domain, $g$ generated domain

• $w$ motion compensation

##### Self-Supervision for Long-term Temporal Consistency
• When inferring this in a frame-recurrent manner, the generated result should not strengthen any invalid features from frame to frame. Rather, the result should stay close to valid information and be symmetric, i.e., the forward result $g_t=G(a_t, g_{t-1})$ and the one generated from the reversed part, $g_t'=G(a_t, g_{t+1}')$, should be identical.

• bi-directional "Ping-Pong" loss

$$\mathcal{L}_{pp}=\sum_{t=1}^{n-1}\|g_t-g_t'\|_2$$
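
A minimal sketch of the Ping-Pong loss for a generic frame-recurrent generator. It assumes frames are flat lists of floats, that the backward pass continues from the last forward output ($g_n' = g_n$), and it leaves out the motion compensation $w$ entirely. `ping_pong_loss`, `memoryless`, and `leaky` are hypothetical names for illustration, not the authors' implementation.

```python
import math


def ping_pong_loss(inputs, generator, initial_state):
    """Sum over t = 1..n-1 of ||g_t - g_t'||_2 for a frame-recurrent generator.

    `generator(a_t, previous_output)` returns the output frame as a flat list of floats.
    """
    # Forward pass: g_t = G(a_t, g_{t-1})
    forward = []
    prev = initial_state
    for a_t in inputs:
        prev = generator(a_t, prev)
        forward.append(prev)

    # Backward pass over the reversed inputs, continuing from g_n: g_t' = G(a_t, g_{t+1}')
    backward = [None] * len(inputs)
    backward[-1] = forward[-1]
    for t in range(len(inputs) - 2, -1, -1):
        backward[t] = generator(inputs[t], backward[t + 1])

    return sum(math.dist(g, g_rev) for g, g_rev in zip(forward[:-1], backward[:-1]))


def memoryless(a_t, prev):
    """Toy generator that ignores its recurrent input, so both passes agree."""
    return list(a_t)


def leaky(a_t, prev):
    """Toy generator that carries over half of the previous output, so errors accumulate."""
    return [x + 0.5 * p for x, p in zip(a_t, prev)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero = [0.0, 0.0]
print(ping_pong_loss(frames, memoryless, zero))  # 0.0
print(ping_pong_loss(frames, leaky, zero) > 0)   # True
```

The `leaky` generator gives different results in the two directions because whatever it carries over from earlier frames depends on the order of traversal; this is the kind of accumulated drift the loss penalizes.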

Xie Pan

2021-09-17

2021-10-31