GAN from Scratch, Part 6: Pretraining for NLG

Pre-training for NMT

Towards Making the Most of BERT in Neural Machine Translation

Unsupervised Pretraining for Sequence to Sequence Learning

When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

XLM

Shared sub-word vocabulary

A shared sub-word vocabulary greatly improves the alignment of embedding spaces across languages that share either the same alphabet or anchor tokens such as digits (Smith et al., 2017) or proper nouns.

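Sentences are sampled according to a multinomial distribution with probabilities $\{q_i\}_{i=1 \dots N}$, where:

$$q_i=\frac{p_i^\alpha}{\sum_{j=1}^N p_j^\alpha}, \qquad p_i=\frac{n_i}{\sum_{k=1}^N n_k}$$

Here $n_i$ is the number of sentences in the monolingual corpus of language $i$.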
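
To make the sampling rule concrete, here is a minimal sketch. It assumes $q_i$ is proportional to $p_i^\alpha$, with $p_i$ the share of sentences belonging to language $i$. The corpus sizes, the value of `alpha`, and the function name are made up for illustration.

```python
import random

def language_sampling_probs(sentence_counts, alpha):
    """Return q_i = p_i**alpha / sum_j p_j**alpha, with p_i = n_i / sum_k n_k."""
    total = sum(sentence_counts.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(smoothed.values())
    return {lang: w / norm for lang, w in smoothed.items()}

# Hypothetical corpus sizes (number of sentences per language).
counts = {"en": 1_000_000, "fr": 100_000, "sw": 1_000}
q = language_sampling_probs(counts, alpha=0.5)

# Draw the language of each sampled sentence from the multinomial q.
rng = random.Random(0)
langs = list(q)
sampled = rng.choices(langs, weights=[q[l] for l in langs], k=10)
```

With `alpha < 1` the low-resource language gets a larger share than its raw proportion, and the high-resource language a smaller one.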

Causal Language Modeling (CLM)

$p(w_t|w_1,…,w_{t-1},\theta)$

Character-Level Language Modeling with Deeper Self-Attention uses self-attention. Unlike an RNN, self-attention has no notion of a hidden state, so that paper feeds the previous batch to the next batch as context, somewhat like Transformer-XL. This does not suit the cross-lingual setting well, so the CLM here is identical to a conventional language model.
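
As a rough illustration of the CLM objective, the sketch below sums $-\log p(w_t|w_1,…,w_{t-1},\theta)$ over a token stream. `toy_model` is a hypothetical stand-in for the Transformer. It only serves to show that each prediction sees the tokens to its left and that the first token has no context at all.

```python
import math

def clm_negative_log_likelihood(tokens, next_token_probs):
    """Sum of -log p(w_t | w_1..w_{t-1}) over a token stream."""
    nll = 0.0
    for t, token in enumerate(tokens):
        context = tokens[:t]  # only tokens to the left are visible
        probs = next_token_probs(context)
        nll -= math.log(probs[token])
    return nll

def toy_model(context):
    """Hypothetical model: slightly prefers repeating the last token."""
    vocab = ["a", "b", "c"]
    if not context:
        # The first token of a stream is predicted without any context.
        return {w: 1 / len(vocab) for w in vocab}
    return {w: 0.5 if w == context[-1] else 0.25 for w in vocab}

loss = clm_negative_log_likelihood(["a", "a", "b"], toy_model)
```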

Masked Language Modeling (MLM)

Differences between our approach and the MLM of Devlin et al. (2018) include the use of text streams of an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences.

Tokens in a text stream are sampled according to a multinomial distribution whose weights are proportional to the square root of their inverse frequencies.
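
A small sketch of this sub-sampling, under some simplifying assumptions: frequencies are counted on the toy stream itself rather than on a full corpus, the number of predicted positions `n_predict` is arbitrary, and positions are drawn with replacement.

```python
import math
import random
from collections import Counter

def prediction_weights(tokens, counts):
    """Unnormalized weight per position: sqrt(1 / frequency of the token)."""
    return [math.sqrt(1.0 / counts[tok]) for tok in tokens]

stream = "the cat sat on the mat . the end .".split()
counts = Counter(stream)  # assumption: frequencies taken from this toy stream
weights = prediction_weights(stream, counts)
total = sum(weights)
probs = [w / total for w in weights]

# Choose which positions the model will be asked to predict.
rng = random.Random(0)
n_predict = 2  # hypothetical value
positions = rng.choices(range(len(stream)), weights=weights, k=n_predict)
```

Frequent tokens such as "the" or "." end up with a lower probability per occurrence than rare tokens such as "cat".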

Translation Language Modeling (TLM)

Cross-lingual Language Models

In this work, we consider cross-lingual language model pretraining with either CLM, MLM, or MLM used in combination with TLM. For the CLM and MLM objectives, we train the model with batches of 64 streams of continuous sentences composed of 256 tokens. At each iteration, a batch is composed of sentences coming from the same language, which is sampled from the distribution $\{q_i\}_{i=1 \dots N}$ above, with $\alpha = 0.7$. When TLM is used in combination with MLM, we alternate between these two objectives, and sample the language pairs with a similar approach.
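
The batch scheduling described here might look roughly like the sketch below. Strict step-by-step alternation between MLM and TLM, the smoothing of pair counts with the same exponent, and all corpus sizes are assumptions made for illustration.

```python
import random

def smoothed(counts, alpha):
    """q_i proportional to (n_i / sum_k n_k) ** alpha."""
    total = sum(counts.values())
    weights = {k: (n / total) ** alpha for k, n in counts.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

def schedule(mono_counts, para_counts, steps, use_tlm=True, alpha=0.7, seed=0):
    """Yield (objective, language or language pair) for each iteration."""
    rng = random.Random(seed)
    mono_q = smoothed(mono_counts, alpha)
    para_q = smoothed(para_counts, alpha)
    for step in range(steps):
        # Assumption: MLM on even steps, TLM on odd steps.
        if use_tlm and step % 2 == 1:
            objective, q = "TLM", para_q
        else:
            objective, q = "MLM", mono_q
        keys = list(q)
        # One language (or pair) per batch: every stream in it shares that choice.
        choice = rng.choices(keys, weights=[q[k] for k in keys], k=1)[0]
        yield objective, choice

mono = {"en": 1_000_000, "fr": 200_000, "ur": 5_000}  # hypothetical sentence counts
para = {("en", "fr"): 50_000, ("en", "ur"): 2_000}    # hypothetical parallel pairs
plan = list(schedule(mono, para, steps=6))
```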

CTNMT

A paper from ByteDance.

Asymptotic Distillation

$$L_{kd}=-||\hat h^{lm}-h_l||^2_2$$

$$L=\alpha\cdot L_{nmt}+(1-\alpha)\cdot L_{kd}$$
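
A quick numeric check of the two formulas above, evaluated exactly as written (including the sign of $L_{kd}$). The vectors, the NMT loss value, and `alpha` are made-up numbers. In practice $\hat h^{lm}$ would come from the pre-trained language model and $h_l$ from the NMT encoder.

```python
def kd_loss(lm_hidden, nmt_hidden):
    """L_kd = -||h_lm - h_l||_2^2, with the sign copied from the formula above."""
    return -sum((a - b) ** 2 for a, b in zip(lm_hidden, nmt_hidden))

def total_loss(nmt_loss, kd, alpha):
    """L = alpha * L_nmt + (1 - alpha) * L_kd."""
    return alpha * nmt_loss + (1 - alpha) * kd

h_lm = [0.5, -1.0, 2.0]  # hypothetical LM hidden state
h_l = [0.0, -1.0, 1.0]   # hypothetical NMT encoder hidden state
kd = kd_loss(h_lm, h_l)
loss = total_loss(nmt_loss=2.0, kd=kd, alpha=0.9)
```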

Dynamic Switch

$$g = \sigma(Wh^{lm} + Uh^{nmt} + b)$$

$$h=g\odot h^{lm}+(1-g)\odot h^{nmt}$$
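
The gate can be sketched in a few lines of plain Python. The matrices `W`, `U` and the bias `b` would be learned parameters. The values below are arbitrary and only show the two extremes of the gate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def dynamic_switch(h_lm, h_nmt, W, U, b):
    """g = sigmoid(W h_lm + U h_nmt + b); h = g * h_lm + (1 - g) * h_nmt."""
    pre = [wl + un + bi for wl, un, bi in zip(matvec(W, h_lm), matvec(U, h_nmt), b)]
    g = [sigmoid(p) for p in pre]
    h = [gi * l + (1 - gi) * n for gi, l, n in zip(g, h_lm, h_nmt)]
    return g, h

h_lm, h_nmt = [1.0, 2.0], [3.0, 6.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Zero weights and bias: the gate is 0.5 and h is the average of both inputs.
g_mid, h_mid = dynamic_switch(h_lm, h_nmt, zeros, zeros, [0.0, 0.0])

# Large positive bias: the gate saturates near 1 and h is close to h_lm.
g_lm, h_from_lm = dynamic_switch(h_lm, h_nmt, zeros, zeros, [20.0, 20.0])
```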

Rate-scheduled learning

Slanted triangular learning rates. They were first proposed in ULMFiT: Universal Language Model Fine-tuning for Text Classification.

$$\theta_t=\theta_{t-1}-\eta\nabla_{\theta}L(\theta)$$
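
Below is a sketch of a slanted triangular schedule plugged into the update rule above. It follows the formulation usually given for ULMFiT: a short linear warm-up to `eta_max`, then a long linear decay. The default values of `cut_frac`, `ratio` and `eta_max` are illustrative assumptions, not settings taken from these notes.

```python
import math

def slanted_triangular_lr(t, total_steps, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: linear warm-up, then slower linear decay."""
    cut = math.floor(total_steps * cut_frac)  # step at which the peak is reached
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

def sgd_step(theta, grad, eta):
    """theta_t = theta_{t-1} - eta * grad, applied element-wise."""
    return [w - eta * g for w, g in zip(theta, grad)]

total_steps = 100
schedule = [slanted_triangular_lr(t, total_steps) for t in range(total_steps + 1)]

# One update with the learning rate of step 5 and a made-up gradient.
theta = sgd_step([1.0, -2.0], [0.5, 0.5], schedule[5])
```

The rate starts at `eta_max / ratio`, peaks at step `cut`, and returns to `eta_max / ratio` at the last step.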

Result

Xie Pan

2019-06-30

2021-06-29