GAN from Scratch 6: Pretraining for NLG

cross-lingual word embedding

A survey of cross-lingual word embedding models. Ruder et al., 2017
Word translation without parallel data. Conneau et al., 2017
Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Artetxe et al., 2018

contextual word embedding

ELMo
Word2vec
Glove
GPT
ULMFiT: Universal language model fine-tuning for text classification
Cross-lingual language model pretraining
Polyglot contextual representations improve crosslingual transfer

XLM

paper: Cross-lingual Language Model Pretraining

The authors propose two methods for learning cross-lingual language models: one relies only on monolingual data, while the other leverages parallel corpora. Both bring large improvements on cross-lingual tasks such as XNLI, unsupervised machine translation, and supervised machine translation.

Motivation

Progress in NLP so far has largely centered on English: most state-of-the-art results and benchmarks are built on English data, while other languages, constrained by the corpora available for them, have advanced more slowly. With recent progress on cross-lingual sentence representations, it has become possible to mitigate this English-centric bias and build a universal cross-lingual encoder that maps sentences from any language into a shared embedding space.

Cross-lingual language models

Shared sub-word vocabulary

BPE is used, and the different languages share a single sub-word vocabulary.
> this greatly improves the alignment of embedding spaces across languages that share either the same alphabet or anchor tokens such as digits (Smith et al., 2017) or proper nouns.

In other words, a shared vocabulary significantly improves the alignment of the embedding spaces of languages that share the same alphabet or anchor tokens (digits or proper nouns).

The authors first sample a portion of the data from each language's monolingual corpus and then learn the BPE splits on that sample.

Sentences are sampled according to a multinomial distribution with probabilities \(\{q_i\}_{i=1...N}\), where:

\[
q_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}} \quad \text{with} \quad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k}
\]

Here \(n_i\) is the number of sentences in the i-th language, \(\sum_{k=1}^{N} n_k\) is the total number of sentences across all N languages, and \(p_i\) is the empirical probability of sampling a sentence from language i. The authors set \(\alpha = 0.5\), which raises the sampling share of low-resource languages and thus reduces the bias toward high-resource languages.
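
As a minimal sketch (not the authors' released code), the smoothed sampling probabilities can be computed as follows; `sentence_counts` is a hypothetical per-language sentence-count dictionary:

```python
import numpy as np

def sampling_probs(sentence_counts, alpha=0.5):
    """Compute the smoothed multinomial sampling probabilities q_i.

    sentence_counts: dict mapping language -> number of monolingual sentences (n_i).
    alpha: smoothing exponent; alpha < 1 up-weights low-resource languages.
    """
    langs = list(sentence_counts.keys())
    n = np.array([sentence_counts[l] for l in langs], dtype=np.float64)
    p = n / n.sum()                      # p_i = n_i / sum_k n_k
    q = p ** alpha / (p ** alpha).sum()  # q_i = p_i^alpha / sum_j p_j^alpha
    return dict(zip(langs, q))

# Example: a high-resource vs. a low-resource language.
probs = sampling_probs({"en": 10_000_000, "ur": 100_000}, alpha=0.5)
print(probs)  # "ur" gets a larger share than its raw ~1% of the data
```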

The authors use three language modeling objectives in total, introduced one by one below:

Causal Language Modeling (CLM)

\(p(w_t|w_1,...,w_{t-1},\theta)\)

This is simply a standard auto-regressive language model.

The paper Character-Level Language Modeling with Deeper Self-Attention also uses self-attention; unlike an RNN, self-attention has no notion of a hidden state, so that paper passes the previous batch as context for the next batch, somewhat like Transformer-XL. This does not suit the cross-lingual setting well, however, so the CLM here is exactly the same as a conventional language model.
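
Below is a rough PyTorch-style sketch of the CLM objective: a lower-triangular (causal) attention mask plus a shifted cross-entropy loss. The `model` interface (a Transformer accepting an `attn_mask` keyword) is assumed for illustration and is not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def clm_loss(model, tokens):
    """Causal LM loss: predict token t from tokens < t.

    tokens: LongTensor of shape (batch, seq_len).
    model:  any Transformer mapping (tokens, attn_mask) -> logits of shape
            (batch, seq_len, vocab_size)  [hypothetical interface].
    """
    seq_len = tokens.size(1)
    # Lower-triangular mask: position t may only attend to positions <= t.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    logits = model(tokens, attn_mask=causal_mask)
    # Shift by one: logits at position t predict the token at position t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```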

Masked Language Modeling (MLM)

How it differs from the MLM in BERT:
> Differences between our approach and the MLM of Devlin et al. (2018) include the use of text streams of an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences.

The text stream contains an arbitrary number of sentences rather than sentence pairs (the "pairs" in BERT refer to the next-sentence-prediction setup).

Additionally, to address the imbalance between rare words and frequent words (punctuation or stop words):
> tokens in a text stream are sampled according to a multinomial distribution, whose weights are proportional to the square root of their invert frequencies.
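
A small sketch of that frequency-weighted selection of positions to mask, assuming the usual 15% masking ratio inherited from BERT; `token_counts` is a hypothetical corpus-frequency table:

```python
import numpy as np

def choose_mask_positions(token_ids, token_counts, mask_ratio=0.15):
    """Pick positions to mask in one text stream, favouring rare tokens.

    token_ids:    list of token ids for one text stream.
    token_counts: dict token id -> corpus frequency of that token.
    Sampling weights are proportional to 1/sqrt(frequency), as in the quote above.
    """
    freqs = np.array([token_counts[t] for t in token_ids], dtype=np.float64)
    weights = 1.0 / np.sqrt(freqs)
    probs = weights / weights.sum()
    n_mask = max(1, int(mask_ratio * len(token_ids)))
    return np.random.choice(len(token_ids), size=n_mask, replace=False, p=probs)
```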

Translation Language Modeling (TLM)

When predicting a masked English word, the model can attend not only to the English context but also to the French translation.
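
A toy sketch of how a TLM training example might be assembled: the source sentence and its translation are concatenated into one stream, and tokens on both sides are masked, so a masked token can be predicted from either language. The special-token names are illustrative, and the paper's additional details (resetting target-side position embeddings, language embeddings, BERT-style mask replacement) are omitted:

```python
import random

MASK, SEP = "[MASK]", "[SEP]"  # placeholder special tokens (names are illustrative)

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15):
    """Concatenate a parallel sentence pair and mask tokens on both sides.

    A masked source token can then be predicted from the remaining source
    context *and* from the full target translation (and vice versa).
    """
    stream = src_tokens + [SEP] + tgt_tokens
    inputs, targets = [], []
    for tok in stream:
        if tok != SEP and random.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # predict the original token at this position
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets

inputs, targets = make_tlm_example(
    ["the", "cat", "sits"], ["le", "chat", "est", "assis"])
```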

Cross-lingual Language Models

In this work, we consider cross-lingual language model pretraining with either CLM, MLM, or MLM used in combination with TLM. For the CLM and MLM objectives, we train the model with batches of 64 streams of continuous sentences composed of 256 tokens. At each iteration, a batch is composed of sentences coming from the same language, which is sampled from the distribution \({q_i}_{i=1...N}\) above, with α = 0.7. When TLM is used in combination with MLM, we alternate between these two objectives, and sample the language pairs with a similar approach.
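
A sketch of the batching scheme just described: sample one language (or language pair) per batch from the smoothed distribution with α = 0.7, and alternate MLM and TLM updates. The `mlm_step`/`tlm_step` callbacks are hypothetical hooks standing in for "build one batch of 64 streams of 256 tokens and run an optimizer update":

```python
import numpy as np

def train_mlm_tlm(mono_counts, para_counts, mlm_step, tlm_step, n_steps, alpha=0.7):
    """Alternate MLM batches (monolingual) with TLM batches (parallel).

    mono_counts: dict language -> number of monolingual sentences.
    para_counts: dict (src, tgt) pair -> number of parallel sentences.
    mlm_step / tlm_step: hypothetical callbacks that build one batch for the
    given language / pair and perform one update.
    """
    def smoothed(counts):
        keys = list(counts)
        p = np.array([counts[k] for k in keys], dtype=np.float64)
        p /= p.sum()
        return keys, p ** alpha / (p ** alpha).sum()

    mono_langs, mono_q = smoothed(mono_counts)
    pairs, pair_q = smoothed(para_counts)

    for step in range(n_steps):
        if step % 2 == 0:
            # MLM step: every sentence in the batch comes from one sampled language.
            lang = np.random.choice(mono_langs, p=mono_q)
            mlm_step(lang)
        else:
            # TLM step: sample a language pair from the parallel data.
            pair = pairs[np.random.choice(len(pairs), p=pair_q)]
            tlm_step(pair)
```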