XLM

Cross-lingual language models

Shared sub-word vocabulary

Sentences are sampled according to a multinomial distribution with probabilities $\{q_i\}_{i=1\ldots N}$, where:

$$q_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}} \qquad \text{with} \qquad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k}.$$

Here $n_i$ is the number of sentences in the $i$-th language corpus. Sampling with $\alpha < 1$ ($\alpha = 0.5$ when learning the shared BPE vocabulary) increases the share of tokens from low-resource languages and reduces the bias towards high-resource ones.
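
To make this concrete, here is a minimal NumPy sketch of how the $q_i$ could be computed from per-language sentence counts and used to pick a language for the next batch; the function name and the example corpus sizes are illustrative, not from the paper.

```python
import numpy as np

def language_sampling_probs(n_sentences, alpha=0.5):
    """Compute multinomial probabilities q_i over N languages.

    n_sentences: per-language sentence counts n_i.
    alpha < 1 smooths the distribution and up-weights low-resource languages.
    """
    n = np.asarray(n_sentences, dtype=np.float64)
    p = n / n.sum()          # empirical language frequencies p_i
    q = p ** alpha           # temperature-style smoothing p_i^alpha
    return q / q.sum()       # q_i = p_i^alpha / sum_j p_j^alpha

# Example: three languages with very different corpus sizes.
probs = language_sampling_probs([10_000_000, 500_000, 50_000], alpha=0.5)
lang_id = np.random.choice(len(probs), p=probs)  # language used for the next batch
```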

Causal Language Modeling (CLM)

$p(w_t|w_1,...,w_{t-1},\theta)$

The CLM setup follows "Character-Level Language Modeling with Deeper Self-Attention", which is also self-attention based. Since self-attention, unlike an RNN, has no notion of a recurrent hidden state, that paper feeds the previous batch as context for the next batch, somewhat like Transformer-XL. However, carrying context across batches is not a good fit for the cross-lingual setting, so the CLM used here is identical to a conventional language model.
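
As a rough illustration of the CLM objective $p(w_t|w_1,...,w_{t-1},\theta)$, the PyTorch sketch below computes the loss over a batch of token streams. It assumes a decoder-style Transformer `model` that applies a causal attention mask internally and returns per-position vocabulary logits; the interface is hypothetical, not the authors' code.

```python
import torch
import torch.nn.functional as F

def clm_loss(model, token_ids):
    """Causal LM loss: predict w_t from w_1..w_{t-1} within each stream.

    token_ids: LongTensor of shape (batch, seq_len), e.g. 64 streams of 256 tokens.
    `model` is assumed to apply a causal mask and return (batch, seq_len, vocab) logits.
    """
    logits = model(token_ids)                               # (B, T, V)
    # Shift so position t is trained to predict token t+1; no context is carried
    # over from the previous batch (plain language modeling, no Transformer-XL cache).
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```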

> Differences between our approach and the MLM of Devlin et al. (2018) include the use of text streams of an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences.

> tokens in a text stream are sampled according to a multinomial distribution, whose weights are proportional to the square root of their invert frequencies.
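
A sketch of how this frequency-based subsampling of MLM prediction targets could look in NumPy; `sample_mlm_targets`, the `token_counts` table, and the 15% mask ratio (BERT-style) are assumptions for illustration, not the authors' code.

```python
import numpy as np

def sample_mlm_targets(stream, token_counts, mask_ratio=0.15, rng=None):
    """Pick positions in a token stream to use as MLM prediction targets.

    Weights are proportional to the square root of the inverse token frequency,
    so very frequent tokens (punctuation, stop words) are subsampled.
    stream: sequence of token ids; token_counts: corpus-level count per token id.
    """
    rng = rng or np.random.default_rng()
    counts = np.asarray([token_counts[t] for t in stream], dtype=np.float64)
    weights = 1.0 / np.sqrt(counts)      # sqrt of inverse frequency
    weights /= weights.sum()
    n_targets = max(1, int(mask_ratio * len(stream)))
    positions = rng.choice(len(stream), size=n_targets, replace=False, p=weights)
    return np.sort(positions)            # positions whose tokens will be predicted
```

The selected positions would then be replaced by `[MASK]`, swapped for a random token, or kept unchanged following the 80%/10%/10% scheme of Devlin et al. (2018).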

Cross-lingual Language Models

In this work, we consider cross-lingual language model pretraining with either CLM, MLM, or MLM used in combination with TLM. For the CLM and MLM objectives, we train the model with batches of 64 streams of continuous sentences composed of 256 tokens. At each iteration, a batch is composed of sentences coming from the same language, which is sampled from the distribution $\{q_i\}_{i=1\ldots N}$ above, with α = 0.7. When TLM is used in combination with MLM, we alternate between these two objectives, and sample the language pairs with a similar approach.
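
Putting the training recipe together, the sketch below mimics the batching described above: every step draws one language (or language pair) from the smoothed distribution with α = 0.7 and alternates between MLM and TLM. `mlm_loss`, `tlm_loss`, and the batch iterators are hypothetical helpers standing in for the real model and data pipeline, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain(model, mono_batches, para_batches, lang_probs, pair_probs,
             n_steps, use_tlm=True):
    """Alternate MLM on monolingual streams with TLM on parallel data.

    lang_probs / pair_probs: q_i computed as above with alpha = 0.7.
    mono_batches[i]() and para_batches[j]() are assumed to return one batch of
    64 streams x 256 tokens for language i / language pair j (illustrative).
    """
    for step in range(n_steps):
        if use_tlm and step % 2 == 1:
            j = rng.choice(len(pair_probs), p=pair_probs)   # sample a language pair
            loss = model.tlm_loss(para_batches[j]())        # TLM on a translation pair
        else:
            i = rng.choice(len(lang_probs), p=lang_probs)   # sample a single language
            loss = model.mlm_loss(mono_batches[i]())        # MLM on a monolingual stream
        loss.backward()
        # ... optimizer step and gradient reset omitted
```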