Paper Notes: Unsupervised Machine Translation

Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation

A paper from William Yang Wang's (王威廉) group. I skimmed it and it is quite relevant to my recent research. The paper also gives a brief overview of unsupervised machine translation methods, so this is a good opportunity to get a proper understanding of unsupervised MT. I remember that during my internship at Samsung Research, a senior student from the Institute of Automation, CAS (reportedly a student of Chengqing Zong / 宗成庆) said that 2018 was the breakout year of unsupervised machine translation. I was working on QA at the time and never dug into it. Many approaches in other NLP areas originate from NMT, so it is well worth a look.

Motivation

The pseudo-parallel corpus produced by back-translation pairs each pure target sentence with a pseudo source sentence generated from it, and the pure target sentence is then used as the label for supervised training (the target side is guaranteed to be a pure sentence, while the source side is allowed to be somewhat noisy). This is essentially a reconstruction loss. Its drawback is that the quality of the pseudo source sentences cannot be guaranteed, which leads to error accumulation (the pseudo source sentences are never updated, so the errors they contain are never corrected).
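
As a reminder of the objective being criticized, the back-translation reconstruction loss is roughly the following (in the notation of Lample et al. 2018, where $u^*(y)$ and $v^*(x)$ are the backward translations of a target sentence $y$ and a source sentence $x$):

$$\mathcal{L}^{back}=\mathbb{E}_{y\sim T}\left[-\log P_{s\to t}\big(y \mid u^*(y)\big)\right]+\mathbb{E}_{x\sim S}\left[-\log P_{t\to s}\big(x \mid v^*(x)\big)\right]$$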

Motivated by this, the authors propose a new paradigm: extract-edit.

Selecting Monolingual Corpora

neural-based methods aim to select potential parallel sentences from monolingual corpora in the same domain. However, these neural models need to be trained on a large parallel dataset first, which is not applicable to language pairs with limited supervision. 

That is, a translation model is trained on parallel data and then used to select domain-related sentences from monolingual corpora. This is not fully unsupervised: a limited amount of parallel data is still required to obtain an NMT model before suitable monolingual sentences can be selected.

  • Parallel sentence extraction from comparable corpora with neural network features, LREC 2016

  • Bilingual word embeddings with bucketed CNN for parallel sentence extraction, ACL 2017

  • Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation, COLING 2018

Fully Unsupervised Machine Translation

The main technical protocol of these approaches can be summarized as three steps: 

  • Initialization

  • Language Modeling

  • Back-Translation

Initialization

Given the ill-posed nature of the unsupervised NMT task, a suitable initialization method can help model the natural priors over the mapping of two language spaces we expect to reach.

The purpose of initialization is to use prior knowledge about natural language to model the mapping between the two language spaces.

There are two main initialization methods:

  • bilingual dictionary inference

    • Word translation without parallel data. Conneau, et al. ICLR 2018

    • Unsupervised neural machine translation, ICLR 2018

    • Unsupervised machine translation using monolingual corpora only, ICLR 2018a

  • BPE

    • Phrase-based & neural unsupervised machine translation, Lample et al. EMNLP 2018b

This paper adopts the method of Conneau et al., and, similarly to Lample et al. 2018b, shares the BPE vocabulary between the two languages (the related papers still need a closer read). In essence this step trains word embeddings for both languages: not word2vec-style unsupervised embeddings for a single language, but a shared embedding space for the two languages.

Language Modeling

Train language models on both source and target languages. These models express a data-driven prior about the composition of sentences in each language.

After initialization, language models for the source and target languages are trained separately on top of the shared embeddings.

In NMT, language modeling is accomplished via denoising autoencoding, by minimizing:
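
The formula was not copied into these notes; if I recall the formulation in Lample et al. 2018b correctly, the denoising objective is roughly the following, where $C(\cdot)$ is a noise function that drops and shuffles words:

$$\mathcal{L}^{lm}=\mathbb{E}_{x\sim S}\left[-\log P_{s\to s}\big(x \mid C(x)\big)\right]+\mathbb{E}_{y\sim T}\left[-\log P_{t\to t}\big(y \mid C(y)\big)\right]$$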

The authors follow Lample et al. 2018a here, sharing the encoder and decoder parameters between the two languages (confirmed by the implementation details below).

Back-Translation

  • Dual learning for machine translation, NIPS 2016

  • Improving neural machine translation models with monolingual data. ACL 2016

Extract-Edit

  • Extract: using the sentence representations obtained in the first two steps, select the sentences in the target language space that are closest to the source sentence (by similarity?).

  • Edit: then edit the selected sentences.

The authors also propose a comparative translation loss.

Extract

Because the encoder and decoder are already shared during the language-modeling stage, the same encoder can be used to produce representations for both languages.

From the target language space, select the top-k extracted sentences closest to the source sentence. Why top-k rather than top-1? To ensure recall and to obtain more relevant samples.
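
The paper gives no pseudo-code for this retrieval step; below is a minimal sketch of a top-k search over sentence embeddings by cosine similarity. How the per-sentence embedding is pooled from the encoder states, and the use of cosine as the metric, are my assumptions, not something stated in the notes above.

```python
import numpy as np

def extract_top_k(source_emb, target_embs, k=5):
    """Return the indices of the k target sentences closest to the source.

    source_emb:  (d,)   pooled embedding of the source sentence
    target_embs: (N, d) pooled embeddings of the target-language corpus
    Cosine similarity is used as the distance; this choice is an assumption.
    """
    src = source_emb / (np.linalg.norm(source_emb) + 1e-8)
    tgt = target_embs / (np.linalg.norm(target_embs, axis=1, keepdims=True) + 1e-8)
    sims = tgt @ src                 # (N,) cosine similarities
    return np.argsort(-sims)[:k]     # indices of the top-k candidates
```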

Edit

Put simply: max-pooling + decoding.

employ a maxpooling layer to reserve the more significant features between the source sentence embedding $e_s$ and the extracted sentence embedding $e_t$ ($t\in M$), and then decode it into a new sentence $t’$.

How exactly is this done? It probably requires reading the code.

$e_s$: [es_length, encoder_size]

$e_t$: [et_length, encoder_size]

How can these be max-pooled when the sentence lengths may differ? Presumably the pooling collapses the length dimension, and the result is then decoded into a new sentence.
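
One plausible reading, which is my guess and not verified against the released code: max-pool each embedding over its length dimension to get a fixed-size vector, take the element-wise maximum of the two vectors, and feed the result to the decoder.

```python
import torch

def edit_representation(e_s, e_t):
    """Fuse source and extracted-sentence encodings for the edit step.

    e_s: (len_s, d) encoder states of the source sentence
    e_t: (len_t, d) encoder states of the extracted target sentence
    Pooling over the length dimension removes the length mismatch; the
    element-wise max then keeps the stronger feature from either sentence.
    This is a sketch of one interpretation, not the paper's actual code.
    """
    pooled_s = e_s.max(dim=0).values        # (d,)
    pooled_t = e_t.max(dim=0).values        # (d,)
    fused = torch.maximum(pooled_s, pooled_t)
    return fused                             # decoded into the edited sentence t'
```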

Evaluate

Although M' may contain sentences that are genuinely parallel to the source sentence s, we still cannot treat (s, t') as ground-truth sentence pairs for training the NMT model, because NMT models are very sensitive to noise.

The authors propose an evaluation network R, which is simply a multilayer perceptron (the implementation details below specify two hidden layers of size 512). R is shared by the two languages.

$$r_s=f(W_2f(W_1e_s+b_1)+b_2)$$

$$r_{t'}=f(W_2f(W_1e_{t'}+b_1)+b_2)$$

(My first guess was that this step converts t' into t*, but that was a misunderstanding.) The purpose of R is to map s and t' into the same vector space and then compute the similarity between the two:
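
The similarity itself is not written out in these notes; judging from the discussion below it is the cosine similarity of the two mapped vectors, presumably

$$\alpha(s, t')=\cos\big(r_s, r_{t'}\big)$$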

Next, α is turned into a probability distribution: for the top-k extracted-and-edited target sentences, the probability that each one is similar to the source sentence s, with the k probabilities summing to 1.

Here λ acts as an inverse temperature: the smaller λ is, the more all candidates are treated equally; the larger it is, the more weight falls on the candidate with the largest α. Since α is computed with cosine similarity, this means emphasizing whichever of the k candidates is closest to s.
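
A softmax with inverse temperature λ over the top-k candidates matches this description; my reconstruction of the missing formula is

$$P(t'_i \mid s)=\frac{\exp\big(\lambda\,\alpha(s,t'_i)\big)}{\sum_{j=1}^{k}\exp\big(\lambda\,\alpha(s,t'_j)\big)}$$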

Learning

Comparative Translation

A larger cosine similarity means the two sentences are closer, so a smaller $-\log P$ is better. The parameters involved here are $\theta_{enc}$ and $\theta_R$.

Basically, the translation model is trying to minimize the relative distance of the translated sentence t* to the source sentence s compared to the top-k extracted-and-edited sentences in the target language space. Intuitively, we view the top-k extracted-and-edited sentences as the anchor points to locate a probable region in the target language space, and iteratively improve the source-to-target mapping via the comparative learning scheme.
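
Based on that description, the comparative translation loss should look roughly like the following, where $t^*$ is the model's translation of s and $t'_1,\dots,t'_k$ are the extracted-and-edited anchors (my reconstruction; the paper's exact notation may differ):

$$\mathcal{L}_{comp}=\mathbb{E}_{s\sim S}\left[-\log\frac{\exp\big(\lambda\,\alpha(s,t^*)\big)}{\exp\big(\lambda\,\alpha(s,t^*)\big)+\sum_{i=1}^{k}\exp\big(\lambda\,\alpha(s,t'_i)\big)}\right]$$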

Adversarial Objective

we can view our translation system as a “generator” that learns to generate a good translation with a higher similarity score than the extracted-and-edited sentences, and the evaluation network R as a “discriminator” that learns to rank the extracted-and-edited sentences (real sentences in the target language space) higher than the translated sentences.

Borrowing the idea of adversarial learning, the translation system is viewed as the generator, which learns to produce translated target sentences that score higher than the extracted-and-edited sentences.

The evaluation network R is viewed as the discriminator, whose goal is to rank the extracted-and-edited sentences above the translated target sentences.

Therefore, for the evaluation network R, we have:
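
The formula was not copied here; given the symmetric adversarial roles, R's objective is presumably the mirror image of the comparative loss, something along the lines of (a guess, not the paper's exact loss):

$$\mathcal{L}_{R}=\mathbb{E}_{s\sim S}\left[-\log\frac{\sum_{i=1}^{k}\exp\big(\lambda\,\alpha(s,t'_i)\big)}{\exp\big(\lambda\,\alpha(s,t^*)\big)+\sum_{i=1}^{k}\exp\big(\lambda\,\alpha(s,t'_i)\big)}\right]$$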

Final Adversarial Objective
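
Combining the two sides gives an adversarial min–max game between the translation model (parameters $\theta_{trans}$, my notation) and the evaluation network R; a sketch consistent with the two losses above is

$$\min_{\theta_{trans}}\;\max_{\theta_{R}}\;\mathbb{E}_{s\sim S}\left[\log\frac{\sum_{i=1}^{k}\exp\big(\lambda\,\alpha(s,t'_i)\big)}{\exp\big(\lambda\,\alpha(s,t^*)\big)+\sum_{i=1}^{k}\exp\big(\lambda\,\alpha(s,t'_i)\big)}\right]$$

which would in practice be optimized with alternating updates, as in standard adversarial training.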

Model selection

Unsupervised learning has no parallel data, so a proxy metric is needed to measure how good the model is, i.e., its translation quality.

Basically, we choose the hyper-parameters with the maximum expectation of the ranking scores of all translated sentences.

Implementation details

Initialization

Cross-lingual BPE embeddings; the number of BPE codes is set to 60,000.

The embeddings are then trained with fastText, with dimension 512, a window size of 5, and 10 negative samples.
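
For reference, a minimal sketch of this setup using the fastText Python bindings; the paper most likely used the fastText toolkit directly, and the file names here are placeholders.

```python
import fasttext

# Train skip-gram embeddings on the concatenated, BPE-segmented corpora of
# both languages, matching the reported hyper-parameters: dimension 512,
# window size 5, 10 negative samples. "corpus.bpe.txt" is a placeholder path.
model = fasttext.train_unsupervised(
    "corpus.bpe.txt",
    model="skipgram",
    dim=512,
    ws=5,
    neg=10,
)
model.save_model("crosslingual_bpe_embeddings.bin")
```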

Model structure

all encoder parameters are shared across two languages. Similarly, we share all decoder parameters across two languages.

The λ for calculating ranking scores is 0.5. As for the evaluation network R, we use a multilayer perceptron with two hidden layers of size 512.

Paper Notes: Using Monolingual Data in Machine Translation

Monolingual Data in NMT

Why Monolingual data enhancement

  • Large-scale source-side data: enhancing the encoder network to obtain a high-quality context vector representation of the source sentence.

  • Large-scale target-side data: boosting the fluency of the machine translation output when decoding.

The methods of using monolingual data

Multi-task learning

Target-side language model: Integrating Language Model into the Decoder

shallow fusion

Both an NMT model (trained on parallel corpora) and a recurrent neural network language model (RNNLM, trained on larger monolingual corpora) are pre-trained separately before being integrated.

Shallow fusion: rescore the probability of the candidate words.
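
In shallow fusion (Gulcehre et al. 2015), the two models are combined only at decoding time by adding their log-probabilities, with a weight β on the language model; roughly:

$$\log p(y_t \mid y_{<t}, x)=\log p_{\mathrm{NMT}}(y_t \mid y_{<t}, x)+\beta\,\log p_{\mathrm{LM}}(y_t \mid y_{<t})$$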

deep fusion

multi-task learning

Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning, EMNLP, 2017

Using target-side monolingual data adds a language-modeling training task. In fact, (b) is the method from the previous slide; this paper adds a language-model loss on top of it.

The parameters $\sigma$ are updated when training both tasks, whereas the parameters $\theta$ are updated only when training the translation model.
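
This setup corresponds to a joint objective roughly of the form below (my notation, assuming $\sigma$ are the parameters shared with the language model):

$$\mathcal{L}(\theta,\sigma)=\mathcal{L}_{\mathrm{MT}}(\theta,\sigma)+\mathcal{L}_{\mathrm{LM}}(\sigma)$$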

auto-encoder

The corresponding monolingual data is reconstructed with an autoencoder as an auxiliary task that shares encoder parameters with the NMT model.

Semi-Supervised Learning for Neural Machine Translation, ACL, 2016

Back-translation

What is back-translation?

Synthetic pseudo parallel data from target-side monolingual data using a reverse translation model.
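
A minimal sketch of this data-generation step; `translate_reverse` is a hypothetical placeholder for a target-to-source model, not any specific toolkit's API.

```python
def build_pseudo_parallel(target_sentences, translate_reverse):
    """Create (pseudo-source, real-target) pairs for back-translation.

    target_sentences:  list of real target-language sentences
    translate_reverse: a target->source translation function (hypothetical);
                       in practice a reverse NMT model trained on the
                       available parallel data.
    """
    pseudo_pairs = []
    for tgt in target_sentences:
        pseudo_src = translate_reverse(tgt)     # noisy synthetic source
        pseudo_pairs.append((pseudo_src, tgt))  # the target side stays clean
    return pseudo_pairs

# The pseudo pairs are then mixed with the real parallel data and the
# forward (source -> target) model is trained on the combined corpus.
```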

Why back-translation? What is the motivation?

It mitigates the problem of overfitting and fluency by exploiting additional data in the target language.

The target language must always consist of real sentences for the translation model to produce fluent and accurate output, whereas the source side can tolerate a small amount of awkward wording, wrong word order, or grammatical errors, as long as the sentence remains understandable. The same holds for human translators: translation quality depends on one's proficiency in the output language, not the source language (proficiency in the source language only needs to be enough to understand the sentence).

Different aspects of the BT which influence the performance of translation:

  • Size of the Synthetic Data

  • Direction of Back-Translation

  • Quality of the Synthetic Data

Size of the Synthetic Data

Direction of Back-Translation

Quality of the Synthetic Data

copy mechanism

The authors' experimental setup: pseudo-parallel data is built from target-side monolingual data, partly by directly copying it (using the target sentence as its own source) and partly by back-translating it. In other words, the monolingual data is used twice.

Something about this still feels off to me...

Dummy source sentence

Pseudo-parallel data: a dummy source sentence + target-side mono-data as the target.

The downside:

the network ‘unlearns’ its conditioning on the source context if the ratio of monolingual training instances is too high.

Improving Neural Machine Translation Models with Monolingual Data, Sennrich et al, ACL 2016

Self-learning

Synthetic target sentences from source-side mono-data:

  • Build a baseline machine translation (MT) system on parallel data

  • Translate source-side mono-data into target sentences

  • Real parallel data + pseudo parallel data

References

  1. Improving Neural Machine Translation Models with Monolingual Data, Sennrich et al, ACL 2016

  2. Using Monolingual Data in Neural Machine Translation: a Systematic Study, Burlot et al. ACL 2018

  3. Copied Monolingual Data Improves Low-Resource Neural Machine Translation, Currey et al. WMT 2017

  4. Semi-Supervised Learning for Neural Machine Translation, Cheng et al. ACL 2016

  5. Exploiting Source-side Monolingual Data in Neural Machine Translation, Zhang et al. EMNLP 2016

  6. Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning, Domhan et al. EMNLP 2017

  7. On Using Monolingual Corpora in Neural Machine Translation, Gulcehre et al. 2015

  8. Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation, EMNLP 2018

  9. Understanding Back-Translation at Scale, Edunov et al. EMNLP 2018