Paper Notes - Unsupervised Machine Translation: Extract and Edit

Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation is a paper from William Yang Wang's group. After a quick read, it turns out to be closely related to my own recent research. The paper also gives a brief overview of unsupervised machine translation methods, so this is a good opportunity to study unsupervised MT properly. I remember that during my internship at Samsung Research, a senior PhD student from the Institute of Automation, Chinese Academy of Sciences (reportedly a student of Chengqing Zong) said that 2018 was the breakout year for unsupervised machine translation. I was working on QA at the time, so I never dug into it. Many approaches in other areas of NLP seem to originate from NMT, so it is well worth a look.

Motivation

The pseudo-parallel corpus produced by back-translation starts from a pure target sentence, generates a pseudo source sentence from it, and then uses the pure target sentence as the label for supervised training. This is essentially a reconstruction loss. The drawback is that the quality of the pseudo source sentence cannot be guaranteed, which leads to error accumulation (the pseudo source sentence is never updated, so the errors it contains are never corrected).
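As a minimal sketch of this mechanism (not the paper's code; `backward_translate` is a hypothetical target-to-source model):

```python
from typing import Callable, Iterable, List, Tuple

def make_pseudo_parallel(
    target_sentences: Iterable[str],
    backward_translate: Callable[[str], str],  # hypothetical target->source model
) -> List[Tuple[str, str]]:
    """Build (pseudo-source, target) pairs for supervised training.

    The target side is kept fixed and used as the label, so any noise in the
    pseudo source sentence is never corrected -- the error-accumulation issue
    described above.
    """
    pairs = []
    for t in target_sentences:
        s_pseudo = backward_translate(t)  # quality is not guaranteed
        pairs.append((s_pseudo, t))       # train source->target on this pair
    return pairs
```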

Building on this, the authors propose a new paradigm: extract-and-edit.

Selecting monolingual corpora

neural-based methods aim to select potential parallel sentences from monolingual corpora in the same domain. However, these neural models need to be trained on a large parallel dataset first, which is not applicable to language pairs with limited supervision. 
In other words, a translation model is first trained on parallel data and then used to select domain-related sentences from monolingual corpora. This is not fully unsupervised: a limited amount of parallel data is still required to obtain an NMT model before suitable monolingual sentences can be selected.

  • Parallel sentence extraction from comparable corpora with neural network features, LREC 2016
  • Bilingual word embeddings with bucketed CNN for parallel sentence extraction, ACL 2017
  • Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation, COLING 2018

Fully unsupervised machine translation

  • Unsupervised machine translation using monolingual corpora only, Lample, et al. ICLR 2018a
  • Unsupervised neural machine translation, ICLR 2018
  • Phrase-based & neural unsupervised machine translation, Lample et al. EMNLP 2018b

The main technical protocol of these approaches can be summarized as three steps: 
- Initialization
- Language Modeling
- Back-Translation

Initialization

Given the ill-posed nature of the unsupervised NMT task, a suitable initialization method can help model the natural priors over the mapping of two language spaces we expect to reach.
The purpose of initialization is to use prior knowledge about natural language to model the mapping between the two language spaces.

There are two main initialization methods:
- Bilingual dictionary inference
  - Word translation without parallel data, Conneau et al. ICLR 2018
  - Unsupervised neural machine translation, ICLR 2018
  - Unsupervised machine translation using monolingual corpora only, ICLR 2018a
- BPE (shared subword vocabulary)
  - Phrase-based & neural unsupervised machine translation, Lample et al. EMNLP 2018b

The authors follow the approach of Conneau et al., and, similar to Lample 2018b, share the BPE vocabulary between the two languages (the related papers are worth a closer read). In effect this step trains word embeddings for both languages at once: not word2vec-style unsupervised training on a single language, but a shared embedding space covering both languages.
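As a rough illustration of what a shared embedding means in practice (not the authors' code; the vocabulary size and dimension are taken from the implementation details below, and the random tensor is only a stand-in for pretrained cross-lingual vectors):

```python
import torch
import torch.nn as nn

# Stand-in for fastText vectors trained over the joint BPE vocabulary
# (60k BPE codes and dimension 512, per the implementation details below).
joint_vocab_size, emb_dim = 60_000, 512
pretrained_vectors = torch.randn(joint_vocab_size, emb_dim)

# A single embedding table used by both languages, so subwords from the source
# and target corpora live in one shared vector space from initialization onward.
shared_embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
```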

Language Modeling

Train language models on both source and target languages. These models express a data-driven prior about the composition of sentences in each language.
After initialization, a language model is trained for the source and target languages separately, on top of the shared embeddings.

In NMT, language modeling is accomplished via denoising autoencoding, by minimizing:
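The objective itself is not reproduced in these notes; following Lample et al. 2018b, it is roughly the denoising reconstruction loss below, where \(C(\cdot)\) is a noise function (word drops and local shuffles) and \(P_{s\to s}\), \(P_{t\to t}\) denote the autoencoding directions:

\[\mathcal{L}^{lm} = \mathbb{E}_{x\sim S}\big[-\log P_{s\to s}(x \mid C(x))\big] + \mathbb{E}_{y\sim T}\big[-\log P_{t\to t}(y \mid C(y))\big]\]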

The authors follow Lample 2018a here, sharing the encoder and decoder parameters across the two languages (confirmed in the implementation details below).

Back-Translation

  • Dual learning for machine translation, NIPS 2016
  • Improving neural machine translation models with monolingual data. ACL 2016

Extract-Edit

  • Extract: using the sentence representations obtained in the previous two steps, select from the target language space the sentences closest to the source sentence (closeness measured by some similarity metric).
  • Edit: then edit the selected sentences.

The authors also propose a comparative translation loss.

Extract

Because the encoder and decoder were already shared in the language modeling stage, the representations of both languages can be produced by the same encoder.

From the target language space, select the top-k extracted sentences closest to the source sentence. Why top-k rather than top-1? To ensure recall and obtain more relevant samples.
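A small sketch of the extract step, assuming sentence-level (pooled) encoder vectors and cosine similarity as the closeness measure (the exact metric is an assumption on my part):

```python
import torch
import torch.nn.functional as F

def extract_top_k(src_vec: torch.Tensor, tgt_vecs: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the k target-space sentences closest to the source.

    src_vec:  [hidden]           pooled encoder representation of the source sentence
    tgt_vecs: [num_tgt, hidden]  pooled representations of the target-language corpus
    """
    sims = F.cosine_similarity(src_vec.unsqueeze(0), tgt_vecs, dim=-1)  # [num_tgt]
    return sims.topk(k).indices
```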

Edit

In short: max-pooling + decode.

employ a maxpooling layer to reserve the more significant features between the source sentence embedding \(e_s\) and the extracted sentence embedding \(e_t\) (\(t\in M\)), and then decode it into a new sentence \(t'\).

How exactly is this done? It seems necessary to look at the code.

\(e_s\): [es_length, encoder_size]
\(e_t\): [et_length, encoder_size]

How can max-pooling be applied here (the sentence lengths may differ), and then decoded into a new sentence? One guess is sketched below.
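Since the paper is not explicit here, the following is only one plausible reading: max-pool each sentence over its own time dimension first, then take an element-wise max of the two fixed-size vectors and hand that to the decoder. Treat this as a guess rather than the actual implementation:

```python
import torch

def edit_representation(e_s: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
    """Combine the source and extracted-sentence encodings into one vector.

    e_s: [src_len, hidden], e_t: [tgt_len, hidden]  ->  [hidden]
    Each sequence is max-pooled over its own length, so differing lengths are
    not a problem; the decoder would then generate t' from the combined vector.
    """
    pooled_s = e_s.max(dim=0).values
    pooled_t = e_t.max(dim=0).values
    return torch.maximum(pooled_s, pooled_t)
```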

Evaluate

Although M' may contain potential parallel sentences for the source sentence s, the pairs (s, t') still cannot be used as ground-truth sentence pairs to train the NMT model, because NMT models are very sensitive to noise.

The authors propose an evaluation network R, which is essentially a multilayer perceptron (two hidden layers of size 512, per the implementation details below). R is shared between the two languages.

\[r_s=f(W_2f(W_1e_s+b_1)+b_2)\]

\[r_t=f(W_2f(W_1e_t'+b_1)+b_2)\]

At first I assumed that this is how t' gets turned into t*, but that reading turned out to be wrong.

Its purpose is to map s and t' into the same vector space and then compute their similarity:
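Written out with the symbols above (reconstructed from the surrounding description rather than copied from the paper), the score is the cosine between the two mapped vectors:

\[\alpha = \cos(r_s, r_t)\]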

Next, \(\alpha\) is converted into a probability distribution, i.e., the probability that each of the top-k extracted-and-edited target sentences t* matches the source sentence s, with the probabilities summing to 1.
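One way to write the described distribution, with \(\lambda\) as the inverse temperature mentioned next (again a reconstruction from the surrounding description, not the paper's exact notation):

\[P(t^*_i \mid s) = \frac{\exp\big(\lambda\,\alpha(s, t^*_i)\big)}{\sum_{j=1}^{k}\exp\big(\lambda\,\alpha(s, t^*_j)\big)}\]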

Here \(\lambda\) acts as an inverse temperature: the smaller \(\lambda\) is, the more evenly all t* are treated; the larger it is, the more weight falls on the t* with the largest \(\alpha\). Since \(\alpha\) is computed with cosine similarity, this means focusing on the sentence among the k candidates that is closest to s.
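Putting the evaluate step together as a sketch (the hidden size 512 and \(\lambda = 0.5\) come from the implementation details below; the ReLU activation is an assumption, since f is not specified in these notes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvaluationNetwork(nn.Module):
    """Shared evaluation network R: a two-hidden-layer MLP over sentence embeddings."""

    def __init__(self, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def ranking_probs(self, e_s: torch.Tensor, e_edited: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
        """e_s: [hidden] source embedding; e_edited: [k, hidden] extracted-and-edited embeddings.

        Returns the softmax over cosine similarities scaled by the inverse
        temperature lambda, i.e. the probability distribution described above.
        """
        r_s = self.mlp(e_s)                                         # map s into the shared space
        r_t = self.mlp(e_edited)                                    # map each t* into the same space
        alpha = F.cosine_similarity(r_s.unsqueeze(0), r_t, dim=-1)  # [k]
        return torch.softmax(lam * alpha, dim=0)
```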

Learning

Comparative Translation

A larger cosine similarity means the two are closer, so a smaller \(-\log P\) is better. The parameters involved here are \(\theta_{enc}\) and \(\theta_R\).

Basically, the translation model is trying to minimize the relative distance of the translated sentence t* to the source sentence s compared to the top-k extracted-and-edited sentences in the target language space. Intuitively, we view the top-k extracted-and-edited sentences as the anchor points to locate a probable region in the target language space, and iteratively improve the source-to-target mapping via the comparative learning scheme.
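An illustrative reading of that objective (the exact formula is not reproduced in these notes, so the softmax form and the \(\lambda\) scaling here are assumptions): the model's own translation should outrank the k extracted-and-edited anchors.

```python
import torch
import torch.nn.functional as F

def comparative_translation_loss(
    sim_translation: torch.Tensor,  # scalar: similarity of the model's translation to s
    sim_anchors: torch.Tensor,      # [k]: similarities of the extracted-and-edited sentences to s
    lam: float = 0.5,
) -> torch.Tensor:
    """-log probability that the translation ranks above the anchor sentences."""
    scores = lam * torch.cat([sim_translation.view(1), sim_anchors])
    return -F.log_softmax(scores, dim=0)[0]
```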

Adversarial Objective

> we can view our translation system as a “generator” that learns to generate a good translation with a higher similarity score than the extracted-and-edited sentences, and the evaluation network R as a “discriminator” that learns to rank the extracted-and-edited sentences (real sentences in the target language space) higher than the translated sentences.

Borrowing the idea of adversarial learning, the translation system can be viewed as the generator, which learns to produce translated target sentences that score higher than the extracted-and-edited sentences.

The evaluation network R is viewed as the discriminator, whose goal is to rank the extracted-and-edited sentences above the translated target sentences.

Accordingly, the evaluation network R gets its own ranking objective, and combining it with the translation side yields the final adversarial objective.
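The final objective is not written out here; purely as a conceptual sketch of the alternation described above (all arguments are hypothetical placeholders):

```python
from typing import Callable
import torch

def adversarial_step(
    translator_opt: torch.optim.Optimizer,
    evaluator_opt: torch.optim.Optimizer,
    discriminator_loss_fn: Callable[[], torch.Tensor],  # R: rank extracted-and-edited above translations
    generator_loss_fn: Callable[[], torch.Tensor],      # translator: score its translations above the anchors
) -> None:
    """One round of the generator/discriminator alternation."""
    # Discriminator side: update R so real target-space sentences rank higher.
    evaluator_opt.zero_grad()
    discriminator_loss_fn().backward()
    evaluator_opt.step()

    # Generator side: update the translation system to beat the anchors' scores.
    translator_opt.zero_grad()
    generator_loss_fn().backward()
    translator_opt.step()
```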

Model selection

Since unsupervised learning has no parallel data, a metric is needed to measure how good the model is, i.e., the translation quality.

Basically, we choose the hyper-parameters with the maximum expectation of the ranking scores of all translated sentences.
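One way to write this criterion, where \(r(\cdot)\) is the ranking score from R and \(\hat{t}(s;\theta)\) is the translation produced under hyper-parameters \(\theta\) (the notation is mine, not the paper's):

\[\theta^{*} = \arg\max_{\theta}\ \mathbb{E}_{s\sim S}\big[r\big(\hat{t}(s;\theta)\big)\big]\]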

Implementation details

Initialization

Cross-lingual BPE embeddings, with the number of BPE codes set to 60,000.

The embeddings are then trained with fastText: 512 dimensions, window size 5, and 10 negative samples.
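For reference, this setup would look roughly as follows with the official fastText Python bindings (the file path is illustrative, the corpus is assumed to be BPE-tokenized already, and skip-gram is an assumption):

```python
import fasttext  # pip install fasttext

# Skip-gram embeddings over the BPE-tokenized monolingual corpus, using the
# hyperparameters listed above (dimension 512, window 5, 10 negative samples).
model = fasttext.train_unsupervised(
    "monolingual.bpe.txt",  # illustrative path
    model="skipgram",
    dim=512,
    ws=5,
    neg=10,
)
model.save_model("bpe_embeddings.bin")
```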

Model structure

all encoder parameters are shared across two languages. Similarly, we share all decoder parameters across two languages.

The λ for calculating ranking scores is 0.5. As for the evaluation network R, we use a multilayer perceptron with two hidden layers of size 512.