Extract and Edit: An Alternative to Back-Translation for Unsupervised Neural Machine Translation. A paper from William Yang Wang's group; I skimmed it and it is closely related to my own recent work. It also gives a brief overview of unsupervised machine translation methods, so this is a good opportunity to properly study unsupervised MT. I remember that during my internship at Samsung Research, a senior student from the Institute of Automation, Chinese Academy of Sciences (reportedly a student of Chengqing Zong) said that 2018 was the breakout year of unsupervised machine translation. I was working on QA at the time and never dug into it. Many approaches in other NLP areas originate from NMT, so it is well worth a read.

## Motivation

The pseudo-parallel corpus obtained by back-translation is built by generating a pseudo source sentence from a pure target sentence, and then using that pure target sentence as the label for supervised training. This is essentially a reconstruction loss. Its drawback is that the quality of the pseudo source sentences cannot be guaranteed, which leads to error accumulation: the pseudo source sentences are never updated, so the errors they contain are never corrected.
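
To make this concrete, here is a minimal sketch of how one round of back-translation builds the pseudo-parallel corpus. The stub function, its placeholder output, and the example sentences are mine, not the paper's:

```python
# Minimal sketch of one back-translation round (illustrative stubs, not the paper's code).

def translate_t2s(target_sentence: str) -> str:
    """Stand-in for the current target-to-source model (a trained NMT model in practice)."""
    return "pseudo source for: " + target_sentence  # placeholder output

monolingual_target = ["a pure target sentence", "another pure target sentence"]

pseudo_parallel = []
for y in monolingual_target:
    x_hat = translate_t2s(y)            # pseudo source: generated once, never revised
    pseudo_parallel.append((x_hat, y))  # (x_hat, y) supervises the source-to-target model

# The forward model is then trained with a cross-entropy (reconstruction) loss to map
# x_hat back to y; any error in x_hat stays in the training data, which is the
# error-accumulation problem described above.
```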

### Selecting Sentences from Monolingual Corpora

Neural-based methods aim to select potential parallel sentences from monolingual corpora in the same domain. However, these neural models first need to be trained on a large parallel dataset, which is not applicable to language pairs with limited supervision.

• Parallel sentence extraction from comparable corpora with neural network features, LREC 2016
• Bilingual word embeddings with bucketed CNN for parallel sentence extraction, ACL 2017
• Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation, COLING 2018

### Fully Unsupervised Machine Translation

• Unsupervised machine translation using monolingual corpora only, Lample et al., ICLR 2018a
• Unsupervised neural machine translation, Artetxe et al., ICLR 2018
• Phrase-based & neural unsupervised machine translation, Lample et al., EMNLP 2018b

The main technical protocol of these approaches can be summarized as three steps:
- Initialization
- Language Modeling
- Back-Translation

#### Initialization

Given the ill-posed nature of the unsupervised NMT task, a suitable initialization method helps encode the natural priors we expect over the mapping between the two language spaces.

There are two main initialization methods:
- Bilingual dictionary inference (a toy sketch follows this list)
  - Word translation without parallel data, Conneau et al., ICLR 2018
  - Unsupervised neural machine translation, Artetxe et al., ICLR 2018
  - Unsupervised machine translation using monolingual corpora only, Lample et al., ICLR 2018a
- BPE (a shared sub-word vocabulary learned jointly over both languages)
  - Phrase-based & neural unsupervised machine translation, Lample et al., EMNLP 2018b
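
As a rough illustration of what dictionary inference produces, here is a toy sketch under the assumption that the word embeddings of the two languages have already been mapped into a shared space (the alignment step itself, e.g. adversarial training plus Procrustes refinement in Conneau et al., is the hard part and is omitted here):

```python
import numpy as np

# Toy illustration of bilingual dictionary inference (not the procedure of any paper
# above): once source and target word embeddings live in a shared space, translations
# can be read off as nearest neighbours. Random vectors stand in for real, aligned
# embeddings here, so the output is meaningless except for its structure.
rng = np.random.default_rng(0)
src_vocab = ["chat", "chien", "maison"]
tgt_vocab = ["cat", "dog", "house"]
src_emb = rng.normal(size=(len(src_vocab), 8))
tgt_emb = rng.normal(size=(len(tgt_vocab), 8))

def unit_rows(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

src_emb, tgt_emb = unit_rows(src_emb), unit_rows(tgt_emb)

sims = src_emb @ tgt_emb.T                  # cosine similarity, shape [|src|, |tgt|]
nearest = sims.argmax(axis=1)               # nearest target word for each source word
inferred_dict = {s: tgt_vocab[i] for s, i in zip(src_vocab, nearest)}
print(inferred_dict)
```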

#### Language Modeling

Train language models on both source and target languages. These models express a data-driven prior about the composition of sentences in each language.

In NMT, language modeling is accomplished via denoising autoencoding, i.e. by minimizing a reconstruction loss on sentences corrupted by a noise model.
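
My notes omit the actual formula; in the form used by Lample et al. (2018b) it is roughly

$$\mathcal{L}^{lm} = \mathbb{E}_{x\sim S}\big[-\log P_{s\to s}(x \mid C(x))\big] + \mathbb{E}_{y\sim T}\big[-\log P_{t\to t}(y \mid C(y))\big]$$

where $C(\cdot)$ is a noise model (word drops and local shuffles) and $P_{s\to s}$, $P_{t\to t}$ denote the denoising autoencoders on the source and target sides.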

#### Back-Translation

• Dual learning for machine translation, He et al., NIPS 2016
• Improving neural machine translation models with monolingual data, Sennrich et al., ACL 2016
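
For completeness, the corresponding back-translation objective in Lample et al. (2018b) is, roughly,

$$\mathcal{L}^{back} = \mathbb{E}_{y\sim T}\big[-\log P_{s\to t}(y \mid u^*(y))\big] + \mathbb{E}_{x\sim S}\big[-\log P_{t\to s}(x \mid v^*(x))\big]$$

where $u^*(y)$ is the pseudo source produced for $y$ by the current target-to-source model and $v^*(x)$ is the pseudo target produced for $x$ by the current source-to-target model; gradients are not propagated through the model that generated $u^*$ and $v^*$.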

#### Extract-Edit

• Extract: using the sentence representations obtained from the first two steps, retrieve from the target language space the sentences that are closest to the source sentence (by similarity of their representations, presumably); a retrieval sketch follows this list.
• Edit: then edit the retrieved sentences (see the Edit subsection below).
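
A minimal sketch of how the extract step could be realized. This is my reading; the function name, the pooling to a single vector, and the cosine scoring are assumptions rather than details confirmed by the paper:

```python
import numpy as np

def top_k_extract(src_vec, tgt_vecs, k=5):
    """Return indices of the k target sentences closest to the source sentence.

    src_vec:  [encoder_size]               pooled embedding of the source sentence
    tgt_vecs: [num_targets, encoder_size]  pooled embeddings of the target corpus
    """
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = tgt @ src                   # cosine similarity of each target sentence to s
    return np.argsort(-sims)[:k]       # indices of the top-k extracted sentences

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
print(top_k_extract(rng.normal(size=64), rng.normal(size=(1000, 64)), k=5))
```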

##### Edit

Employ a max-pooling layer to retain the more significant features between the source sentence embedding $e_s$ and the extracted sentence embedding $e_t$ ($t\in M$), and then decode the pooled representation into a new sentence $t'$.

$e_s$: [es_length, encoder_size]
$e_t$: [et_length, encoder_size]
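
Since $e_s$ and $e_t$ have different lengths, the exact pooling is not spelled out in my notes; one plausible reading is to pool each sequence over time and then take the element-wise maximum of the two resulting vectors as the decoder's conditioning input. A sketch under that assumption:

```python
import numpy as np

def edit_representation(e_s, e_t):
    """Combine source and extracted-sentence features with max-pooling.

    e_s: [es_length, encoder_size]  per-token embedding of the source sentence
    e_t: [et_length, encoder_size]  per-token embedding of the extracted sentence

    This is a guess at the mechanism: pool each sequence over time, then take the
    element-wise maximum, giving one [encoder_size] vector for the decoder to
    condition on when generating the edited sentence t'.
    """
    pooled_s = e_s.max(axis=0)              # [encoder_size]
    pooled_t = e_t.max(axis=0)              # [encoder_size]
    return np.maximum(pooled_s, pooled_t)   # element-wise max of the two

rng = np.random.default_rng(0)
print(edit_representation(rng.normal(size=(7, 64)), rng.normal(size=(9, 64))).shape)
```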

##### Evaluate

$r_s=f(W_2f(W_1e_s+b_1)+b_2)$

$r_t=f(W_2f(W_1e_{t'}+b_1)+b_2)$
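
Both equations are the same MLP applied with shared weights to $e_s$ and $e_{t'}$ (matching the two hidden layers of size 512 mentioned in the implementation details below). A numpy sketch; the input size and the choice of ReLU for $f$ are assumptions:

```python
import numpy as np

def f(x):
    # Non-linearity; not pinned down in my notes, ReLU assumed.
    return np.maximum(x, 0.0)

class EvaluationNetwork:
    """Shared MLP R that maps a sentence embedding to a ranking representation."""

    def __init__(self, encoder_size=512, hidden=512, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(hidden, encoder_size))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.02, size=(hidden, hidden))
        self.b2 = np.zeros(hidden)

    def __call__(self, e):
        # r = f(W2 f(W1 e + b1) + b2), same weights for e_s and e_{t'}
        return f(self.W2 @ f(self.W1 @ e + self.b1) + self.b2)

R = EvaluationNetwork()
rng = np.random.default_rng(1)
e_s, e_t_prime = rng.normal(size=512), rng.normal(size=512)
r_s, r_t = R(e_s), R(e_t_prime)
cosine = float(r_s @ r_t / (np.linalg.norm(r_s) * np.linalg.norm(r_t)))
print(cosine)
```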

##### Learning

Comparative Translation

The larger the cosine similarity, the closer the two sentences are, so a smaller $-\log P$ is better. The parameters involved here are $\theta_{enc}$ and $\theta_R$.

Basically, the translation model is trying to minimize the relative distance of the translated sentence t* to the source sentence s compared to the top-k extracted-and-edited sentences in the target language space. Intuitively, we view the top-k extracted-and-edited sentences as the anchor points to locate a probable region in the target language space, and iteratively improve the source-to-target mapping via the comparative learning scheme.
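
One way to write this objective that is consistent with the description above (my reconstruction, not copied from the paper): let $S(s, t)$ denote the ranking score between $s$ and $t$ (the $\lambda$-weighted cosine similarity computed from $r_s$ and $r_t$); then

$$P(t^* \mid s) = \frac{\exp\big(S(s, t^*)\big)}{\exp\big(S(s, t^*)\big) + \sum_{i=1}^{k} \exp\big(S(s, t'_i)\big)}, \qquad \mathcal{L}_{comp} = -\log P(t^* \mid s),$$

minimized so that the translated sentence $t^*$ scores higher than the $k$ extracted-and-edited sentences $t'_1, \dots, t'_k$.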

Adversarial Objective:

> We can view our translation system as a "generator" that learns to generate a good translation with a higher similarity score than the extracted-and-edited sentences, and the evaluation network R as a "discriminator" that learns to rank the extracted-and-edited sentences (real sentences in the target language space) higher than the translated sentences.

##### Model selection

Basically, we choose the hyper-parameters with the maximum expectation of the ranking scores of all translated sentences.
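
Written out (my paraphrase of the sentence above), the selection criterion is roughly

$$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{s\sim S}\big[\,\mathrm{score}\big(s, \hat{t}_{\theta}(s)\big)\,\big]$$

where $\mathrm{score}(\cdot,\cdot)$ is the ranking score computed with the evaluation network R; this acts as a validation signal when no parallel validation set is available.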

### Implementation details

#### Initialization

Cross-lingual BPE embeddings; the number of BPE codes is set to 60,000.

#### Model structure

All encoder parameters are shared across the two languages; similarly, all decoder parameters are shared across the two languages.

The λ for calculating ranking scores is 0.5. As for the evaluation network R, we use a multilayer perceptron with two hidden layers of size 512.