Paper Notes - Pretrained Language Models 2 - ULMFiT

Motivation

A comparison with several earlier approaches.

concatenate embeddings: ELMo

> Recent approaches that concatenate embeddings derived from other tasks with the input at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018) still train the main task model from scratch and treat pretrained embeddings as fixed parameters, limiting their usefulness.
This paper came after ELMo. Although ELMo is better known and more influential, it is still only a way of pretraining word embeddings; in downstream tasks the model must still be trained from scratch.

ELMo involves the following steps:
- pretrain on an LM task
- fine-tune the LM on the target-domain corpus
- finally, concatenate the embeddings for the target task and train the task model

pretraining LM:

> In light of the benefits of pretraining (Erhan et al., 2010), we should be able to do better than randomly initializing the remaining parameters of our models. However, inductive transfer via finetuning has been unsuccessful for NLP (Mou et al., 2016). Dai and Le (2015) first proposed finetuning a language model (LM) but require millions of in-domain documents to achieve good performance, which severely limits its applicability.
Even a language model pretrained on a general-domain corpus and transferred via fine-tuning still needs a large number of in-domain documents to reach good performance.

ULMFiT

> We show that not the idea of LM fine-tuning but our lack of knowledge of how to train them effectively has been hindering wider adoption. LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier. Compared to CV, NLP models are typically more shallow and thus require different fine-tuning methods.
The authors argue that pretraining language models is not a bad idea in itself; rather, a lack of knowledge about how to train them effectively has limited their performance. Compared to CV, NLP models are typically shallower, and fine-tuning LMs on small datasets causes severe catastrophic forgetting.

The authors therefore propose Universal Language Model Fine-tuning (ULMFiT):
- a universal method for fine-tuning language models
- discriminative fine-tuning and slanted triangular learning rates
- gradual unfreezing

Universal Language Model Fine-tuning

It consists of 3 stages:
- General-domain LM pretraining
- Target task LM fine-tuning
- Target task classifier fine-tuning

General-domain LM pretraining

> Wikitext-103 (Merity et al., 2017b) consisting of 28,595 preprocessed Wikipedia articles and 103 million words.
Pretrain on a sufficiently large general-domain corpus.

Target task LM fine-tuning

discriminative fine-tuning

Fine-tune on the in-domain target corpus. This stage converges quickly, and it still generalizes well on small datasets.

> As different layers capture different types of information (Yosinski et al., 2014), they should be fine-tuned to different extents.

Different layers capture different types of information, so the authors propose discriminative fine-tuning: each layer is tuned with its own learning rate. With L the total number of layers, the parameters and learning rates are split per layer: \[\{\theta^1,\theta^2, ..., \theta^L\}\] \[\{\eta^1,\eta^2, ..., \eta^L\}\]

> Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates.

The standard SGD update is: \[\theta_t = \theta_{t-1}-\eta\cdot\nabla_{\theta}J(\theta)\]

With discriminative fine-tuning it becomes: \[\theta_t^l = \theta_{t-1}^l-\eta^l\cdot\nabla_{\theta^l}J(\theta)\]

Empirically, the authors found it works well to first choose the learning rate of the last layer \(\eta^L\), and then derive each lower layer's rate as \(\eta^{l-1}=\eta^l/2.6\).
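This rule can be sketched in a few lines of plain Python (the function name and return convention are mine, not from the paper):

```python
def discriminative_lrs(eta_last, num_layers, factor=2.6):
    """Per-layer learning rates for discriminative fine-tuning.

    Start from the last layer's rate eta_last (eta^L) and divide by
    `factor` (2.6 in the paper) for each layer below it:
    eta^{l-1} = eta^l / factor.
    Returns rates ordered from the first (lowest) layer to the last.
    """
    rates = [eta_last]
    for _ in range(num_layers - 1):
        rates.append(rates[-1] / factor)
    return list(reversed(rates))

lrs = discriminative_lrs(eta_last=0.01, num_layers=4)
```

In a framework like PyTorch this would correspond to passing one parameter group per layer to the optimizer, each with its own `lr`.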

Slanted triangular learning rates

The learning rate first increases linearly and then decays linearly:
\[cut = \lfloor T\cdot cut\_frac\rfloor\]
\[p = \begin{cases} t/cut, & t < cut\\ 1-\dfrac{t-cut}{cut\cdot(1/cut\_frac-1)}, & \text{otherwise}\end{cases}\]
\[\eta_t = \eta_{max}\cdot\dfrac{1+p\cdot(ratio-1)}{ratio}\]

- T is the total number of training iterations, i.e. \(epochs \times \text{number of updates per epoch}\)
- cut_frac is the fraction of iterations during which the learning rate increases
- cut is the iteration at which the learning rate switches from increasing to decreasing
- p is a piecewise quantity that first increases and then decreases
- ratio specifies how much smaller the lowest learning rate is than the maximum. For example, at t=0 we have p=0, so \(\eta_0=\dfrac{\eta_{max}}{ratio}\)

The authors found empirically that cut_frac=0.1, ratio=32, and \(\eta_{max}=0.01\) work well.
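The schedule can be written out directly (a minimal sketch; the function name is mine):

```python
import math

def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t (0-based).

    The rate increases linearly for the first cut = floor(T * cut_frac)
    iterations, then decays linearly back down to eta_max / ratio.
    """
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio
```

With T=1000 the rate climbs from \(\eta_{max}/32\) at t=0 to \(\eta_{max}\) at t=100, then decays back over the remaining 900 iterations.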

Target task classifier fine-tuning

For the classification task, two additional linear blocks are added on top of the language model.

concat pooling: the input to the first linear block is the last hidden state \(h_T\) concatenated with the max-pooled and mean-pooled hidden states over time, \(h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)]\).
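A minimal pure-Python sketch of concat pooling, with hidden states as plain lists of floats (a real implementation would operate on tensors):

```python
def concat_pool(hidden_states):
    """Concat pooling: h_c = [h_T, maxpool(H), meanpool(H)].

    `hidden_states` is the list H = [h_1, ..., h_T] of hidden state
    vectors; the result triples the dimension fed to the classifier.
    """
    dim = len(hidden_states[0])
    h_last = hidden_states[-1]
    h_max = [max(h[i] for h in hidden_states) for i in range(dim)]
    h_mean = [sum(h[i] for h in hidden_states) / len(hidden_states)
              for i in range(dim)]
    return h_last + h_max + h_mean
```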

gradual unfreezing

Unfreeze the layers gradually:
> We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we finetune all layers until convergence at the last iteration.
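The quoted procedure can be sketched as a schedule of trainable layers per epoch (the indexing and the repeated final epoch standing in for "until convergence" are my own simplification):

```python
def gradual_unfreeze_schedule(num_layers, num_epochs):
    """Trainable layer indices per epoch under gradual unfreezing.

    Epoch 0 fine-tunes only the last layer; each subsequent epoch
    unfreezes one more layer below it, until the whole model is
    being fine-tuned. Layers are indexed 0 (first) .. num_layers-1.
    """
    schedule = []
    for epoch in range(num_epochs):
        first_unfrozen = max(0, num_layers - 1 - epoch)
        schedule.append(list(range(first_unfrozen, num_layers)))
    return schedule
```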

BPTT for Text Classification

backpropagation through time (BPTT)

> We divide the document into fixed length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences (Merity et al., 2017a).

What does this mean? It is not gradient accumulation (collecting gradients over several batches before one update); rather, it is truncated BPTT: the hidden state, but not the gradient, is carried across chunk boundaries, and gradients flow back only through the batches whose hidden states contribute to the final prediction. This makes fine-tuning on very long documents feasible.
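A schematic of this chunked forward pass (the `rnn_step` below is a toy stand-in for the real AWD-LSTM; only the state-carrying structure matters here):

```python
def bpt3c_forward(tokens, b, rnn_step, init_state):
    """Schematic BPTT-for-text-classification forward pass.

    The document is split into chunks of size b; each chunk is
    initialized with the final state of the previous one, and all
    hidden states are kept so the classifier head can apply
    mean/max pooling. In training, gradients would be truncated
    at chunk boundaries.
    """
    state = init_state
    hidden_states = []
    for i in range(0, len(tokens), b):
        for tok in tokens[i:i + b]:
            state = rnn_step(state, tok)
            hidden_states.append(state)
        # here the graph would be detached: the state carries over,
        # the gradient does not
    return hidden_states

# toy usage: the "state" is just a running sum
hs = bpt3c_forward([1, 2, 3, 4, 5], b=2,
                   rnn_step=lambda s, x: s + x, init_state=0)
```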

Bidirectional language model

The forward LM and backward LM are fine-tuned independently, and their classifier predictions are averaged.

Experiments

Comparison with other models.

Ablations

"from scratch": 没有 fine-tune
"supervised": 表示仅仅在 label examples 进行 fine-tune
"semi-supervised": 表示在 unable examples 上也进行了 fine-tune

Analysis of the individual tricks:

"full" :fine-tuning the full model
"discr": discriminative fine-tuning
"stlr": slanted triangular learning rates