## Motivation

### Comparison with earlier approaches

#### concatenate embeddings: ELMo

Recent approaches that concatenate embeddings derived from other tasks with the input at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018) still train the main task model from scratch and treat pretrained embeddings as fixed parameters, limiting their usefulness.

ELMo proceeds in three steps:
- pretrain on a language-modeling (LM) task
- fine-tune the LM on target-domain text
- for the target task, concatenate the pretrained embeddings with the input, then train the task model

#### pretraining LM

In light of the benefits of pretraining (Erhan et al., 2010), we should be able to do better than randomly initializing the remaining parameters of our models. However, inductive transfer via finetuning has been unsuccessful for NLP (Mou et al., 2016). Dai and Le (2015) first proposed finetuning a language model (LM) but require millions of in-domain documents to achieve good performance, which severely limits its applicability.

#### ULMFiT

We show that not the idea of LM fine-tuning but our lack of knowledge of how to train them effectively has been hindering wider adoption. LMs overfit to small datasets and suffer catastrophic forgetting when fine-tuned with a classifier. Compared to CV, NLP models are typically more shallow and thus require different fine-tuning methods.

- a universal language model fine-tuning procedure
- discriminative fine-tuning and slanted triangular learning rates

## Universal Language Model Fine-tuning

- General-domain LM pretraining
- Target task LM fine-tuning
- Target task classifier fine-tuning

### General-domain LM pretraining

The general-domain corpus is Wikitext-103 (Merity et al., 2017b), consisting of 28,595 preprocessed Wikipedia articles and 103 million words.

#### discriminative fine-tuning

As different layers capture different types of information (Yosinski et al., 2014), they should be fine-tuned to different extents.

Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates.
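A minimal sketch of how the per-layer rates could be computed, assuming the paper's suggested decay factor of 2.6 between adjacent layers ($\eta^{l-1} = \eta^l / 2.6$); the function name and layer indexing are illustrative, not from the paper:

```python
def discriminative_lrs(base_lr, n_layers, decay=2.6):
    """Return one learning rate per layer, highest for the top layer.

    Layer 0 is the bottom (input) layer; layer n_layers - 1 is the top.
    Each step down the stack divides the rate by `decay` (2.6 in the paper).
    """
    return [base_lr / (decay ** (n_layers - 1 - l)) for l in range(n_layers)]

lrs = discriminative_lrs(base_lr=0.01, n_layers=3)
# the top layer keeps base_lr; lower layers get progressively smaller rates
```

In a framework like PyTorch this list would typically be passed as per-parameter-group learning rates in the optimizer.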

#### Slanted triangular learning rates

- $T$ is the total number of training iterations, i.e. $epochs \times \text{batches per epoch}$
- $cut\_frac$ is the fraction of iterations during which the learning rate increases
- $cut$ is the iteration at which the learning rate switches from increasing to decreasing
- $p$ is a piecewise function that rises linearly up to $cut$, then decays linearly
- $ratio$ specifies how much smaller the minimum learning rate is than the maximum. For example, at $t=0$, $p=0$, so $\eta_0=\dfrac{\eta_{max}}{ratio}$
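The definitions above can be put together as a small schedule function (a direct transcription of the paper's STLR formulas; the default values follow the paper's suggestions):

```python
def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t (0-indexed).

    Rises linearly from eta_max / ratio to eta_max over the first
    cut = T * cut_frac iterations, then decays linearly back.
    """
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                   # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decreasing phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

# e.g. with T=1000: the rate peaks at eta_max when t == cut == 100
```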

#### gradual unfreezing

> We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we finetune all layers until convergence at the last iteration.
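The unfreezing schedule quoted above can be sketched as follows; the "layers" are just indices standing in for real model layers, and the function name is my own:

```python
def gradual_unfreeze_schedule(n_layers):
    """Yield, per epoch, the set of layer indices that are trainable.

    Epoch 0 trains only the top layer; each following epoch additionally
    unfreezes the next lower layer, until all layers are trainable.
    """
    for epoch in range(n_layers):
        yield set(range(n_layers - 1 - epoch, n_layers))

schedule = list(gradual_unfreeze_schedule(4))
# epoch 0 trains {3}; epoch 1 trains {2, 3}; ...; epoch 3 trains all layers
```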

#### BPTT for Text Classification

The paper adapts backpropagation through time (BPTT) to text classification:

We divide the document into fixed length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences (Merity et al., 2017a).
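A minimal sketch of the chunking and hidden-state carryover described above; the `rnn_step` callable is a stand-in for the real recurrent model (the paper uses an AWD-LSTM), and gradient truncation is not shown:

```python
def run_bpt3c(tokens, chunk_size, rnn_step, h0):
    """Process a document in fixed-length chunks, carrying hidden state.

    Each chunk starts from the final hidden state of the previous chunk;
    the per-chunk final states are collected for mean- and max-pooling.
    """
    h = h0
    pooled_states = []
    for i in range(0, len(tokens), chunk_size):
        for tok in tokens[i:i + chunk_size]:
            h = rnn_step(h, tok)
        pooled_states.append(h)  # kept for mean/max-pooling at the end
    return h, pooled_states

# toy usage: an "RNN" that just sums token values
final, states = run_bpt3c([1, 2, 3, 4], chunk_size=2,
                          rnn_step=lambda h, t: h + t, h0=0)
```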

## Experiments

### Ablations

"from scratch": 没有 fine-tune
"supervised": 表示仅仅在 label examples 进行 fine-tune
"semi-supervised": 表示在 unable examples 上也进行了 fine-tune

### Analysis of the fine-tuning tricks

- "full": fine-tuning the full model
- "discr": discriminative fine-tuning
- "stlr": slanted triangular learning rates