Recent approaches that concatenate embeddings derived from other tasks with the input at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018) still train the main task model from scratch and treat pretrained embeddings as fixed parameters, limiting their usefulness.
This paper comes after ELMo. Although ELMo is comparatively well known and more influential, it still only pretrains word representations; the downstream task model is still trained from scratch.
ELMo is used in the following steps:
Pretrain a language model (LM)
Fine-tune the LM on the target-domain corpus
Finally, for the target task, concatenate the pretrained embeddings with the task model's input and train the task model from scratch (a minimal sketch follows below)
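Below is a minimal sketch, not the authors' code, of this "concatenate frozen embeddings and train the task model from scratch" pattern, assuming a hypothetical `ConcatEmbeddingClassifier` in PyTorch. For simplicity the pretrained vectors are a static lookup table; real ELMo embeddings are contextual outputs of a biLM, but the frozen-feature idea being criticized is the same.

```python
import torch
import torch.nn as nn

class ConcatEmbeddingClassifier(nn.Module):
    """Hypothetical task model: frozen pretrained embeddings are
    concatenated with trainable task embeddings at the input."""
    def __init__(self, pretrained_vectors, vocab_size, task_dim=128, hidden=256, n_classes=2):
        super().__init__()
        # Pretrained embeddings are treated as fixed parameters (frozen).
        self.pretrained = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        # Task-specific embeddings and the rest of the model are trained from scratch.
        self.task_emb = nn.Embedding(vocab_size, task_dim)
        in_dim = pretrained_vectors.size(1) + task_dim
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):  # (batch, seq_len)
        x = torch.cat([self.pretrained(token_ids), self.task_emb(token_ids)], dim=-1)
        _, (h_n, _) = self.encoder(x)
        return self.head(h_n[-1])
```

The point relevant to the paper's argument: only `task_emb`, the LSTM and the head are updated; the pretrained table stays fixed, so the bulk of the task model still starts from random initialization.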
pretraining LM:
In light of the benefits of pretraining (Erhan et al., 2010), we should be able to do better than randomly initializing the remaining parameters of our models. However, inductive transfer via finetuning has been unsuccessful for NLP (Mou et al., 2016). Dai and Le (2015) first proposed finetuning a language model (LM) but require millions of in-domain documents to achieve good performance, which severely limits its applicability.
We show that not the idea of LM fine-tuning but our lack of knowledge of how to train them effectively has been hindering wider adoption. LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier. Compared to CV, NLP models are typically more shallow and thus require different fine-tuning methods.
Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates.
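A minimal sketch of discriminative fine-tuning via PyTorch optimizer parameter groups. The paper sets the learning rate of each lower layer to that of the layer above divided by 2.6 ($\eta^{l-1} = \eta^{l}/2.6$); `model.layers` is an assumed attribute exposing the layers in bottom-up order.

```python
import torch

def discriminative_param_groups(model, eta_max=0.01, decay=2.6):
    """One optimizer parameter group per layer, with the learning rate
    shrinking by `decay` for each layer further from the output."""
    n = len(model.layers)  # assumed: ordered list of layers, lowest first
    return [
        {"params": layer.parameters(), "lr": eta_max / decay ** (n - 1 - i)}
        for i, layer in enumerate(model.layers)
    ]

# optimizer = torch.optim.SGD(discriminative_param_groups(model), lr=0.01)
```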
T is the total number of training iterations, i.e. $epochs \times \text{number of updates per epoch}$
cut_frac is the fraction of iterations during which the learning rate increases
cut is the iteration at which the schedule switches from increasing to decreasing the learning rate
p is a piecewise-linear term that first increases and then decreases
ratio specifies how much smaller the lowest learning rate is than the maximum one. For example, at t = 0 we have p = 0, so $\eta_0=\dfrac{\eta_{max}}{ratio}$
The authors found empirically that cut_frac = 0.1, ratio = 32 and $\eta_{max}=0.01$ work well
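These parameters define the slanted triangular learning rate from the ULMFiT paper: $cut = \lfloor T \cdot cut\_frac \rfloor$, $p = t/cut$ if $t < cut$, otherwise $p = 1 - \dfrac{t - cut}{cut \cdot (1/cut\_frac - 1)}$, and $\eta_t = \eta_{max} \cdot \dfrac{1 + p\,(ratio - 1)}{ratio}$. A small sketch of the schedule as a function of the iteration counter t:

```python
import math

def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Learning rate at iteration t: a short linear warm-up over the first
    cut_frac * T iterations, then a long linear decay back to eta_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio
```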
Target task classifier fine-tuning
For classification, two additional linear blocks are added on top of the pretrained LM.
concat pooling: the hidden state at the last time step $h_T$ is concatenated with the max-pooled and mean-pooled hidden states, $h_c = [h_T, \text{maxpool}(H), \text{meanpool}(H)]$
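A minimal sketch of concat pooling over a batch of hidden states (tensor shapes are assumptions noted in the comments):

```python
import torch

def concat_pool(H):
    """H: hidden states of shape (batch, seq_len, hidden_dim).
    Returns [h_T, maxpool(H), meanpool(H)] of shape (batch, 3 * hidden_dim)."""
    last = H[:, -1, :]
    return torch.cat([last, H.max(dim=1).values, H.mean(dim=1)], dim=-1)
```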
gradual unfreezing
Unfreeze the layers gradually:
We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we finetune all layers until convergence at the last iteration.
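A minimal sketch of that schedule, assuming the model exposes an ordered list `model.layers` (lowest first), a `model.classifier` head, and a caller-supplied `train_one_epoch` function (all hypothetical names):

```python
def gradually_unfreeze_and_train(model, train_one_epoch):
    """Unfreeze from the top down, fine-tuning all unfrozen layers for
    one epoch after each unfreezing step."""
    blocks = list(model.layers) + [model.classifier]
    for block in blocks:                      # start with everything frozen
        for p in block.parameters():
            p.requires_grad = False
    for k in range(1, len(blocks) + 1):       # unfreeze one more block per epoch
        for p in blocks[-k].parameters():
            p.requires_grad = True
        train_one_epoch(model)                # fine-tunes all currently unfrozen layers
```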
BPTT for Text Classification
backpropagation through time (BPTT)
We divide the document into fixed length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences (Merity et al., 2017a).
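A simplified sketch of this chunked forward pass, assuming a hypothetical `model` whose `encoder(x, hidden)` returns `(outputs, hidden)` and whose `head` is applied to the concat-pooled outputs. For clarity the hidden state is detached at chunk boundaries, so gradients reach each chunk only through its own pooled outputs; the paper additionally uses variable-length backpropagation sequences.

```python
import torch

def classify_long_document(model, token_ids, bptt_len=70):
    """token_ids: (batch, doc_len). Process the document in fixed-length
    chunks, carrying the hidden state forward, then pool over all outputs."""
    hidden = None
    outputs = []
    for start in range(0, token_ids.size(1), bptt_len):
        chunk = token_ids[:, start:start + bptt_len]
        out, hidden = model.encoder(chunk, hidden)
        hidden = tuple(h.detach() for h in hidden)   # truncate the graph at chunk boundaries
        outputs.append(out)
    H = torch.cat(outputs, dim=1)                    # (batch, doc_len, hidden_dim)
    return model.head(concat_pool(H))                # concat_pool as sketched above
```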
For GPT: if it were made bidirectional, the model could see exactly which word comes next in the sentence and predict it with 100% accuracy. For example, in "I love to work on NLP.", when predicting the word after "love" the model could already see "to", so it would quickly learn that the word after "love" is simply "to". As a result, the model would not learn what we actually want it to learn (syntactic and semantic information).
So how does BERT handle bidirectionality? It changes the form of the language-modeling task, proposing two training objectives: the masked language model and next sentence prediction. Before introducing these two objectives, let us first look at the input format.
If the model had been trained on only predicting ‘<MASK>’ tokens and then never saw this token during fine-tuning, it would have thought that there was no need to predict anything and this would have hampered performance. Furthermore, the model would have only learned a contextual representation of the ‘<MASK>’ token and this would have made it learn slowly (since only 15% of the input tokens are masked). By sometimes asking it to predict a word in a position that did not have a ‘<MASK>’ token, the model needed to learn a contextual representation of all the words in the input sentence, just in case it was asked to predict them afterwards.
Well, ideally we want the model’s representation of the masked token to be better than random. By sometimes keeping the sentence intact (while still asking the model to predict the chosen token) the authors biased the model to learn a meaningful representation of the masked tokens.
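A minimal sketch of this 80/10/10 masking strategy (the 15% selection rate and the split are from the BERT paper; `vocab` and the plain token-level representation are simplifying assumptions):

```python
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15):
    """Pick ~15% of positions as prediction targets. Of those, replace 80%
    with [MASK], 10% with a random token, and leave 10% unchanged."""
    corrupted = list(tokens)
    targets = [None] * len(tokens)           # None = not a prediction target
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok                 # the model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token in place
    return corrupted, targets
```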
A language model is formalized as a probability distribution over a sequence of strings (words), and traditional methods usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing (Chen and Goodman 1998). The count-based models are simple to train, but probabilities of rare n-grams can be poorly estimated due to data sparsity (despite smoothing techniques).
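For concreteness, a tiny count-based example, using add-k smoothing purely for illustration (Chen and Goodman study more sophisticated smoothing methods such as Kneser-Ney):

```python
from collections import Counter

def train_bigram_lm(tokens, vocab_size, k=1.0):
    """Bigram LM (first-order Markov assumption) with add-k smoothing:
    P(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * vocab_size)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    def prob(w2, w1):
        return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * vocab_size)
    return prob

# p = train_bigram_lm("the cat sat on the mat".split(), vocab_size=5)
# p("cat", "the")  # smoothed estimate of P(cat | the)
```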
While NLMs have been shown to outperform count-based n-gram language models (Mikolov et al. 2011), they are blind to subword information (e.g. morphemes). For example, they do not know, a priori, that eventful, eventfully, uneventful, and uneventfully should have structurally related embeddings in the vector space. Embeddings of rare words can thus be poorly estimated, leading to high perplexities for rare words (and words surrounding them). This is especially problematic in morphologically rich languages with long-tailed frequency distributions or domains with dynamic vocabularies (e.g. social media).
Neural language models embed words into a low-dimensional vector space in which semantically similar words end up close to each other. However, approaches in the style of Mikolov's word2vec cannot exploit subword information, e.g. the different morphological forms of a word, nor do they recognize prefixes. This inevitably leads to poorly estimated vectors for rare words, and hence high perplexity on them. This is a problem for morphologically rich languages, and the same issue arises with dynamic vocabularies (e.g. social media).
The final model uses L = 2 biLSTM layers with 4096 units and 512 dimension projections and a residual connection from the first to second layer. The context insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers and a linear projection down to a 512 representation.
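A condensed PyTorch sketch of the context-insensitive, character-level part of this architecture (character n-gram convolutions, highway layers, projection down to 512 dims). The filter widths and counts below are illustrative placeholders rather than ELMo's exact 2048-filter configuration, and the 2-layer biLSTM with projections and a residual connection is omitted.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway layer: y = g * f(W_t x) + (1 - g) * x, with a learned gate g."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1 - g) * x

class CharTokenEncoder(nn.Module):
    """Context-insensitive token representation: character embeddings,
    n-gram convolutions max-pooled over character positions, two highway
    layers, and a linear projection down to `out_dim`."""
    def __init__(self, n_chars=262, char_dim=16, out_dim=512,
                 filters=((1, 32), (2, 64), (3, 128), (4, 256))):  # illustrative (width, count) pairs
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(nn.Conv1d(char_dim, n, w) for w, n in filters)
        total = sum(n for _, n in filters)
        self.highways = nn.Sequential(Highway(total), Highway(total))
        self.proj = nn.Linear(total, out_dim)

    def forward(self, char_ids):  # (batch, n_tokens, max_chars_per_token)
        b, t, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * t, c)).transpose(1, 2)        # (b*t, char_dim, c)
        feats = [torch.tanh(conv(x)).max(dim=-1).values for conv in self.convs]
        h = self.highways(torch.cat(feats, dim=-1))
        return self.proj(h).view(b, t, -1)                                # (batch, n_tokens, out_dim)
```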