# Character Embedding

## Motivation

A language model is formalized as a probability distribution over a sequence of strings (words), and traditional methods usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing (Chen and Goodman 1998). The count-based models are simple to train, but probabilities of rare n-grams can be poorly estimated due to data sparsity (despite smoothing techniques).

While neural language models (NLMs) have been shown to outperform count-based n-gram language models (Mikolov et al. 2011), they are blind to subword information (e.g. morphemes). For example, they do not know, a priori, that eventful, eventfully, uneventful, and uneventfully should have structurally related embeddings in the vector space. Embeddings of rare words can thus be poorly estimated, leading to high perplexities for rare words (and words surrounding them). This is especially problematic in morphologically rich languages with long-tailed frequency distributions or domains with dynamic vocabularies (e.g. social media).

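Neural language models embed words into low-dimensional vectors so that semantically similar words lie close together in the vector space. However, word-level approaches such as Mikolov's word2vec do not handle subword information well: they cannot relate the different morphological forms of a word, nor can they recognize prefixes. As a result, the vector representations of rare words are inevitably poorly estimated, and rare words end up with high perplexity. This is a hard problem for morphologically rich languages, and the same issue arises with dynamic vocabularies (e.g. social media).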

## Recurrent Neural Network Language Model

$$\Pr(w_{t+1}=j\mid w_{1:t})=\dfrac{\exp(h_t\cdot p^j+q^j)}{\sum_{j'\in V}\exp(h_t\cdot p^{j'}+q^{j'})}$$

$$NLL=-\sum_{t=1}^{T}\log\Pr(w_t\mid w_{1:t-1})$$
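The two formulas above can be made concrete with a small sketch in plain Python. It assumes that `P` holds one output embedding $p^j$ per vocabulary word and `q` one bias $q^j$ per word. The function names, the toy values, and the perplexity line (computed here as $\exp(NLL/T)$) are illustrative assumptions, not part of the original notes.

```python
import math

def next_word_distribution(h_t, P, q):
    """Softmax over the vocabulary: Pr(w_{t+1} = j | w_{1:t}).

    h_t: hidden state at time t (list of m floats)
    P:   output embeddings, one length-m vector p^j per vocabulary word
    q:   biases, one scalar q^j per vocabulary word
    """
    scores = [sum(h * p for h, p in zip(h_t, p_j)) + q_j
              for p_j, q_j in zip(P, q)]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(hidden_states, targets, P, q):
    """NLL = -sum_t log Pr(w_t | w_{1:t-1}).

    hidden_states[i] is assumed to summarize the words preceding targets[i].
    """
    nll = 0.0
    for h, w in zip(hidden_states, targets):
        probs = next_word_distribution(h, P, q)
        nll -= math.log(probs[w])
    return nll

# Toy example (made-up numbers): vocabulary of 3 words, hidden size 2.
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [0.0, 0.0, 0.0]
hidden_states = [[0.5, -0.5], [0.0, 0.0]]
targets = [0, 2]

probs = next_word_distribution(hidden_states[0], P, q)
nll = negative_log_likelihood(hidden_states, targets, P, q)
perplexity = math.exp(nll / len(targets))
```

In a real model the hidden states would come from the RNN; here they are fixed by hand so that only the output layer and the loss are exercised.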

## Character-level Convolutional Neural Network

$$f^k[i]=\tanh(\langle C^k[*,i:i+w-1],H\rangle+b)$$

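The angle brackets denote the Frobenius inner product, which carries out the convolution at each position. A bias is then added, followed by the tanh nonlinearity.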
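As a rough illustration, the feature map for one word and one filter can be computed as below. The sketch assumes `C` is the $d\times l$ character embedding matrix of the word and `H` is a $d\times w$ filter, and it takes the window at position $i$ to be the $w$ consecutive columns starting at $i$ (0-indexed in the code). The max-over-time step at the end is not described above; it is included as an assumption about how the feature map is reduced to a single feature per filter. All names and numbers are hypothetical.

```python
import math

def conv_feature_map(C, H, b):
    """Feature map f[i] = tanh(<window_i, H> + b) for every valid position i.

    C: character embeddings of one word, d rows x l columns
    H: one convolutional filter, d rows x w columns
    b: scalar bias
    """
    d, l, w = len(C), len(C[0]), len(H[0])
    feature_map = []
    for i in range(l - w + 1):
        # Frobenius inner product: elementwise product summed over all entries.
        inner = sum(C[r][i + c] * H[r][c] for r in range(d) for c in range(w))
        feature_map.append(math.tanh(inner + b))
    return feature_map

def max_over_time(feature_map):
    """Keep the strongest response of the filter over all positions."""
    return max(feature_map)

# Toy word with l = 5 characters, embedding size d = 2, filter width w = 2.
C = [[1.0, 0.0, -1.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, -0.5, 1.0]]
H = [[1.0, 0.0],
     [0.0, 1.0]]
f = conv_feature_map(C, H, b=0.0)
y = max_over_time(f)
```

A word of length $l$ and a filter of width $w$ yield $l-w+1$ positions, so the toy example produces a feature map of length 4.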

## Highway Network

The highway network is easiest to understand by comparing two kinds of layer.

• One layer of an MLP applies an affine transformation followed by a nonlinearity $g$:

$$z=g(Wy+b)$$

• One layer of a highway network adds a gating mechanism, somewhat like the gates in an LSTM:

$$z=t\odot g(W_Hy+b_H)+(1-t)\odot y$$

where $t=\sigma(W_Ty+b_T)$ is the transform gate and $1-t$ is the carry gate.
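The highway layer can be sketched in a few lines of plain Python. Here the gate is computed as $t=\sigma(W_Ty+b_T)$, the usual highway formulation, and $g$ is taken to be ReLU. Both choices, along with the function names and toy weights, are assumptions of this sketch.

```python
import math

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows), vector x, and bias b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def relu(v):
    return max(0.0, v)

def highway_layer(y, W_H, b_H, W_T, b_T, g=relu):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""
    t = [sigmoid(v) for v in affine(W_T, y, b_T)]   # transform gate
    h = [g(v) for v in affine(W_H, y, b_H)]         # candidate output
    return [t_i * h_i + (1.0 - t_i) * y_i for t_i, h_i, y_i in zip(t, h, y)]

# Toy input and weights (made up for illustration).
y = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

half_open = highway_layer(y, identity, [0.0, 0.0], zeros, [0.0, 0.0])
carry_only = highway_layer(y, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
transform_only = highway_layer(y, identity, [0.0, 0.0], zeros, [50.0, 50.0])
```

With a strongly negative gate bias the layer passes its input through almost unchanged, and with a strongly positive bias it behaves like the plain MLP layer. Note that the input and output must have the same dimension so that they can be mixed elementwise.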

# ELMo

## ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM.

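In practice, ELMo serves as an intermediate representation layer that feeds a downstream task, much like BERT. One difference is that the vector representations from each ELMo layer capture different information. Lower layers capture more syntactic information and are better suited to part-of-speech tagging, while higher layers capture more context-dependent semantic information and are better suited to word sense disambiguation. Different tasks therefore make different use of the representations from each layer.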
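One way to picture the task-specific combination is as a weighted sum of the layer representations for a token, with softmax-normalized weights per layer and a single scalar scale. This follows the formulation in the ELMo paper as I understand it. The function names, argument names, and toy vectors below are hypothetical, and in a real model the weights and scale would be learned with the downstream task.

```python
import math

def softmax(weights):
    """Normalize raw layer weights so they are positive and sum to 1."""
    max_w = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - max_w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_reps, raw_weights, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector.

    layer_reps:  one vector per layer (token layer plus each biLSTM layer)
    raw_weights: one unnormalized scalar per layer (learned per task)
    gamma:       scalar that rescales the whole combined vector
    """
    s = softmax(raw_weights)
    dim = len(layer_reps[0])
    return [gamma * sum(s_j * h[k] for s_j, h in zip(s, layer_reps))
            for k in range(dim)]

# Toy token with 3 layers (token layer + 2 biLSTM layers), 2 dimensions each.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

averaged = elmo_combine(layers, [0.0, 0.0, 0.0])        # equal weights
top_heavy = elmo_combine(layers, [0.0, 0.0, 100.0])     # almost only the top layer
scaled = elmo_combine(layers, [0.0, 0.0, 0.0], gamma=3.0)
```

A task such as part-of-speech tagging would be expected to learn larger weights for the lower layers, and word sense disambiguation larger weights for the higher ones, which matches the layer-wise behavior described above.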

## Model architecture

The final model uses L = 2 biLSTM layers with 4096 units and 512-dimensional projections, and a residual connection from the first to the second layer. The context-insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers and a linear projection down to a 512-dimensional representation.

Xie Pan

2018-09-24

2021-06-29