# Chapter 4: Language Models and N-grams

• N-gram models: a Markov chain is a special case

• Evaluating N-gram models: perplexity

• Preprocessing: generalization and the problem of zeros

• Smoothing: techniques for handling zeros

• Perplexity and cross-entropy

### N-grams

$$P(w_1…w_n) = P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2)…P(w_n|w_1^{n-1})$$

$$P(w_1…w_n) = \prod_{k=1}^nP(w_k|w_1^{k-1})$$

$$P(w_n|w_1^{n-1}) = P(w_n|w_{n-1})$$

$$P(w_n|w_{n-1}) = \dfrac{C(w_{n-1}w_n)}{C(w_{n-1})}$$

$$P(w_n|w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}w_n)}{C(w_{n-N+1}^{n-1})}$$
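The maximum-likelihood bigram estimate above can be sketched in a few lines. This is a minimal illustration (the toy corpus is the classic "I am Sam" example), not a production implementation:

```python
from collections import Counter

def mle_bigram_prob(sentences):
    """Estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) by MLE.

    Each sentence is padded with <s> and </s> boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])             # context counts C(w_{n-1})
        bigrams.update(zip(tokens, tokens[1:]))  # pair counts C(w_{n-1} w_n)
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

corpus = [["I", "am", "Sam"],
          ["Sam", "I", "am"],
          ["I", "do", "not", "like", "green", "eggs", "and", "ham"]]
p = mle_bigram_prob(corpus)
print(p("<s>", "I"))   # 2 of the 3 sentences start with "I": 2/3
print(p("am", "Sam"))  # "am" is followed by "Sam" once out of 2 occurrences: 0.5
```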

Some practical issues:

• Trigram, 4-gram, or even 5-gram models usually work better than bigrams, but they require a much larger training corpus.

• Since each conditional probability lies between 0 and 1, multiplying many of them yields ever smaller values and can cause numerical underflow. Log probabilities are therefore used instead: products become sums, and the result stays in a representable range.
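The underflow problem is easy to demonstrate. Multiplying 80 probabilities of 1e-5 drops below the smallest representable double, while the log-space sum stays finite:

```python
import math

# Multiplying many probabilities in (0, 1) underflows; summing logs does not.
probs = [1e-5] * 80

product = 1.0
for prob in probs:
    product *= prob                             # 1e-400 is below float range

log_sum = sum(math.log(prob) for prob in probs)  # stays finite

print(product)   # 0.0  (underflowed)
print(log_sum)   # ≈ -921.03
```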

### Evaluating Language Models

development set (devset): a held-out set used for model tuning, so that the final test set stays fresh.

A common split: 80% training, 10% development, 10% test.

#### Generalization and Zeros

The N-gram model, like many statistical models, is dependent on the training corpus. One implication of this is that the probabilities often encode specific facts about a given training corpus. Another implication is that N-grams do a better and better job of modeling the training corpus as we increase the value of N.

##### the value of N

• In the training data, append </s> to every sentence in the corpus (there is no need to add <s> for this purpose). No actual training happens here; we simply compute the corresponding unigram, bigram, and trigram probabilities.

• Then we evaluate on the test data to judge which model is better and which is worse. This is intrinsic evaluation.

• We can also generate text from the model: sample a first word, then repeatedly sample the next word from the conditional distribution given the previous one, until </s> is produced.
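The generation procedure above can be sketched for a bigram model. This is a minimal illustration; the tiny corpus is assumed for demonstration only:

```python
import random
from collections import Counter, defaultdict

def sample_sentence(sentences, seed=0):
    """Generate a sentence by repeatedly sampling the next word from the
    bigram distribution, starting from <s> and stopping at </s>."""
    following = defaultdict(Counter)   # prev word -> counts of next words
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            following[prev][w] += 1

    rng = random.Random(seed)
    out, prev = [], "<s>"
    while True:
        words = list(following[prev])
        weights = [following[prev][w] for w in words]
        prev = rng.choices(words, weights=weights)[0]  # sample ∝ bigram count
        if prev == "</s>":
            return out
        out.append(prev)

corpus = [["I", "am", "Sam"], ["Sam", "I", "am"]]
print(sample_sentence(corpus, seed=1))
```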

##### the corpus

How should we deal with this problem when we build N-gram models? One way is to be sure to use a training corpus that has a similar genre to whatever task we are trying to accomplish. To build a language model for translating legal documents, we need a training corpus of legal documents. To build a language model for a question-answering system, we need a training corpus of questions.

##### Unknown Words

closed vocabulary: every word in the test set comes from a known lexicon. Tasks such as speech recognition and machine translation can often be treated this way, so unknown words do not appear.

1. The first approach is to convert the problem to a closed vocabulary: fix a vocabulary in advance, and map any out-of-vocabulary word to `<UNK>`.

2. The second approach is easier to apply: replace words that occur only a few times in the training data with `<UNK>`, which is then treated like an ordinary word and receives its own probabilities. One drawback: the resulting perplexity depends on the chosen vocabulary, so perplexities are only comparable between models that use the same vocabulary.
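The second approach, mapping rare training words to `<UNK>`, can be sketched as follows; `min_count` is an illustrative threshold, not a value from the text:

```python
from collections import Counter

def replace_rare(sentences, min_count=2, unk="<UNK>"):
    """Map words seen fewer than min_count times to <UNK>, so that unknown
    test words can later be scored like any other vocabulary item."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else unk for w in sent] for sent in sentences]

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(replace_rare(train))
# [['the', '<UNK>', 'sat'], ['the', '<UNK>', 'sat']]
```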

### Smoothing

Common techniques include Laplace (add-one) smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.

#### Laplace Smoothing

$$P(want|I) = \dfrac{C(I\ want)}{C(I)} \quad\Rightarrow\quad P_{Laplace}(want|I) = \dfrac{C(I\ want)+1}{C(I)+V}$$

$$P_{add-k}(w_i|w_{i-n+1}\cdots w_{i-1}) = \frac{C(w_{i-n+1}\cdots w_i)+k}{C(w_{i-n+1}\cdots w_{i-1})+k|V|}$$
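The add-k estimate can be sketched for bigrams; with k=1 it reduces to Laplace smoothing. A minimal illustration on a toy corpus:

```python
from collections import Counter

def add_k_bigram(sentences, k=1.0):
    """P_add-k(w | prev) = (C(prev w) + k) / (C(prev) + k * |V|)."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens[1:])                 # predicted words, incl. </s>
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    V = len(vocab)
    return lambda prev, w: (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

corpus = [["I", "am", "Sam"], ["Sam", "I", "am"]]
p = add_k_bigram(corpus, k=1.0)
# The unseen bigram "am I" now gets nonzero probability:
print(p("am", "I"))   # (0 + 1) / (2 + 1*4) = 1/6
```

Note that for any fixed context the smoothed probabilities still sum to 1 over the vocabulary.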

#### Backoff and Interpolation

##### Backoff

$$\hat P(w_i|w_{i-2}w_{i-1}) =\begin{cases}P(w_i|w_{i-2}w_{i-1}), & \text{if}\quad C(w_{i-2}w_{i-1}w_i)>0\\ \alpha_1 P(w_i|w_{i-1}), & \text{else if}\quad C(w_{i-1}w_i)>0\\ \alpha_2 P(w_i), & \text{otherwise} \end{cases}$$

$$P_{BO}(w_n|w_{n-N+1}^{n-1}) =\begin{cases} \tilde P(w_n|w_{n-N+1}^{n-1}), & \text{if}\quad C(w_{n-N+1}^{n})>0\\ \theta\big(P(w_n|w_{n-N+1}^{n-1})\big)\,\alpha\, P_{BO}(w_n|w_{n-N+2}^{n-1}), & \text{otherwise}\end{cases}$$

$$\theta(x) =\begin{cases}1, & \text{if}\quad x=0\\ 0, & \text{otherwise}\end{cases}$$

##### Interpolation

$$\hat P(w_n|w_{n-2}w_{n-1})=\lambda_1P(w_n|w_{n-2}w_{n-1})+\lambda_2P(w_n|w_{n-1})+\lambda_3P(w_n)$$

The weights $\lambda_i$ can be set empirically by experiment, or learned by an algorithm such as EM on held-out data.

A more sophisticated version conditions the weights on the context:

$$\hat P(w_n|w_{n-2}w_{n-1})=\lambda_1(w_{n-2}w_{n-1})P(w_n|w_{n-2}w_{n-1})+\lambda_2(w_{n-2}w_{n-1})P(w_n|w_{n-1})+\lambda_3(w_{n-2}w_{n-1})P(w_n)$$
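The fixed-weight version is just a weighted sum. A minimal sketch, with hypothetical component probabilities and lambdas chosen only for illustration:

```python
def interpolate(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    """Fixed-weight linear interpolation of trigram (p3), bigram (p2),
    and unigram (p1) probabilities. The lambdas must sum to 1 so that
    the interpolated values remain a probability distribution."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p3 + l2 * p2 + l3 * p1

# Even when the trigram probability is zero (unseen trigram), the
# lower-order terms keep the estimate nonzero:
print(interpolate(p3=0.0, p2=0.2, p1=0.05))   # 0.3*0.2 + 0.1*0.05 = 0.065
```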

#### Absolute Discounting

$$P_{AbsDiscount}(w_i|w_{i-1})=\frac{C(w_{i-1}w_i)-d}{C(w_{i-1})}+\lambda(w_{i-1})P(w_i)$$

#### Kneser-Ney Discounting

Kneser-Ney discounting builds on absolute discounting by treating the lower-order distribution differently. Consider:

I can't see without my reading _______.

A raw unigram model may prefer a frequent word such as "Francisco", even though it occurs almost exclusively after "San"; the continuation probability instead favors "glasses", which follows many different words. So the unigram term is replaced by $P_{continuation}$, which measures how many distinct contexts a word completes:

$$P_{continuation}(w_i)\propto \lvert \{w_{i-1}:C(w_{i-1}w_i)>0\}\rvert$$

$$P_{continuation}(w_i)=\frac{\lvert \{w_{i-1}:C(w_{i-1}w_i)>0\}\rvert}{\lvert\{(w_{j-1},w_j):C(w_{j-1}w_j)>0\}\rvert}$$

$$P_{continuation}(w_i)=\frac{\lvert \{w_{i-1}:C(w_{i-1}w_i)>0\}\rvert}{\sum_{w'_i}\lvert\{w'_{i-1}:C(w'_{i-1}w'_i)>0\}\rvert}$$

$$P_{KN}(w_i|w_{i-1})=\frac{\max(C(w_{i-1}w_i)-d,0)}{C(w_{i-1})}+\lambda(w_{i-1})P_{continuation}(w_i)\tag{4.33}$$

$$\lambda(w_{i-1})=\frac{d}{C(w_{i-1})}\cdot \lvert \{w:C(w_{i-1}w)>0\}\rvert\tag{4.34}$$

In (4.34), the first term $\dfrac{d}{C(w_{i-1})}$ is the normalized discount; the second term $\lvert \{w:C(w_{i-1}w)>0\}\rvert$ is the number of word types that were discounted, i.e., the number of distinct words that can follow $w_{i-1}$.

$$P_{KN}(w_i|w_{i-n+1}\cdots w_{i-1})=\frac{max(0,C_{KN}(w_{i-n+1} \cdots w_i) - d)}{C_{KN}(w_{i-n+1}\cdots w_{i-1})} +\lambda(w_{i-n+1}\cdots w_{i-1})\cdot P_{KN}(w_i|w_{i-n+2}\cdots w_{i-1})$$

$$C_{KN}(\cdot)=\begin{cases}\text{count}(\cdot), & \text{for the highest order}\\ \text{continuation count}(\cdot), & \text{for all other lower orders} \end{cases}$$
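Equations (4.33) and (4.34) can be sketched for the bigram case. This is a minimal illustration that ignores the higher-order continuation-count refinement; the discount d = 0.75 is a conventional choice, not a value fitted to data:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences, d=0.75):
    """Interpolated Kneser-Ney for bigrams (eqs. 4.33-4.34):
    P_KN(w|prev) = max(C(prev w) - d, 0) / C(prev) + lambda(prev) * P_cont(w)."""
    unigrams, bigrams = Counter(), Counter()
    predecessors = defaultdict(set)   # w -> set of distinct words preceding w
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        for prev, w in zip(tokens, tokens[1:]):
            bigrams[(prev, w)] += 1
            predecessors[w].add(prev)
    n_bigram_types = len(bigrams)

    def p(prev, w):
        # Continuation probability: fraction of bigram types that w completes.
        p_cont = len(predecessors[w]) / n_bigram_types
        # lambda(prev): normalized discount times number of discounted types.
        n_following = sum(1 for (a, _) in bigrams if a == prev)
        lam = d / unigrams[prev] * n_following
        return max(bigrams[(prev, w)] - d, 0) / unigrams[prev] + lam * p_cont

    return p

corpus = [["I", "am", "Sam"], ["Sam", "I", "am"]]
p = kneser_ney_bigram(corpus)
```

A sanity check on this construction: because the mass removed by discounting equals the mass redistributed through lambda, the probabilities for a fixed context still sum to 1.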

### Advanced: Perplexity’s Relation to Entropy

#### Entropy

$$H(X)=-\sum_{x} p(x)\log_2 p(x)$$

$$H(w_1,w_2,\ldots,w_n)=-\sum_{w_1^n\in L}p(w_1^n)\log p(w_1^n)$$

The entropy rate can also be viewed as a per-word entropy:

$$\dfrac{1}{n}H(w_1,w_2,\ldots,w_n)=-\dfrac{1}{n}\sum_{w_1^n\in L}p(w_1^n)\log p(w_1^n)$$

#### Cross-entropy

$$H(L)=\lim_{n\to \infty}\dfrac{1}{n}H(w_1,w_2,\ldots,w_n)=-\lim_{n\to \infty}\dfrac{1}{n}\sum_{w_1^n\in L}p(w_1,\ldots,w_n)\log p(w_1,\ldots,w_n)$$

Cross-entropy: when we do not know the true distribution $p$ that generated the data, we can use a model distribution $m$ instead:

$$H(p,m)=\lim_{n\to \infty}-\dfrac{1}{n}\sum_{w_1^n\in L}p(w_1,\ldots,w_n)\log m(w_1,\ldots,w_n)$$

For a stationary ergodic process, this can be estimated from a single sufficiently long sequence, dropping the sum over $L$:

$$H(p,m)=\lim_{n\to \infty}-\dfrac{1}{n}\log m(w_1,w_2,\ldots,w_n)\tag{4.47}$$

##### Perplexity and cross-entropy

Let the N-gram model $M=P(w_i|w_{i-N+1} \cdots w_{i-1})$ be evaluated on a sequence $W$; by equation (4.47):

$$H(W) = -\dfrac{1}{N}\log P(w_1w_2 \cdots w_N)$$

Perplexity is then the exponential of this cross-entropy:

$$PP(W)=2^{H(W)}$$
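A minimal sketch of this computation, using base-2 logs. The `logprob` argument is assumed to return the model's per-word log2 probability; the uniform model below is chosen because its perplexity is known in closed form:

```python
import math

def perplexity(test_tokens, logprob):
    """PP(W) = 2^H(W), where H(W) = -(1/N) * sum of per-word log2 probs."""
    H = -sum(logprob(w) for w in test_tokens) / len(test_tokens)
    return 2 ** H

# Sanity check: a uniform model over a 10-word vocabulary assigns each word
# probability 1/10, so its perplexity should be exactly the vocabulary size.
uniform = lambda w: math.log2(1 / 10)
print(perplexity(["a", "b", "c"], uniform))   # ≈ 10.0
```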

### Summary

### References

• [1] Daniel Jurafsky & James H. Martin. *Speech and Language Processing*, 3rd ed., Chapter 4.

Xie Pan

2018-04-12

2021-06-29