# cs224d-lecture8-RNN

• Language models

• Recurrent neural networks

• Causes of and fixes for the vanishing and exploding gradient problems

• Bidirectional RNNs, deep bi-RNNs

• A presentation on dependency parsing

### Language Model

$$P(w_1,\dots,w_m)=\prod_{i=1}^{m}P(w_i|w_1,\dots,w_{i-1})\approx\prod_{i=1}^{m}P(w_i|w_{i-n},\dots,w_{i-1})$$

$$P(w_2|w_1)=\dfrac{\text{count}(w_1,w_2)}{\text{count}(w_1)}$$

$$P(w_3|w_1,w_2)=\dfrac{\text{count}(w_1,w_2,w_3)}{\text{count}(w_1,w_2)}$$
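A minimal sketch of these count-based estimates, using an assumed toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus (assumption)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("the", "cat"))  # 2 of the 3 "the" tokens are followed by "cat"
```

The trigram estimate works the same way, with `Counter(zip(corpus, corpus[1:], corpus[2:]))` in the numerator.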

For instance, consider a case where an article discusses the history of Spain and France, and somewhere later in the text it reads "The two countries went to battle"; clearly the information in this sentence alone is not sufficient to identify the names of the two countries.

$$\hat y=\mathrm{softmax}(W^{(2)}\tanh(W^{(1)}x+b^{(1)})+W^{(3)}x+b^{(3)})$$

• $W^{(1)}$ is applied to the word vectors (solid green arrows)

• $W^{(2)}$ is applied to the hidden layer

• $W^{(3)}$ is applied to the word vectors (dashed green arrows)
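A numpy sketch of this feedforward language model with the direct word-vector connection; all dimensions and the random inputs below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H, V = 50, 100, 1000                    # assumed sizes: input, hidden, vocab
x = rng.standard_normal(d)                 # concatenated context word vectors

W1 = rng.standard_normal((H, d)) * 0.01    # word vectors -> hidden (solid arrows)
b1 = np.zeros(H)
W2 = rng.standard_normal((V, H)) * 0.01    # hidden -> output
W3 = rng.standard_normal((V, d)) * 0.01    # word vectors -> output (dashed arrows)
b3 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())                # shift for numerical stability
    return e / e.sum()

# y_hat = softmax(W2 tanh(W1 x + b1) + W3 x + b3)
y_hat = softmax(W2 @ np.tanh(W1 @ x + b1) + W3 @ x + b3)
```

The `W3 @ x` term is the skip connection from the word vectors straight to the output layer.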

### Recurrent Neural Network Language Model

$$h_t = \sigma(W^{(hh)}h_{t-1}+W^{(hx)}x_{[t]})$$

$$\hat y_t = \mathrm{softmax}(W^{(S)}h_t)$$

shapes:

• $h_0\in R^{D_h}$ is an initialization vector for the hidden layer at time step 0
• $x_{[t]}\in R^{d}$ is the column vector of the embedding matrix $L$ at index $[t]$ at time step $t$
• $W^{(hh)}\in R^{D_h\times D_h}$
• $W^{(hx)}\in R^{D_h\times d}$
• $W^{(S)}\in R^{|V|\times D_h}$

$\hat y \in R^{|V|}$ is the probability distribution over the vocabulary $V$ obtained via the softmax.
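A numpy sketch of one recurrent step under the definitions above; the dimensions and random embeddings are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, Dh, V = 50, 64, 1000            # assumed dims: embedding, hidden, vocab

Whh = rng.standard_normal((Dh, Dh)) * 0.01   # hidden -> hidden
Whx = rng.standard_normal((Dh, d)) * 0.01    # input -> hidden
Ws  = rng.standard_normal((V, Dh)) * 0.01    # hidden -> vocab logits

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One time step: h_t = sigma(Whh h_{t-1} + Whx x_[t]), y_hat = softmax(Ws h_t)."""
    h_t = sigmoid(Whh @ h_prev + Whx @ x_t)
    y_hat = softmax(Ws @ h_t)
    return h_t, y_hat

h = np.zeros(Dh)                                # h_0: initialization vector
for x_t in rng.standard_normal((5, d)):         # 5 stand-in word embeddings
    h, y_hat = rnn_step(h, x_t)
```

The same weight matrices are reused at every time step, which is what makes the model recurrent.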

$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|}y_{t,j}\log\hat y_{t,j}$$

$y_{t,j}$ is the one-hot vector for the actual word at the current time step.

$$J=\dfrac{1}{T}\sum_{t=1}^T J^{(t)}(\theta)=-\dfrac{1}{T}\sum_{t=1}^T\sum_{j=1}^{|V|}y_{t,j}\log(\hat y_{t,j})$$

$$Perplexity=2^J$$
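A small worked example of the averaged cross-entropy loss and perplexity; the toy distributions are assumptions, and the base-2 logarithm is used so that $2^J$ gives the perplexity:

```python
import numpy as np

# toy predicted distributions and targets for T = 3 steps, |V| = 4 (assumption)
y_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
targets = np.array([0, 1, 3])      # index of the actual word at each step

# J = -(1/T) sum_t log2 y_hat[t, target_t]   (one-hot targets pick out one term)
J = -np.mean(np.log2(y_hat[np.arange(3), targets]))
perplexity = 2.0 ** J
```

Perplexity is the geometric mean of the inverse predicted probabilities, so a perfect model (probability 1 on every target word) has perplexity 1.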

#### Vanishing and Exploding Gradients

Training RNNs is incredibly hard because of the vanishing and exploding gradient problems.

$$h_t=\sigma (Wf(h_{t-1})+W^{(hx)}x_{[t]})$$

$$\hat y = softmax(W^{(S)}f(h_t))$$

$$\dfrac{\partial E_t}{\partial W} = \sum_{k=1}^t\dfrac{\partial E_t}{\partial y_t}\dfrac{\partial y_t}{\partial h_t}\dfrac{\partial h_t}{\partial h_k}\dfrac{\partial h_k}{\partial W}$$

$$\dfrac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^t\dfrac{\partial h_j}{\partial h_{j-1}}=\prod_{j=k+1}^t W^T\times diag[f'(h_{j-1})]$$

$$\dfrac{\partial E_t}{\partial W} = \sum_{k=1}^t\dfrac{\partial E_t}{\partial y_t}\dfrac{\partial y_t}{\partial h_t}(\prod_{j=k+1}^t\dfrac{\partial h_j}{\partial h_{j-1}})\dfrac{\partial h_k}{\partial W}$$
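A numerical sketch of why this Jacobian product vanishes: with a sigmoid nonlinearity $f'$ is at most 0.25, so repeated factors $W^T\,diag[f'(h_{j-1})]$ with a modest-norm $W$ shrink the product exponentially in the path length. The matrix scale and the stand-in derivative values below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
Dh = 20
W = rng.standard_normal((Dh, Dh)) * 0.1   # small-norm recurrent weights (assumption)

prod = np.eye(Dh)
for _ in range(20):                        # 20 time steps back through time
    fprime = rng.uniform(0.0, 0.25, Dh)    # stand-in for sigmoid f'(h_{j-1}) values
    prod = prod @ (W.T * fprime)           # W^T @ diag(fprime)

print(np.linalg.norm(prod))                # tiny: the gradient has vanished
```

With large-norm $W$ the same product blows up instead, which is the exploding-gradient case.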

### Tricks for Handling Exploding or Vanishing Gradients

• Solid lines denote standard gradient descent trajectories

• Dashed lines denote gradients rescaled to a fixed size (gradient clipping)

• Parameter initialization

• ReLUs (rectified linear units)
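The "rescaled to a fixed size" trick (gradient clipping) can be sketched as follows; the threshold value is an assumption:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient whenever its norm exceeds the threshold
    (the dashed-line trajectories): g <- threshold * g / ||g||."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.full(100, 10.0)                 # an exploding gradient with ||g|| = 100
g = clip_gradient(g, threshold=5.0)    # now ||g|| = 5, direction unchanged
```

Clipping bounds the step size while preserving the gradient's direction, which is why it helps with explosion but not with vanishing.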

### Other Tasks for Sequence Models

Classify each word into task-specific categories (e.g., named-entity tags).

#### Bidirectional RNNs

Deep bidirectional RNNs

$$\overrightarrow {h_t^{(i)}}=f(\overrightarrow{W^{(i)}}h_t^{(i-1)}+\overrightarrow{V^{(i)}}h_{t-1}^{(i)}+\overrightarrow{b^{(i)}})$$

$$\overleftarrow {h_t^{(i)}}=f(\overleftarrow{W^{(i)}}h_t^{(i-1)}+\overleftarrow{V^{(i)}}h_{t+1}^{(i)}+\overleftarrow{b^{(i)}})$$

$$\hat y_t=g(Uh_t+c)=g(U[\overrightarrow{h_t^{(L)}};\overleftarrow{h_t^{(L)}}]+c)$$
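A numpy sketch of a single-layer bidirectional RNN (the $L=1$ case of the equations above, with $h_t^{(0)}$ taken to be the input $x_t$); all dimensions and random inputs are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, Dh, C, T = 8, 16, 5, 10         # assumed dims: input, hidden, classes, steps
xs = rng.standard_normal((T, d))   # stand-in input sequence

def make_params():
    return (rng.standard_normal((Dh, d)) * 0.1,    # W: input -> hidden
            rng.standard_normal((Dh, Dh)) * 0.1,   # V: hidden -> hidden
            np.zeros(Dh))                          # b

(Wf, Vf, bf), (Wb, Vb, bb) = make_params(), make_params()
U = rng.standard_normal((C, 2 * Dh)) * 0.1
c = np.zeros(C)

# forward pass left-to-right, backward pass right-to-left
h_fwd, h = [], np.zeros(Dh)
for t in range(T):
    h = np.tanh(Wf @ xs[t] + Vf @ h + bf)
    h_fwd.append(h)
h_bwd, h = [None] * T, np.zeros(Dh)
for t in reversed(range(T)):
    h = np.tanh(Wb @ xs[t] + Vb @ h + bb)
    h_bwd[t] = h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# per-step prediction from the concatenated forward and backward states
y_hat = [softmax(U @ np.concatenate([h_fwd[t], h_bwd[t]]) + c) for t in range(T)]
```

In the deep version, the concatenated states of layer $i-1$ become the inputs $h_t^{(i-1)}$ to both directions of layer $i$, and only the top layer $L$ feeds the classifier.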

### Presentation

Structured Training for Neural Network Transition-Based Parsing, David Weiss, Chris Alberti, Michael Collins, Slav Petrov

Xie Pan

2018-05-04

2021-06-29