cs224d-lecture11: A Second Look at GRUs and NMT

Main topics:

  • A closer look at the GRU
  • GRU vs. LSTM
  • A closer look at the LSTM
  • Tips for training RNNs
  • Ensemble
  • MT Evaluation
  • The computational cost of the softmax when generating words
  • presentation

A closer look at the GRU

shortcut connection

adaptive shortcut connections: the update gate adaptively creates shortcut connections

prune unnecessary connections adaptively: the reset gate adaptively prunes connections that are not needed.

A question that occurred to me: why are neural networks adaptive? My own understanding is that a neural network is a process of parameter learning and fitting; during gradient descent the model is optimized, and that optimization is what gives it its adaptivity.

question1:

How do you select the readable subset based on the reset gate?

\[r_t=\sigma(W^{(r)}x_t+U^{(r)}h_{t-1})\tag{reset gate}\]

\[\tilde h_t=\tanh(Wx_t+r_t\circ Uh_{t-1})\tag{new memory}\]

The reset gate decides which parts of the hidden state to read when updating it: it computes which parts to read from the current input and the previous hidden state. So it's going to say, okay, I want to pay a lot of attention to dimensions 7 and 52, and only a little to the others. Those are the parts that get read here and used in computing the new candidate update, which is then mixed together with what you were already carrying along from before.

Christopher Manning also gave an example here: suppose verbs are stored in dimensions 47-52 of the hidden state; when a new verb comes in, that slice of the hidden state gets updated. Reading this, I'd really like to print out \(r_t\) and watch how it changes across time steps.
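To act on that urge, here is a minimal NumPy sketch of a GRU cell (my own toy implementation, not the course code; the weight names, shapes, and random initialization are all assumptions). It follows the equations above and returns \(r_t\) and \(u_t\) at every step so they can be printed and inspected:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyGRU:
    """Minimal GRU cell; follows h_t = (1 - u_t) * h_tilde + u_t * h_{t-1}."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        def mat(rows, cols):
            return rng.normal(scale=0.1, size=(rows, cols))
        # Gate and candidate parameters (randomly initialized for the demo).
        self.Wr, self.Ur = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.Wz, self.Uz = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.W,  self.U  = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h_prev):
        r = sigmoid(self.Wr @ x + self.Ur @ h_prev)             # reset gate
        u = sigmoid(self.Wz @ x + self.Uz @ h_prev)             # update gate
        h_tilde = np.tanh(self.W @ x + r * (self.U @ h_prev))   # new memory
        h = (1 - u) * h_tilde + u * h_prev                      # hidden state
        return h, r, u

# Run on a random sequence and print the reset gate at each step.
gru = ToyGRU(input_dim=4, hidden_dim=6)
h = np.zeros(6)
for t, x in enumerate(np.random.default_rng(1).normal(size=(5, 4))):
    h, r, u = gru.step(x, h)
    print(f"t={t}  r_t={np.round(r, 2)}")
```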

question2:

How do you select the writable subset based on the update gate?

\[u_t=\sigma(W^{(z)}x_t+U^{(z)}h_{t-1})\tag{update gate}\]

\[h_t=(1-u_t)\circ \tilde h_t+u_t\circ h_{t-1} \tag{hidden state}\]

Some of the hidden state we just carry on from the past; we only edit part of the register. Saying "part of the register" is lying and simplifying a bit, because really you've got this vector of real numbers, so "part of the register" means something like 70% updating this dimension and 20% updating that dimension; the values could be exactly one or zero, but normally they won't be. So I choose the writable subset, and it's that part that I then update with my new candidate update, which gets written back onto it. Both gates fit this picture: one gate selects what to read for your candidate update, and the other gate says which parts of the hidden state to overwrite.

My reading: the update gate mainly controls how the hidden state \(h_t\) at the current time step is formed. If every component of the update gate equals 1, that means all of the previous information is carried over.
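Concretely, under the convention \(h_t=(1-u_t)\circ \tilde h_t+u_t\circ h_{t-1}\) used just above, the two extremes of the update gate are:

\[u_t=\mathbf{1} \;\Rightarrow\; h_t=h_{t-1} \tag{copy the old state through unchanged}\]

\[u_t=\mathbf{0} \;\Rightarrow\; h_t=\tilde h_t \tag{overwrite entirely with the candidate}\]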

question3:

How do these gates avoid vanishing gradients?

\[h_t=f(h_{t-1},x_t)=u_t \circ \tilde h_t + (1-u_t)\circ h_{t-1}\] The secret is the plus sign. (Note that this version swaps the roles of \(u_t\) and \(1-u_t\) relative to the hidden-state formula above; both conventions appear in the literature, and the argument works either way.)

Now look back at the vanishing-gradient formula:

\[\dfrac{\partial E_t}{\partial W} = \sum_{k=1}^t\dfrac{\partial E_t}{\partial y_t}\dfrac{\partial y_t}{\partial h_t}\dfrac{\partial h_t}{\partial h_k}\dfrac{\partial h_k}{\partial W}\]

As an example, take the derivative of the loss at \(t=3\) with respect to \(W_{hh}\): \[\begin{align} \dfrac{\partial E_3}{\partial W} &=\sum_{k=1}^3\dfrac{\partial E_3}{\partial y_3}\dfrac{\partial y_3}{\partial h_3}\dfrac{\partial h_3}{\partial h_k}\dfrac{\partial h_k}{\partial W}\\ &=\dfrac{\partial E_3}{\partial y_3}\dfrac{\partial y_3}{\partial h_3}(\dfrac{\partial h_3}{\partial W}+\dfrac{\partial h_3}{\partial h_2}\dfrac{\partial h_2}{\partial W}+\dfrac{\partial h_3}{\partial h_2}\dfrac{\partial h_2}{\partial h_1}\dfrac{\partial h_1}{\partial W}) \end{align}\]

You can see that the influence of an early hidden state such as \(h_1\) on the current time step shrinks as the gap grows, because it is multiplied by a longer and longer chain of Jacobians. In the GRU, when the update gate \(u_t=0\) (in the convention of the formula above), \(h_3=h_2\), so the information stored in earlier hidden states is passed along intact; that's why it can carry information for a very long time.
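To spell out why the plus sign helps (my own derivation, using the convention \(h_t=u_t\circ \tilde h_t+(1-u_t)\circ h_{t-1}\) from the formula above), differentiate \(h_t\) with respect to \(h_{t-1}\):

\[\dfrac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1-u_t) + \operatorname{diag}(\tilde h_t-h_{t-1})\dfrac{\partial u_t}{\partial h_{t-1}} + \operatorname{diag}(u_t)\dfrac{\partial \tilde h_t}{\partial h_{t-1}}\]

When \(u_t\approx 0\) and the gates are saturated (so their derivatives are small), this Jacobian is close to the identity matrix, and a product of near-identity factors does not shrink the gradient. In the vanilla RNN, by contrast, every factor in the chain has the form \(\operatorname{diag}(f'(\cdot))\,W_{hh}\), which can shrink (or blow up) the gradient at every step.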

question4:

how long does a GRU actually end up remembering for?

Answer: I kind of think the order-of-magnitude number you want in your head is 100 steps. So they don't remember forever; I think that's something people also get wrong.

question5:

Does a GRU train faster than an LSTM?

Answer: LSTMs have a slight edge on speed. No huge difference.

Differences between the GRU and the LSTM

question6:

In the LSTM, why is \(\tanh\) used in \(h_t=o_t\circ \tanh(c_t)\)?

TA Richard's explanation: the memory cell \(c_t = f_t\circ c_{t-1}+i_t\circ \tilde c_t\) is formed by a purely linear (additive) layer, so applying the \(\tanh\) nonlinearity makes the LSTM more powerful (it also keeps \(h_t\) bounded, since \(c_t\) can grow through repeated addition).

An intuitive diagram of the LSTM

It's about as clear as it gets. One difference from the formulation above: here \(h_{t-1}\) and \(x_t\) are concatenated, e.g. for the three gates:

\[i_t = \sigma (W_i[h_{t-1},x_t]+b_i)\tag{input/update gate}\]

\[o_t = \sigma (W_o[h_{t-1},x_t]+b_o)\tag{output gate}\]

\[f_t = \sigma (W_f[h_{t-1},x_t]+b_f)\tag{forget gate}\]

The candidate (new memory cell) \(\tilde c_t\): \[\tilde c_t=\tanh(W_c[h_{t-1}, x_t]+b_c)\]

The final cell state \(c_t\):

\[c_t= f_t\circ c_{t-1}+i_t\circ \tilde c_t\]

The final hidden state \(h_t\): \[h_t=o_t\circ \tanh(c_t)\]
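A minimal NumPy sketch of one step of this concatenated formulation (my own toy code, not the course implementation; the dimensions and random initialization are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step using the [h_{t-1}, x_t] concatenation form."""
    W_i, b_i, W_o, b_o, W_f, b_f, W_c, b_c = params
    z = np.concatenate([h_prev, x])              # [h_{t-1}, x_t]
    i = sigmoid(W_i @ z + b_i)                   # input/update gate
    o = sigmoid(W_o @ z + b_o)                   # output gate
    f = sigmoid(W_f @ z + b_f)                   # forget gate
    c_tilde = np.tanh(W_c @ z + b_c)             # candidate memory
    c = f * c_prev + i * c_tilde                 # additive cell update
    h = o * np.tanh(c)                           # hidden state
    return h, c

# Toy usage: random weights, one step.
input_dim, hidden_dim = 4, 6
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)),
               np.zeros(hidden_dim)]
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c, tuple(params))
print(h.round(3))
```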

The core of the LSTM, similar to a ResNet:

The additive update (the plus sign in the diagram) is what lets the network carry long-range dependencies; the vanilla RNN only has a matrix multiply between hidden states.

Some practical tips for training RNNs

Tip 7: never apply dropout horizontally (on the recurrent connections between time steps); that throws away a lot of information. Apply it vertically instead, e.g. on the inputs and outputs between stacked layers, as in the sketch below.
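A minimal sketch of what "vertical but not horizontal" dropout means in code (my own illustration; the plain tanh RNN step and the dropout rate are stand-ins for whatever recurrent cell and rate you actually use):

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h_prev, Wxh, Whh):
    """A plain tanh RNN step, standing in for any recurrent cell."""
    return np.tanh(Wxh @ x + Whh @ h_prev)

def run_with_vertical_dropout(xs, Wxh, Whh, p_drop=0.5, train=True):
    """Apply dropout to the inputs entering the layer at each step (vertical),
    but never to h_{t-1} flowing along the time axis (horizontal)."""
    hidden_dim = Whh.shape[0]
    h = np.zeros(hidden_dim)
    outputs = []
    for x in xs:
        if train:
            mask = (rng.random(x.shape) > p_drop) / (1 - p_drop)  # inverted dropout
            x = x * mask                       # vertical: drop input features
        h = rnn_step(x, h, Wxh, Whh)           # horizontal: h passes through untouched
        outputs.append(h)
    return np.stack(outputs)

# Toy usage.
input_dim, hidden_dim = 4, 6
Wxh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Whh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
xs = rng.normal(size=(5, input_dim))
print(run_with_vertical_dropout(xs, Wxh, Whh).shape)  # (5, 6)
```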

Ensemble

I've read before that in CNNs dropout can be viewed as an ensemble of many models; I'm not sure whether the same interpretation applies to RNNs.
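For comparison, here is a minimal sketch of the explicit kind of ensembling used for sequence models: average the per-step output distributions of several independently trained models (my own illustration, not the course code; the models below are fake stand-ins):

```python
import numpy as np

def ensemble_next_word_probs(models, context):
    """Average the next-word distributions predicted by several models.

    `models` is a list of callables mapping a context to a probability
    vector over the vocabulary (stand-ins for independently trained decoders).
    """
    probs = np.mean([m(context) for m in models], axis=0)
    return probs / probs.sum()   # renormalize against rounding error

# Toy usage with fake models over a 5-word vocabulary.
def make_fake_model(seed):
    local = np.random.default_rng(seed)
    def model(context):
        logits = local.normal(size=5)
        e = np.exp(logits - logits.max())
        return e / e.sum()
    return model

models = [make_fake_model(s) for s in range(3)]
print(ensemble_next_word_probs(models, context=["<s>", "the"]))
```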

MT Evaluation

Evaluating machine translation models is a notoriously tricky and subjective task. The standard automatic metric comes from the paper BLEU: a Method for Automatic Evaluation of Machine Translation.

The idea: n-gram matches

\(p_n\) = # matched n-grams / # n-grams in candidate translation

which is really just precision.

\(p_n\) is the n-gram precision score, and \(w_n=1/2^n\) is used as the corresponding weight.

Brevity penalty: short translations get high precision too easily, so they need to be penalized (here \(c\) is the candidate length and \(r\) is the reference length):

\[BP=\begin{cases} 1, & \text{if } c > r\\ e^{1-r/c}, & \text{if } c \le r \end{cases}\]

BLEU:

\[BLEU=BP\cdot \exp\Big(\sum_{n=1}^N w_n\log p_n\Big)\]
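A toy pure-Python version of this formula (my own illustration, using the \(w_n=1/2^n\) weights above and clipped n-gram counts; real BLEU is computed over a whole corpus, often with multiple references, so treat this strictly as a sketch):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: brevity penalty times exp of the weighted
    sum of log clipped n-gram precisions, with w_n = 1/2^n."""
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        p_n = matched / total
        if p_n == 0:
            return 0.0                       # any zero precision kills the score
        log_precisions += (1 / 2 ** n) * math.log(p_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precisions)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the red mat".split()))
```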

Equivalently, in the log domain:

\[\log BLEU=\min\Big(1-\dfrac{r}{c},\,0\Big)+\sum_{n=1}^N w_n\log p_n\]

### The softmax problem again

At every time step, the softmax projection from the hidden state to the full vocabulary is very expensive to compute.
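To see where the time goes, here is a rough sketch of a single decoding step (my own illustration; the dimensions are made up but in a typical range). The output projection alone is a \(|V|\times d\) matrix-vector product, followed by exponentiating and normalizing all \(|V|\) logits:

```python
import numpy as np

d, V = 512, 50_000              # hidden size and target vocabulary size (made-up but typical)
rng = np.random.default_rng(0)
W_out = rng.standard_normal(size=(V, d), dtype=np.float32)
h_t = rng.standard_normal(size=d, dtype=np.float32)

# One decoding step: project to |V| logits, then normalize.
logits = W_out @ h_t            # ~ d * |V| multiply-adds, the dominant cost
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(f"parameters in the output layer: {W_out.size:,}")  # 25,600,000 here
```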

Possible solutions:

Not GPU-friendly! I'm not sure why; I feel like I need to learn more about GPUs. (My guess: these tricks replace the single large dense matrix multiply with many smaller, irregular, data-dependent computations, such as the tree traversals in a hierarchical softmax, and GPUs are far better at big dense matrix multiplies.)

Original paper: On Using Very Large Target Vocabulary for Neural Machine Translation

Word and character-based models

???

presentation