• A closer look at GRUs
• GRU vs. LSTM
• A closer look at LSTMs
• Tips for training RNNs
• Ensemble
• MT Evaluation
• The cost of using softmax to generate words
• presentation

### A closer look at GRUs

Adaptive shortcut connections: the update gate adaptively adds shortcut connections.

Adaptive pruning of unnecessary connections: the reset gate adaptively prunes connections that are not needed.

#### question1:

How do you select the readable subset based on the reset gate?

$r_t=\sigma(W^{(r)}x_t+U^{(r)}h_{t-1})\tag{reset gate}$

$\tilde h_t=\tanh(Wx_t+r_t\circ Uh_{t-1})\tag{new memory}$

The reset gate decides which parts of the hidden state to read when updating the hidden state. It calculates which parts to read from the current input and the previous hidden state. So it might say: "I want to pay a lot of attention to dimensions 7 and 52, and a little to the others." Those are the dimensions that get read and used to calculate the new candidate update, which is then mixed together with what is carried over from before.
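The reset-gate computation can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's code: the function name `gru_candidate`, the dimensions, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_candidate(x_t, h_prev, W_r, U_r, W, U):
    """Return (r_t, h_tilde) for one time step (hypothetical helper)."""
    # Reset gate: one value in (0, 1) per hidden dimension.
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)
    # Candidate: the contribution of the previous state is scaled
    # element-wise by r_t, so dimensions with r_t near 0 are barely read.
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))
    return r_t, h_tilde

# Toy sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
x_t = rng.normal(size=d_in)
h_prev = rng.normal(size=d_h)
W_r, W = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_in))
U_r, U = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))

r_t, h_tilde = gru_candidate(x_t, h_prev, W_r, U_r, W, U)
```

Because `r_t` multiplies `U @ h_prev` element-wise, the gate acts per hidden dimension, which is what "selecting a readable subset" means here.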

#### question2:

How do you select the writable subset based on the update gate?

$u_t=\sigma(W^{(u)}x_t+U^{(u)}h_{t-1})\tag{update gate}$

$h_t=u_t\circ \tilde h_t+(1-u_t)\circ h_{t-1} \tag{hidden state}$

Some of the hidden state we just carry on from the past; we only edit part of the register. Saying "part of the register" simplifies things a bit, because really you have a vector of real numbers, and the gate might say 70% updating on this dimension and 20% updating on that one. The values could be exactly one or zero, but normally they won't be. So the update gate chooses the writable subset, and that is the part that gets updated with the new candidate, which is then written back. Both concepts are in the gating: one gate selects what to read for the candidate update, and the other says which parts of the hidden state to overwrite.
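A small numeric sketch of the "writable subset" idea, assuming the convention in which $u_t$ weights the candidate and $1-u_t$ weights the old state. The helper name `gru_mix` and the toy values are made up for illustration.

```python
import numpy as np

def gru_mix(u_t, h_tilde, h_prev):
    """Blend candidate and previous state per dimension (hypothetical helper)."""
    return u_t * h_tilde + (1.0 - u_t) * h_prev

h_prev = np.array([1.0, 1.0, 1.0, 1.0])   # old register contents
h_tilde = np.array([0.0, 0.0, 0.0, 0.0])  # candidate update
u_t = np.array([0.0, 0.2, 0.7, 1.0])      # 0 = keep, 1 = overwrite

h_t = gru_mix(u_t, h_tilde, h_prev)
# h_t is approximately [1.0, 0.8, 0.3, 0.0]:
# dimension 0 is carried over unchanged, dimension 3 is fully overwritten,
# and dimensions 1 and 2 are 20% and 70% updated.
```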

#### question3:

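How do these gates avoid vanishing gradients?

$h_t=f(h_{t-1},x_t)=u_t \circ \tilde h_t + (1-u_t)\circ h_{t-1}$

The secret is this plus sign: $h_{t-1}$ enters $h_t$ additively, so wherever $u_t$ is close to 0 the old state is copied forward almost unchanged, and the gradient can pass along that path without going through a squashing nonlinearity at every step.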

$\dfrac{\partial E_t}{\partial W} = \sum_{k=1}^t\dfrac{\partial E_t}{\partial y_t}\dfrac{\partial y_t}{\partial h_t}\dfrac{\partial h_t}{\partial h_k}\dfrac{\partial h_k}{\partial W}$
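To make the role of the plus sign concrete, the factor $\partial h_t/\partial h_k$ in the sum can be expanded as below. This is a sketch that assumes the update convention $h_t=u_t\circ\tilde h_t+(1-u_t)\circ h_{t-1}$; $\operatorname{diag}(\cdot)$ is notation introduced here for the diagonal matrix built from a vector.

$\dfrac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\dfrac{\partial h_j}{\partial h_{j-1}}$

$\dfrac{\partial h_j}{\partial h_{j-1}}=\operatorname{diag}(1-u_j)+\operatorname{diag}(u_j)\dfrac{\partial \tilde h_j}{\partial h_{j-1}}+\operatorname{diag}(\tilde h_j-h_{j-1})\dfrac{\partial u_j}{\partial h_{j-1}}$

The first term exists only because the update is additive. On dimensions where $u_j$ stays near 0, each factor in the product is close to the identity, so the product over many steps does not have to shrink toward zero.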

#### question4:

How long does a GRU actually end up remembering for?

Answer: as an order of magnitude, the number to keep in mind is about 100 steps. GRUs don't remember forever, which is something people often get wrong.

#### question5:

Do GRUs train faster than LSTMs?

Answer: LSTMs have a slight edge on speed. No huge difference.

### GRU vs. LSTM

#### question6:

Why does the LSTM apply tanh in $h_t=o_t\circ \tanh(c_t)$?

TA Richard's explanation: the memory cell update $c_t = f_t\circ c_{t-1}+i_t\circ \tilde c_t$ is a linear layer, so adding the tanh nonlinearity makes the LSTM more powerful.

#### LSTM: an intuitive picture

$i_t = \sigma (W_i[h_{t-1},x_t]+b_i)\tag{input/update gate}$

$o_t = \sigma (W_o[h_{t-1},x_t]+b_o)\tag{output gate}$

$f_t = \sigma (W_f[h_{t-1},x_t]+b_f)\tag{forget gate}$

$c_t= f_t\circ c_{t-1}+i_t\circ \tilde c_t$
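The gate equations above can be assembled into one LSTM step, sketched here in NumPy. The notes do not write out the candidate $\tilde c_t$; the form $\tilde c_t=\tanh(W_c[h_{t-1},x_t]+b_c)$ used below is the usual one and is an assumption here, as are the helper names `lstm_step` and `init_params`, the sizes, and the random initialization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p is a dict of weights and biases (hypothetical layout)."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input/update gate
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate (assumed standard form)
    c_t = f_t * c_prev + i_t * c_tilde           # additive cell update
    h_t = o_t * np.tanh(c_t)                     # exposed hidden state
    return h_t, c_t

def init_params(d_in, d_h, seed=0):
    """Small random weights and zero biases, for illustration only."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "iofc":
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_in))
        p[f"b_{g}"] = np.zeros(d_h)
    return p

d_in, d_h = 4, 3
params = init_params(d_in, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in np.random.default_rng(1).normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, params)
```

When the forget gate saturates near 1 and the input gate near 0, `c_t` is essentially a copy of `c_prev`, which is the carry-through behavior the additive update allows.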

The additive cell update is the core of the LSTM, similar to the skip connections in ResNet.

### Tips for training RNNs

### Ensemble

### MT Evaluation

$p_n = \dfrac{\#\ \text{matched n-grams}}{\#\ \text{n-grams in candidate translation}}$

$p_n$ is the n-gram precision score, and $w_n=1/2^n$ is used as the corresponding weight.

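Brevity penalty: short candidate translations score high on precision too easily, so they need to be penalized. Here $c$ is the candidate length and $r$ the reference length.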

$BP=\begin{cases} 1, & \text{if } c > r\\ e^{1-r/c}, & \text{if } c \le r \end{cases}$

BLEU:

$BLEU=BP\cdot \exp\left(\sum_{n=1}^N w_n\log p_n\right)$
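The score can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: a single reference translation, pre-tokenized input, clipped n-gram counts for "matched" (as in standard BLEU), the weights $w_n=1/2^n$ from these notes, and a score of 0 whenever some $p_n$ is 0. The function names are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """p_n: matched n-grams / n-grams in the candidate, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    log_score = min(1.0 - r / c, 0.0)              # log of the brevity penalty
    for n in range(1, max_n + 1):
        p_n = modified_precision(candidate, reference, n)
        if p_n == 0.0:
            return 0.0                             # log(0) is undefined; score as 0
        log_score += (1.0 / 2 ** n) * math.log(p_n)
    return math.exp(log_score)

reference = "the cat is on the mat".split()
perfect = bleu(reference, reference)               # every p_n is 1 and BP is 1
truncated = bleu(reference[:5], reference)         # precise but short: only BP lowers it
```

The truncated candidate shows why the brevity penalty exists: all of its n-grams match, so without the penalty it would score as well as the full translation.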

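$\log BLEU=\min\left(1-\dfrac{r}{c},0\right)+\sum_{n=1}^N w_n\log p_n$

### The softmax problem again

At every time step, the softmax from the hidden state to the vocabulary is very computationally expensive. Not GPU-friendly! I'm not sure why; I need to learn more about how GPUs work.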
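A rough sketch of where the cost comes from, counting only the multiply-adds in the hidden-to-vocabulary projection. It does not address the GPU remark. The sizes (hidden size 512, vocabulary 50,000) and the helper names are hypothetical, picked only to show the scale.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def word_distribution(h_t, W_out, b_out):
    """Probability of every vocabulary word given the hidden state."""
    return softmax(W_out @ h_t + b_out)

def multiply_adds(rows, cols):
    """Multiply-adds for one matrix-vector product with a rows x cols matrix."""
    return rows * cols

hidden_size, vocab_size = 512, 50_000      # hypothetical sizes
output_cost = multiply_adds(vocab_size, hidden_size)      # 25,600,000 per time step
recurrent_cost = multiply_adds(hidden_size, hidden_size)  # 262,144 for one d x d product

# Tiny runnable instance with a reduced vocabulary.
rng = np.random.default_rng(0)
small_vocab = 1000
W_out = rng.normal(scale=0.01, size=(small_vocab, hidden_size))
probs = word_distribution(rng.normal(size=hidden_size), W_out, np.zeros(small_vocab))
```

With these made-up sizes the output projection needs roughly 100 times as many multiply-adds as a single hidden-to-hidden product, and it is repeated at every generated word.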

### presentation 