# 论文笔记 Deep Transition Architecture

## paper 1

paper: Self-Attention: A Better Building Block for Sentiment Analysis Neural Network Classifiers, 2018 WASSA@EMNLP

• Sinusoidal Position Encoding

• Learned Position Encoding

• Relative Position Representations

Sinusoidal encoding is the scheme used in the Transformer. Its advantage is that a position encoding can be computed even for test sentences longer than every sentence in the training set, because the encoding is a fixed function of position rather than a learned lookup table.
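As a quick reference, a minimal NumPy sketch of the sinusoidal encoding (the function name and the dimensions in the example are my own choices, not from the paper):

```python
import numpy as np

def sinusoidal_position_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Transformer sinusoidal position encoding (Vaswani et al., 2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_len)[:, None]                       # (max_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_position_encoding(max_len=50, d_model=8)
# The encoding for any position is computed on demand, so sequences longer
# than those seen in training still get valid encodings.
```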

Relative position representations performed best. The authors overlap with the Transformer's authors, so the paper is worth reading: Self-Attention with Relative Position Representations (Shaw et al., 2018).

For this method, the self-attention mechanism is modified to explicitly learn the relative positional information between every two sequence positions. As a result, the input sequence is modeled as a labeled, directed, fully-connected graph, where the labels represent positional information. A tunable parameter k is also introduced that limits the maximum distance considered between two sequence positions. [Shaw et al., 2018] hypothesized that this will allow the model to generalize to longer sequences at test time.
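The clipping by the tunable parameter $k$ can be sketched as follows (a minimal NumPy illustration; `relative_position_index` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def relative_position_index(length: int, k: int) -> np.ndarray:
    """Pairwise relative distances j - i, clipped to [-k, k], then shifted
    to [0, 2k] so they can index an embedding table of 2k + 1 vectors."""
    pos = np.arange(length)
    rel = pos[None, :] - pos[:, None]   # (length, length); entry (i, j) = j - i
    return np.clip(rel, -k, k) + k

idx = relative_position_index(length=5, k=2)
# idx[0] == [2, 3, 4, 4, 4]: all distances beyond k share one label,
# which is what lets the model generalize to longer sequences at test time.
```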

## paper 2

paper: DTMT: A Novel Deep Transition Architecture for Neural Machine Translation, 2019 AAAI

• encoder transition

• query transition

• decoder transition

### GRU

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = \tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))$$

reset gate:

$$r_t = \sigma(W_{xr}x_t+W_{hr}h_{t-1})$$

update gate:

$$z_t=\sigma(W_{xz}x_t+W_{hz}h_{t-1})$$
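The four equations above can be collected into one step function. A minimal NumPy sketch, with bias terms omitted (parameter names follow the equations; the example dimensions are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_xr, W_hr, W_xz, W_hz, W_xh, W_hh):
    """One GRU step following the equations above (biases omitted)."""
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev)             # reset gate
    z_t = sigmoid(W_xz @ x_t + W_hz @ h_prev)             # update gate
    h_cand = np.tanh(W_xh @ x_t + r_t * (W_hh @ h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand            # interpolation

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
# W_xr, W_hr, W_xz, W_hz, W_xh, W_hh in order:
Ws = [rng.normal(size=s) for s in [(d_h, d_x), (d_h, d_h)] * 3]
h_t = gru_cell(rng.normal(size=d_x), np.zeros(d_h), *Ws)
```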

### T-GRU (transition GRU)

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = \tanh(r_t\odot (W_{hh}h_{t-1}))$$

reset gate:

$$r_t = \sigma(W_{hr}h_{t-1})$$

update gate:

$$z_t=\sigma(W_{hz}h_{t-1})$$
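The same sketch for the T-GRU; the only change from the GRU step is that every term involving $x_t$ is dropped, so the cell maps a state to a state and can be stacked:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def t_gru_cell(h_prev, W_hr, W_hz, W_hh):
    """One T-GRU step: a GRU with no input x_t, so all gates and the
    candidate depend only on the previous state (biases omitted)."""
    r_t = sigmoid(W_hr @ h_prev)               # reset gate
    z_t = sigmoid(W_hz @ h_prev)               # update gate
    h_cand = np.tanh(r_t * (W_hh @ h_prev))    # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand

rng = np.random.default_rng(0)
d_h = 3
W_hr, W_hz, W_hh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
h1 = t_gru_cell(rng.normal(size=d_h), W_hr, W_hz, W_hh)
h2 = t_gru_cell(h1, W_hr, W_hz, W_hh)   # state-to-state, so transitions stack
```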

### L-GRU (Linear Transformation enhanced GRU)

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = \tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))+ l_t\odot H(x_t)$$

$$H(x_t)=W_xx_t$$

$$l_t=\sigma(W_{xl}x_t+W_{hl}h_{t-1})$$

Substituting $H(x_t)=W_x x_t$ into the candidate state gives

$$\tilde h_t = \tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))+ l_t\odot (W_x x_t)$$

Note that the gated linear term $l_t\odot (W_x x_t)$ stays outside the $\tanh$, so the candidate state keeps a purely linear path from the input.
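A corresponding sketch of the L-GRU step (biases omitted; parameter names follow the equations above, the rest is my own choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l_gru_cell(x_t, h_prev, W_xr, W_hr, W_xz, W_hz, W_xh, W_hh,
               W_x, W_xl, W_hl):
    """One L-GRU step: a GRU plus a gated linear path l_t * (W_x x_t)
    added outside the tanh (biases omitted)."""
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev)   # reset gate
    z_t = sigmoid(W_xz @ x_t + W_hz @ h_prev)   # update gate
    l_t = sigmoid(W_xl @ x_t + W_hl @ h_prev)   # linear transformation gate
    h_cand = np.tanh(W_xh @ x_t + r_t * (W_hh @ h_prev)) + l_t * (W_x @ x_t)
    return (1.0 - z_t) * h_prev + z_t * h_cand

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
# W_xr, W_hr, W_xz, W_hz, W_xh, W_hh, W_x, W_xl, W_hl in order:
shapes = [(d_h, d_x), (d_h, d_h)] * 3 + [(d_h, d_x), (d_h, d_x), (d_h, d_h)]
Ws = [rng.normal(size=s) for s in shapes]
h_t = l_gru_cell(rng.normal(size=d_x), np.zeros(d_h), *Ws)
```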

### DTMT

#### Encoder

$L_s$ denotes the depth of the encoder transition; $j$ denotes the current time step.

$$\overrightarrow h_{j,0}=\text{L-GRU}(x_j, \overrightarrow h_{j-1,L_s})$$

$$\overrightarrow h_{j,k}=\text{T-GRU}(\overrightarrow h_{j, k-1}),\text{ for } 1\le k\le L_s$$
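The recurrence can be wired up as below. The cells here are trivial stand-ins (`np.tanh`) just to show the data flow between time steps and transition depths; the real update rules are the L-GRU / T-GRU equations, and $L_s = 4$ is an arbitrary choice for the example:

```python
import numpy as np

def deep_transition_encoder(xs, l_gru, t_grus):
    """Run the deep-transition recurrence over a sequence:
    h_{j,0} = L-GRU(x_j, h_{j-1,L_s});  h_{j,k} = T-GRU(h_{j,k-1})."""
    h = np.zeros_like(xs[0])           # initial top-level state h_{0,L_s}
    states = []
    for x in xs:                       # j = 1 .. len(xs)
        h = l_gru(x, h)                # transition depth 0
        for cell in t_grus:            # transition depths 1 .. L_s
            h = cell(h)
        states.append(h)
    return np.stack(states)

# Stand-in cells (real cells are the L-GRU / T-GRU steps above):
l_gru = lambda x, h: np.tanh(x + h)
t_grus = [np.tanh] * 4                 # L_s = 4, chosen arbitrarily here
xs = np.random.default_rng(0).normal(size=(6, 3))   # 6 tokens, d = 3
states = deep_transition_encoder(xs, l_gru, t_grus)
```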

#### Decoder

• query transition: depth $L_q$

• decoder transition: depth $L_d$
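Based on the bullets above, one plausible wiring of a decoder step is: query transition (depth $L_q$) builds the attention query, attention over the encoder states yields a context vector, and the decoder transition (depth $L_d$) produces the new state. The attention form, the way inputs are fused, and the stand-in cells below are all my assumptions, sketched only to show the data flow:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, enc_states,
                 q_l_gru, q_t_grus, c_l_gru, d_t_grus):
    """One decoder step: query transition -> attention -> decoder
    transition.  Cell internals are stand-ins, not the paper's cells."""
    q = q_l_gru(y_prev, s_prev)            # query transition, depth 0
    for cell in q_t_grus:                  # query transition, depths 1 .. L_q
        q = cell(q)
    ctx = softmax(enc_states @ q) @ enc_states   # illustrative dot-product attention
    s = c_l_gru(ctx, q)                    # decoder transition, depth 0
    for cell in d_t_grus:                  # decoder transition, depths 1 .. L_d
        s = cell(s)
    return s

# Stand-in cells:
fuse = lambda a, b: np.tanh(a + b)
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 3))       # 6 source positions, d = 3
s_t = decoder_step(rng.normal(size=3), np.zeros(3), enc_states,
                   fuse, [np.tanh] * 2, fuse, [np.tanh] * 2)
```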