# Paper Notes: CNNs and Natural Language Processing

• Embedding: use pre-trained Chinese word vectors.

• Encoder: encode the passage, query, and alternatives with a Bi-GRU.

• Attention: compute a similarity matrix with a trilinear function and apply a mask, then use BiDAF-style bi-directional attention flow to obtain the attended passage.

• Contextual: encode the attended passage with a Bi-GRU to obtain the fusion representation.

• Match: use attention pooling to reduce the fusion and enc_answer representations to single vectors, then compute cosine similarity to pick the most similar answer.

• One could pre-train word vectors on the training set with ELMo or word2vec.

• The attention layer could use richer formulations, as many papers do, and could even incorporate hand-crafted features, e.g. those mentioned in 苏剑林's blog.

• The match part also matters: could attention pooling be replaced by something better?
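A minimal NumPy sketch of the match step (attention pooling to a single vector, then cosine matching); the dimensions and the scoring vector `w` are made up for illustration, not taken from the model:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_pool(H, w):
    # H: [n, d] hidden states; w: [d] learned scoring vector (assumed here)
    a = softmax(H @ w)            # [n] attention weights over positions
    return a @ H                  # [d] weighted sum -> a single vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
w = rng.normal(size=8)
fusion = attention_pool(rng.normal(size=(30, 8)), w)       # pooled passage
answers = [attention_pool(rng.normal(size=(5, 8)), w) for _ in range(3)]
best = int(np.argmax([cosine(fusion, a) for a in answers]))
```

Swapping `attention_pool` for a different pooling scheme is exactly the open question raised above.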

# ConvS2S

This paper from Facebook changes that traditional thinking: CNNs are used not only to encode global information but also as the decoder.

## Motivation

Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared to a linear number O(n) for recurrent neural networks.

Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of nonlinearities applied to the inputs also eases learning.

## Model Architecture

• Position embedding

• Convolution block structure

• Multi-step attention

### Convolution Blocks

$$h_l=(XW+b)\otimes \sigma(XV+c)$$

The output of each layer is a linear projection $X*W+b$ modulated by the gates $\sigma(X*V+c)$. Similar to LSTMs, these gates multiply each element of the matrix $X*W+b$ and control the information passed on in the hierarchy.

For comparison, the gated tanh unit (Oord et al., 2016) puts a tanh on the signal path; ConvS2S follows Dauphin et al. (2017) and keeps the path purely linear as above:

$$h_i^l=\tanh(XW+b)\otimes \sigma(XV+c)$$
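Both gating variants are one line each in NumPy (a toy sketch: the matrix shapes are illustrative, and the convolution is abstracted into a dense projection):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(X, W, b, V, c):
    # gated linear unit: (XW + b) gated elementwise by sigmoid(XV + c)
    return (X @ W + b) * sigmoid(X @ V + c)

def gtu(X, W, b, V, c):
    # gated tanh unit: same gate, tanh on the signal path
    return np.tanh(X @ W + b) * sigmoid(X @ V + c)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))              # [seq_len, d_in]
W, V = rng.normal(size=(2, 4, 6))         # two [d_in, d_out] projections
b, c = np.zeros(6), np.zeros(6)
h = glu(X, W, b, V, c)                    # [seq_len, d_out]
```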

Residual connections: to build deeper convolutional networks, the authors add residual connections.

$$h_i^l=v(W^l[h_{i-k/2}^{l-1},\dots,h_{i+k/2}^{l-1}]+b_w^l)+h_i^{l-1}$$

For instance, stacking 6 blocks with k = 5 results in an input field of 25 elements, i.e. each output depends on 25 inputs. Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed.
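The arithmetic behind the 25-element field can be checked directly (stride 1, no dilation):

```python
def receptive_field(num_layers, k):
    # stride-1, non-dilated convolutions: each layer adds (k - 1) inputs
    return 1 + num_layers * (k - 1)

# 6 stacked blocks with kernel width k = 5 -> a field of 25 input elements
print(receptive_field(6, 5))
```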

• ConvS2S uses 1D convolutions: the kernel slides only along the time dimension, with stride fixed at 1. This is because language lacks the scale-invariance of images: uniformly downsampling an image does not change its features, whereas taking every other word from a sentence changes its meaning a great deal.

• In images, a convolutional layer usually has multiple filters to capture different patterns, but in ConvS2S each layer has only one filter. A sentence enters the filter as a tensor of shape [1, n, d], where n is the sentence length; the filter convolves along the n direction, and d, the word-vector dimension, can be viewed as channels, analogous to the RGB channels of a color image.

Unlike the usual practice in vision, Facebook's design places only a single filter in each layer. Two possible reasons: first, to simplify the model and speed up convergence; second, they may believe the patterns in a sentence are much simpler than those in an image, so one filter per layer, stacked layer by layer, can capture all the patterns. The former is the more likely reason: the success of multi-head attention in the Transformer suggests that a sentence does contain multiple patterns.

For encoder networks we ensure that the output of the convolutional layers matches the input length by padding the input at each layer. However, for decoder networks we have to take care that no future information is available to the decoder (Oord et al., 2016a). Specifically, we pad the input by k − 1 elements on both the left and right side by zero vectors, and then remove k elements from the end of the convolution output.
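Padding by k − 1 on both sides and then trimming the surplus outputs at the end is equivalent to padding only on the left, which keeps every output position from seeing future inputs. A single-channel sketch of that causal form (the kernel values are arbitrary):

```python
import numpy as np

def causal_conv1d(x, k):
    # decoder-side convolution: the output at t sees only inputs <= t
    kw = len(k)
    xp = np.concatenate([np.zeros(kw - 1), x])   # left-pad with k-1 zeros
    return np.array([xp[t:t + kw] @ k for t in range(len(x))])

x = np.arange(1.0, 6.0)              # [1, 2, 3, 4, 5]
y = causal_conv1d(x, np.ones(3))     # y[t] = x[t-2] + x[t-1] + x[t]
```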

### Multi-step Attention

$$d_i^l=W_d^lh_i^l+b_d^l+g_i$$

$$a_{ij}^l=\dfrac{\exp(d_i^l\cdot z_j^u)}{\sum_{t=1}^m\exp(d_i^l\cdot z_t^u)}$$

$$c_i^l=\sum_{j=1}^ma_{ij}^l(z_j^u+e_j)$$

• Training uses teacher forcing: the kernel $W_d^l$ convolves over the target-sentence states $h^l$ to produce $(W_d^lh_i^l + b_d^l)$, analogous to the hidden state of an RNN decoder; adding the embedding of the previous target word, $g_i$, gives $d_i^l$.

• This interacts with the source-sentence representation from the encoder, and a softmax yields the attention weights $a_{ij}^l$.

• Computing the attention vector differs from an RNN decoder: here the input element embedding $e_j$ is added as well.
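One attention "hop" for a single decoder position $i$, following the three equations above (a NumPy sketch; all dimensions are invented):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_hop(h_i, g_i, Wd, bd, z, e):
    # d_i = Wd h_i + bd + g_i;  a_ij = softmax(d_i . z_j);  c_i = sum_j a_ij (z_j + e_j)
    d_i = Wd @ h_i + bd + g_i
    a = softmax(z @ d_i)             # [m] weights over source positions
    return a @ (z + e)               # [d] conditional input c_i

rng = np.random.default_rng(1)
m, dim = 7, 4                        # source length and hidden size (made up)
c_i = attention_hop(rng.normal(size=dim), rng.normal(size=dim),
                    rng.normal(size=(dim, dim)), rng.normal(size=dim),
                    rng.normal(size=(m, dim)), rng.normal(size=(m, dim)))
```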

We found adding $e_j$ to be beneficial and it resembles key-value memory networks where the keys are the $z_j^u$ and the values are the $z_j^u+e_j$ (Miller et al., 2016). Encoder outputs $z_j^u$ represent potentially large input contexts and $e_j$ provides point information about a specific input element that is useful when making a prediction. Once $c_i^l$ has been computed, it is simply added to the output of the corresponding decoder layer $h_i^l$.

$z_j^u$ carries richer context, while $e_j$ points more concretely at the part of the input useful for the prediction. In the end, you have to try it to know.

This can be seen as attention with multiple ’hops’ (Sukhbaatar et al., 2015) compared to single step attention (Bahdanau et al., 2014; Luong et al., 2015; Zhou et al., 2016; Wu et al., 2016). In particular, the attention of the first layer determines a useful source context which is then fed to the second layer that takes this information into account when computing attention etc. The decoder also has immediate access to the attention history of the k − 1 previous time steps because the conditional inputs $c^{l-1}_{i-k},\dots,c^{l-1}_i$ are part of $h^{l-1}_{i-k},\dots,h^{l-1}_i$ which are input to $h^l_i$. This makes it easier for the model to take into account which previous inputs have been attended to already compared to recurrent nets where this information is in the recurrent state and needs to survive several non-linearities. Overall, our attention mechanism considers which words we previously attended to (Yang et al., 2016) and performs multiple attention ’hops’ per time step. In Appendix §C, we plot attention scores for a deep decoder and show that at different layers, different portions of the source are attended to.

# FAST READING COMPREHENSION WITH CONVNETS

Gated Linear Dilated Residual Network (GLDR): a combination of residual networks (He et al., 2016), dilated convolutions (Yu & Koltun, 2016), and gated linear units (Dauphin et al., 2017).

## Text Understanding with Dilated Convolution

kernel: $k=[k_{-l},k_{-l+1},\dots,k_l]$, size $2l+1$

input: $x=[x_1,x_2,\dots,x_n]$

dilation: $d$

$$(k*x)_t=\sum_{i=-l}^{l}k_i\cdot x_{t+d\cdot i}$$
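The sum above, written out directly (single channel, zeros assumed outside the sequence):

```python
import numpy as np

def dilated_conv(x, k, d):
    # (k*x)_t = sum_{i=-l..l} k_i * x_{t + d*i}, zero-padded at the boundaries
    l = (len(k) - 1) // 2
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(-l, l + 1):
            j = t + d * i
            if 0 <= j < len(x):
                out[t] += k[i + l] * x[j]
    return out

x = np.arange(1.0, 6.0)                  # [1, 2, 3, 4, 5]
y = dilated_conv(x, np.ones(3), d=2)     # y[2] = x[0] + x[2] + x[4]
```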

Repeated dilated convolution (Yu & Koltun, 2016) increases the receptive region of ConvNet outputs exponentially with respect to the network depth, which results in drastically shortened computation paths.
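The exponential growth is simple arithmetic: each dilated layer widens the field by $(k-1)\cdot d$, so with kernel width 3 and dilations doubling per layer, $L$ layers see $2^{L+1}-1$ inputs:

```python
def dilated_receptive_field(k, dilations):
    # each dilated layer widens the receptive field by (k - 1) * d
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

for L in range(1, 6):
    print(L, dilated_receptive_field(3, [2 ** i for i in range(L)]))
```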

## Model Architecture

The receptive field of this convolutional network grows exponentially with depth and soon encompasses a long sequence, essentially enabling it to capture similar long-term dependencies as an actual sequential model.

Convolutional BiDAF. In our convolutional version of BiDAF, we replaced all bidirectional LSTMs with GLDRs. We have two 5-layer GLDRs in the contextual layer whose weights are un-tied. In the modeling layer, a 17-layer GLDR with dilation 1, 2, 4, 8, 16 in the first 5 residual blocks is used, which results in a reception region of 65 words. A 3-layer GLDR replaces the bidirectional LSTM in the output layer. For simplicity, we use same-padding and kernel size 3 for all convolutions unless specified. The hidden size of all GLDRs is 100 which is the same as the LSTMs in BiDAF.
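A much-simplified sketch of one GLDR residual block, combining the three ingredients (dilated convolution, GLU gate, residual connection); using a single gated convolution per block is a simplification of the paper's blocks, and all shapes are invented:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_conv3(x, W, d):
    # x: [n, dim] -> [n, dim]; kernel width 3, dilation d, same-padding
    pad = np.zeros((d, x.shape[1]))
    xp = np.vstack([pad, x, pad])
    return np.stack([W[0] @ xp[t] + W[1] @ xp[t + d] + W[2] @ xp[t + 2 * d]
                     for t in range(len(x))])

def gldr_block(x, Wa, Wb, d):
    # residual block: dilated conv with GLU gating, plus identity skip
    return x + dilated_conv3(x, Wa, d) * sigmoid(dilated_conv3(x, Wb, d))

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))
Wa, Wb = rng.normal(size=(2, 3, 8, 8)) * 0.1   # two [3, dim, dim] kernels
out = gldr_block(gldr_block(x, Wa, Wb, d=1), Wa, Wb, d=2)
```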

Xie Pan

2018-10-15

2021-06-29