Paper Notes - QANet

paper: QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension

## Motivation

Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models.
The encoder is built entirely from convolution and self-attention; with no RNN, it is simply much faster.

The key motivation behind the design of our model is the following: convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.
The main novelty of this paper: CNNs capture the local structure of the text, while self-attention learns the interaction between every pair of words across the whole sequence, so the representation couples in contextual information. Compared with RNNs, attention handles long-range dependencies effectively, at the cost of losing word-order information. At the end of the day, it is just another way of building a contextualized encoder.

we propose a complementary data augmentation technique to enhance the training data. This technique paraphrases the examples by translating the original sentences from English to another language and then back to English, which not only enhances the number of training instances but also diversifies the phrasing.
A data-augmentation technique: first translate the original sentences into another language, then translate them back into English. This effectively increases the number of training examples and also diversifies the phrasing.

## Model

The model consists of five parts:
- an embedding layer
- an embedding encoder layer
- a context-query attention layer
- a model encoder layer
- an output layer.

the combination of convolutions and self-attention is novel, and is significantly better than self-attention alone and gives 2.7 F1 gain in our experiments. The use of convolutions also allows us to take advantage of common regularization methods in ConvNets such as stochastic depth (layer dropout) (Huang et al., 2016), which gives an additional gain of 0.2 F1 in our experiments.
Combining CNNs with self-attention works better than self-attention alone. Using CNNs also makes it possible to apply common ConvNet regularization such as stochastic depth (layer dropout), which brings a small additional gain.

## Input embedding layer

obtain the embedding of each word w by concatenating its word embedding and character embedding.

The word representation is the concatenation of a word embedding and a character embedding. The word embeddings are pre-trained GloVe vectors and are kept fixed (not trainable); only the OOV (out of vocabulary) embedding is trainable, and it is used to map every word not in the vocabulary.

Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters. The length of each word is either truncated or padded to 16. We take maximum value of each row of this matrix to get a fixed-size vector representation of each word.
Character-embedding handling: each character embedding is trainable, with dimension 200. Every word is then truncated or padded to 16 characters so the character matrix of each word has the same size, and taking the maximum over the character axis gives a fixed-size vector per word.

So the final dimension of each word vector is \(300+200=500\).
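
A minimal PyTorch sketch of this layer, assuming the GloVe weights are already loaded into a tensor (the class and variable names are mine, not from the paper's implementation):

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Word embedding (fixed GloVe, 300-d) + char embedding (trainable, 200-d, max-pooled)."""
    def __init__(self, glove_weights, num_chars, char_dim=200):
        super().__init__()
        # GloVe vectors are frozen; the paper keeps only the OOV vector trainable,
        # which is omitted here for brevity.
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)

    def forward(self, word_ids, char_ids):
        # word_ids: [batch, seq_len]; char_ids: [batch, seq_len, 16] (truncated/padded)
        w = self.word_emb(word_ids)       # [batch, seq_len, 300]
        c = self.char_emb(char_ids)       # [batch, seq_len, 16, 200]
        c, _ = c.max(dim=2)               # max over the 16 characters -> [batch, seq_len, 200]
        return torch.cat([w, c], dim=-1)  # [batch, seq_len, 500]
```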

## Embedding encoder layer

The encoder layer is a stack of the following basic building block: [convolution-layer × # + self-attention-layer + feed-forward-layer]

Where:
- convolution: depthwise separable convolutions are used instead of traditional convolutions, because the authors found "it is memory efficient and has better generalization" (to really understand why, one has to read the original paper). The kernel size is 7, the number of filters is d = 128; a minimal sketch is given below, after the CNN discussion.

Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block, shown lower-right in Figure 1. For an input x and a given operation f, the output is f(layernorm(x))+x.

Layer normalization is applied before each of the conv / self-attention / ffn sublayers, and each sublayer is wrapped in a residual connection.
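
A short sketch of this pre-norm residual wrapper, f(layernorm(x)) + x (names are mine):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sublayer f (conv / self-attention / ffn) as f(layernorm(x)) + x."""
    def __init__(self, sublayer, d_model=128):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        return x + self.sublayer(self.norm(x))
```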

Why use a CNN?

To capture local information: k-gram features.

The figure should make it clearer how the CNN works here. In the figure above, the kernel sizes are [2, embed_size], [3, embed_size] and [3, embed_size], with "SAME" padding. Each kernel produces a [1, sequence_len] output, and the outputs are concatenated to give the final [filters_num, sequence_len] result.

In QANet's implementation, every kernel_size is set to 7 and num_filters = 128.
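
A minimal sketch of a depthwise separable 1-D convolution with these hyper-parameters (a generic implementation, not the authors' code; the ReLU activation is my assumption):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per channel) followed by a pointwise 1x1 conv."""
    def __init__(self, channels=128, kernel_size=7):
        super().__init__()
        # groups=channels makes the first conv depthwise; the padding keeps seq_len ("SAME").
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: [batch, seq_len, channels]; Conv1d expects [batch, channels, seq_len]
        x = x.transpose(1, 2)
        x = self.act(self.pointwise(self.depthwise(x)))
        return x.transpose(1, 2)
```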

Why use self-attention?

To capture global information.

The approach shown in the figure above is clearly not great: it is expensive and does not work well. Hence self-attention.

Take inner products between the vectors of the matrix and run a softmax to get the weight of every other word with respect to the word "The" (the weights are proportional to similarity; equating similarity with match may look questionable, but it works very well in practice, probably because of how the word vectors are trained).

Then multiply the weights \([w_1,w_2,w_3,w_4,w_5]\) by the corresponding word vectors and sum them to obtain a contextualized "The" that carries the context information.


Moreover, this can be parallelized, which greatly speeds up training.
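
A tiny numerical sketch of this weighted-sum view (a single head with no learned Q/K/V projections; QANet actually uses multi-head attention with learned projections, so this only illustrates the idea):

```python
import torch
import torch.nn.functional as F

d = 4                                # toy embedding size
x = torch.randn(5, d)                # 5 word vectors, the first one being "The"

scores = x @ x.T / d ** 0.5          # pairwise inner products, [5, 5]
weights = F.softmax(scores, dim=-1)  # row 0 = weights [w1, ..., w5] w.r.t. "The"

contextualized = weights @ x         # weighted sums over the word vectors, [5, d]
print(contextualized[0])             # the contextualized "The"
```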

## Context-Query Attention Layer

This is the same as in BiDAF. Let's walk through the formulas once more.

context: \(C=\{c_1, c_2,...,c_n\}\)
query: \(Q=\{q_1,q_2,...,q_m\}\).

So after embedding:
- context: [batch, context_n, embed_size]
- query: [batch, query_m, embed_size]

Multiplying them gives the similarity matrix \(S\in R^{n\times m}\):
sim_matrix: [batch, context_n, query_m]

The similarity function used here is the trilinear function (Seo et al., 2016). \(f(q,c)=W_0[q,c,q\circ c]\).
The similarity matrix does not have to come from a plain matrix product; a feed-forward scoring function (the trilinear function) is used instead. After all, similarity does not necessarily equal match.
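
A sketch of the trilinear function computed for all context/query pairs at once (class, parameter names, and the initialization are mine):

```python
import torch
import torch.nn as nn

class TrilinearSimilarity(nn.Module):
    """S[b, i, j] = w_c·c_i + w_q·q_j + w_cq·(c_i ∘ q_j), i.e. f(q, c) = W0[q, c, q∘c]."""
    def __init__(self, d=128):
        super().__init__()
        self.w_c = nn.Linear(d, 1, bias=False)
        self.w_q = nn.Linear(d, 1, bias=False)
        self.w_cq = nn.Parameter(torch.randn(1, 1, d) * d ** -0.5)

    def forward(self, C, Q):
        # C: [batch, n, d] context, Q: [batch, m, d] query
        s_c = self.w_c(C)                                      # [batch, n, 1]
        s_q = self.w_q(Q).transpose(1, 2)                      # [batch, 1, m]
        s_cq = torch.matmul(C * self.w_cq, Q.transpose(1, 2))  # [batch, n, m]
        return s_c + s_q + s_cq                                # broadcasts to [batch, n, m]
```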

Context-to-query

Apply a softmax over each row of S to get the weight matrix \(\tilde S\in R^{n\times m}\), shape = [batch, context_n, query_m].

Then multiply it with the query \(Q^T\) ([batch, query_m, embed_size]) to get the context enriched with query information:
\(A = \tilde SQ^T\), shape = [batch, context_n, embed_size]

Query-to-context

Empirically, we find that, the DCN attention can provide a little benefit over simply applying context-to-query attention, so we adopt this strategy.
Instead of the BiDAF formulation, the DCN-style attention is used here, which reuses \(\tilde S\).

Apply a softmax over each column of S to get the matrix \(\overline S\), shape = [batch, context_n, query_m].

Then matrix multiplication gives \(B=\tilde S \overline S^T C^T\):
\(\tilde S\).shape = [batch, context_n, query_m]
\(\overline S^T\).shape = [batch, query_m, context_n]
\(C^T\).shape = [batch, context_n, embed_size]
So finally B.shape = [batch, context_n, embed_size]
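
A shape-level sketch of both attentions (here S just stands in for the trilinear similarity computed above):

```python
import torch
import torch.nn.functional as F

batch, n, m, d = 2, 50, 10, 128
C, Q = torch.randn(batch, n, d), torch.randn(batch, m, d)  # context, query
S = torch.randn(batch, n, m)                               # stand-in for the trilinear similarity

S_row = F.softmax(S, dim=2)   # softmax over each row    -> \tilde S
S_col = F.softmax(S, dim=1)   # softmax over each column -> \overline S

A = torch.bmm(S_row, Q)                                    # context-to-query, [batch, n, d]
B = torch.bmm(torch.bmm(S_row, S_col.transpose(1, 2)), C)  # query-to-context (DCN), [batch, n, d]
```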

## Model Encoder Layer

As in BiDAF, the input is \([c,a,c\circ a,c\circ b]\), where a and b are row vectors of the attention matrices A and B. The difference is that instead of a bi-LSTM, blocks similar to the embedding encoder block are used: [conv + self-attention + ffn], with 2 convolution layers per block and 7 blocks in total.
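
A short sketch of how this input is assembled (C, A, B are the [batch, n, d] tensors from the attention layer above):

```python
import torch

n, d = 50, 128
C, A, B = (torch.randn(2, n, d) for _ in range(3))  # toy stand-ins for context and attentions

x = torch.cat([C, A, C * A, C * B], dim=-1)  # [c, a, c∘a, c∘b] -> [batch, n, 4d]
# x is then mapped back to the model dimension d = 128 before the stacked encoder blocks.
```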

## Output layer

\[p^1=softmax(W_1[M_0;M_1]),\quad p^2=softmax(W_2[M_0;M_2])\] where \(W_1, W_2\) are trainable parameter matrices and \(M_0, M_1, M_2\) are the outputs of the three stacked model encoders, as shown in the figure.

Then the cross-entropy loss is computed:
\[L(\theta)=-\dfrac{1}{N}\sum_i^N\left[\log\left(p^1_{y_i^1}\right)+\log\left(p^2_{y_i^2}\right)\right]\]
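
A sketch of the span prediction and loss with toy tensors (W1 and W2 realized as linear layers over the concatenated features; F.cross_entropy applies the softmax internally, matching the p^1 / p^2 formulas above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, n, d = 2, 50, 128
M0, M1, M2 = (torch.randn(batch, n, d) for _ in range(3))  # three model-encoder outputs

W1 = nn.Linear(2 * d, 1, bias=False)  # scores the start position
W2 = nn.Linear(2 * d, 1, bias=False)  # scores the end position

logits1 = W1(torch.cat([M0, M1], dim=-1)).squeeze(-1)  # [batch, n]
logits2 = W2(torch.cat([M0, M2], dim=-1)).squeeze(-1)  # [batch, n]

y1 = torch.tensor([3, 7])             # gold start positions
y2 = torch.tensor([5, 9])             # gold end positions
loss = F.cross_entropy(logits1, y1) + F.cross_entropy(logits2, y2)
```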

## What makes QANet good?

- Separable convolutions not only have fewer parameters and run faster, they also work better: replacing them with traditional convolutions drops F1 by 0.7.
- Removing the CNNs drops F1 by 2.7.
- Removing self-attention drops F1 by 1.3.


Other components used throughout the model and training:

- layer normalization
- residual connections
- L2 regularization
