# Paper Notes: QANet

paper: [QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension](https://arxiv.org/abs/1804.09541)

## Motivation

Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models.

The encoder consists solely of convolutions and self-attention; with the RNN gone, it is simply much faster.

The key motivation behind the design of our model is the following: convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.

we propose a complementary data augmentation technique to enhance the training data. This technique paraphrases the examples by translating the original sentences from English to another language and then back to English, which not only enhances the number of training instances but also diversifies the phrasing.

## Model

• an embedding layer

• an embedding encoder layer

• a context-query attention layer

• a model encoder layer

• an output layer.

the combination of convolutions and self-attention is novel, and is significantly better than self-attention alone and gives 2.7 F1 gain in our experiments. The use of convolutions also allows us to take advantage of common regularization methods in ConvNets such as stochastic depth (layer dropout) (Huang et al., 2016), which gives an additional gain of 0.2 F1 in our experiments.

Combining CNNs with self-attention works better than self-attention alone. Using CNNs also makes it possible to apply common ConvNet regularization methods such as stochastic depth (layer dropout), which brings a small additional gain.

### Input embedding layer

obtain the embedding of each word w by concatenating its word embedding and character embedding.

Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters. The length of each word is either truncated or padded to 16. We take the maximum value of each row of this matrix, i.e., a max over the 16 character positions, to get a fixed-size p2-dimensional vector representation of each word.
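A minimal sketch of this embedding layer, assuming PyTorch; the class name and vocabulary sizes are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class CharWordEmbedding(nn.Module):
    def __init__(self, word_vocab=10000, char_vocab=100, p1=300, p2=200):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, p1)  # the paper uses fixed pretrained GloVe here
        self.char_emb = nn.Embedding(char_vocab, p2)  # trainable character vectors

    def forward(self, word_ids, char_ids):
        # word_ids: [batch, seq_len]; char_ids: [batch, seq_len, word_len]
        w = self.word_emb(word_ids)        # [batch, seq_len, p1]
        c = self.char_emb(char_ids)        # [batch, seq_len, word_len, p2]
        c, _ = c.max(dim=2)                # max over the 16 character positions
        return torch.cat([w, c], dim=-1)   # [batch, seq_len, p1 + p2]

emb = CharWordEmbedding()
out = emb(torch.randint(0, 10000, (2, 30)), torch.randint(0, 100, (2, 30, 16)))
print(out.shape)  # torch.Size([2, 30, 500])
```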

### Embedding encoding layer

The encoder layer is a stack of the following basic building block: [convolution-layer × # + self-attention-layer + feed-forward-layer]

• convolution: depthwise separable convolutions are used instead of traditional convolutions, because the authors find them memory-efficient and better at generalization (see the original paper for the full argument, and the sketch below for the mechanics). The kernel size is 7 and the number of filters is d = 128.
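A minimal sketch of a depthwise separable 1-D convolution with these hyperparameters, assuming PyTorch (the class name is mine). The depthwise step applies one kernel per channel and the pointwise 1×1 step mixes channels, costing roughly d·k + d² weights instead of d²·k for a standard convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, d=128, kernel_size=7):
        super().__init__()
        # depthwise: groups=d gives one filter per channel
        self.depthwise = nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2, groups=d)
        # pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv1d(d, d, kernel_size=1)

    def forward(self, x):
        # x: [batch, d, seq_len]
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 128, 50)
print(DepthwiseSeparableConv1d()(x).shape)  # torch.Size([2, 128, 50])
```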

Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block, shown lower-right in Figure 1. For an input x and a given operation f, the output is f(layernorm(x))+x.
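A minimal sketch of this residual wrapper, assuming PyTorch; `ResidualBlock` is an illustrative name, and `f` stands for any of the three operations.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d, f):
        super().__init__()
        self.norm = nn.LayerNorm(d)  # layernorm is applied before f (pre-norm)
        self.f = f                   # conv, self-attention, or feed-forward

    def forward(self, x):
        # x: [batch, seq_len, d]; output is f(layernorm(x)) + x
        return self.f(self.norm(x)) + x

block = ResidualBlock(128, nn.Linear(128, 128))  # a stand-in f for illustration
print(block(torch.randn(2, 50, 128)).shape)      # torch.Size([2, 50, 128])
```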

### Context-Query Attention Layer

context: $C=\{c_1, c_2, \dots, c_n\}$

query: $Q=\{q_1, q_2, \dots, q_m\}$

• context: [batch, context_n, embed_size]

• query: [batch, query_m, embed_size]

sim_matrix $S$: [batch, context_n, query_m]

The similarity function used here is the trilinear function (Seo et al., 2016): $f(q,c)=W_0[q, c, q\circ c]$, where $W_0$ is a trainable weight vector and $\circ$ denotes element-wise multiplication.
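A minimal sketch of the trilinear similarity, assuming PyTorch. Since $W_0[q, c, q\circ c]$ decomposes into three dot products, $S$ can be computed without materializing the [batch, n, m, 3d] concatenation; the function and weight names are illustrative.

```python
import torch

def trilinear_similarity(C, Q, w_c, w_q, w_qc):
    # C: [batch, n, d]; Q: [batch, m, d]; w_c, w_q, w_qc: the three [d] slices of W0
    s_c = (C @ w_c).unsqueeze(2)           # [batch, n, 1]
    s_q = (Q @ w_q).unsqueeze(1)           # [batch, 1, m]
    s_qc = (C * w_qc) @ Q.transpose(1, 2)  # [batch, n, m], the q∘c term
    return s_c + s_q + s_qc                # S: [batch, n, m] by broadcasting

d = 128
C, Q = torch.randn(2, 40, d), torch.randn(2, 10, d)
w_c, w_q, w_qc = (torch.randn(d) for _ in range(3))
print(trilinear_similarity(C, Q, w_c, w_q, w_qc).shape)  # torch.Size([2, 40, 10])
```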

#### Context-to-query

$A = \bar{S} Q$, shape = [batch, context_n, embed_size], where $\bar{S}$ is obtained by applying softmax to each row of $S$.

#### Query-to-context

Empirically, we find that the DCN attention can provide a little benefit over simply applying context-to-query attention, so we adopt this strategy.

Following DCN, $B = \bar{S}\,\bar{\bar{S}}^T C$, where $\bar{\bar{S}}$ is obtained by applying softmax to each column of $S$. The shapes work out as:

$\bar{S}$.shape = [batch, context_n, query_m]

$\bar{\bar{S}}^T$.shape = [batch, query_m, context_n]

$(\bar{\bar{S}}^T C)$.shape = [batch, query_m, embed_size]

$B$.shape = [batch, context_n, embed_size]
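A minimal sketch of both attention directions from a precomputed similarity matrix $S$, assuming PyTorch; the function name is mine.

```python
import torch
import torch.nn.functional as F

def context_query_attention(S, C, Q):
    # S: [batch, n, m]; C: [batch, n, d]; Q: [batch, m, d]
    S_row = F.softmax(S, dim=2)  # S-bar: softmax over each row (query positions)
    S_col = F.softmax(S, dim=1)  # S-double-bar: softmax over each column (context positions)
    A = S_row @ Q                          # context-to-query: [batch, n, d]
    B = S_row @ S_col.transpose(1, 2) @ C  # query-to-context (DCN): [batch, n, d]
    return A, B

S, C, Q = torch.randn(2, 40, 10), torch.randn(2, 40, 128), torch.randn(2, 10, 128)
A, B = context_query_attention(S, C, Q)
print(A.shape, B.shape)  # torch.Size([2, 40, 128]) torch.Size([2, 40, 128])
```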

### Output layer

$$p^1=\mathrm{softmax}(W_1[M_0;M_1]),\quad p^2=\mathrm{softmax}(W_2[M_0;M_2])$$

where $M_0$, $M_1$, $M_2$ are, from bottom to top, the outputs of the three stacked model encoders, and $W_1$, $W_2$ are trainable variables.

$$L(\theta)=-\dfrac{1}{N}\sum_i^N\left[\log(p^1_{y_i^1})+\log(p^2_{y_i^2})\right]$$
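A minimal sketch of the output layer and loss, assuming PyTorch; the `nn.Linear` layers stand in for $W_1$ and $W_2$, and all shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, n, d = 2, 40, 128
M0, M1, M2 = (torch.randn(batch, n, d) for _ in range(3))  # three model encoder outputs
W1, W2 = nn.Linear(2 * d, 1), nn.Linear(2 * d, 1)

# span-start and span-end distributions over the n context positions
log_p1 = F.log_softmax(W1(torch.cat([M0, M1], dim=-1)).squeeze(-1), dim=-1)  # [batch, n]
log_p2 = F.log_softmax(W2(torch.cat([M0, M2], dim=-1)).squeeze(-1), dim=-1)  # [batch, n]

y1 = torch.randint(0, n, (batch,))  # gold start positions
y2 = torch.randint(0, n, (batch,))  # gold end positions
loss = -(log_p1.gather(1, y1.unsqueeze(1)) + log_p2.gather(1, y2.unsqueeze(1))).mean()
print(loss)
```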

### What makes QANet good?

• Separable convolutions not only use fewer parameters and run faster, they also work better: replacing them with traditional convolutions decreases F1 by 0.7.

• Removing the CNN decreases F1 by 2.7.

• Removing self-attention decreases F1 by 1.3.

• layer normalization

• residual connections

• L2 regularization

Xie Pan

2018-09-22

2021-06-29