paper: CoQA: A Conversational Question Answering Challenge


We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.

CoQA is a conversational reading comprehension dataset: 127k question-answer pairs obtained from 8k conversations over passages from seven diverse domains.

The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage.

We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

The challenges CoQA poses differ from those of traditional RC datasets, chiefly coreference and pragmatic reasoning.

We ask other people a question to either seek or test their knowledge about a subject. Depending on their answer, we follow up with another question and their answer builds on what has already been discussed. This incremental aspect makes human conversations succinct. An inability to build up and maintain common ground in this way is part of why virtual assistants usually don’t seem like competent conversational partners.



CoQA is designed to test exactly this ability.


In CoQA, a machine has to understand a text passage and answer a series of questions that appear in a conversation. We develop CoQA with three main goals in mind.

The first concerns the nature of questions in a human conversation. Posing short questions is an effective human conversation strategy, but such questions are a pain in the neck for machines.

First: in conversation, humans ask very short questions, which are hard for machines. For example, Q5 "Who?"

The second goal of CoQA is to ensure the naturalness of answers in a conversation. Many existing QA datasets restrict answers to a contiguous span in a given passage, also known as extractive answers (Table 1). Such answers are not always natural, for example, there is no extractive answer for Q4 (How many?) in Figure 1. In CoQA, we propose that the answers can be free-form text (abstractive answers), while the extractive spans act as rationales for the actual answers. Therefore, the answer for Q4 is simply Three while its rationale is spanned across multiple sentences.

Second: answers are not extractive but abstractive, free-form text, e.g. Q4. That is much harder!

The third goal of CoQA is to enable building QA systems that perform robustly across domains. The current QA datasets mainly focus on a single domain which makes it hard to test the generalization ability of existing models.

Third: the data comes from multiple domains, which improves generalization.

Dataset collection


  1. It consists of 127k conversation turns collected from 8k conversations over text passages (approximately one conversation per passage). The average conversation length is 15 turns, and each turn consists of a question and an answer.
  2. It contains free-form answers. Each answer has an extractive rationale highlighted in the passage.
  3. Its text passages are collected from seven diverse domains — five are used for in-domain evaluation and two are used for out-of-domain evaluation.


Almost half of CoQA questions refer back to conversational history using coreferences, and a large portion requires pragmatic reasoning making it challenging for models that rely on lexical cues alone.


The best-performing system, a reading comprehension model that predicts extractive rationales which are further fed into a sequence-to-sequence model that generates final answers, achieves an F1 score of 65.1%. In contrast, humans achieve 88.8% F1, a gap of 23.7% F1, indicating that there is a lot of headroom for improvement.

The baseline turns an extractive reading comprehension model into a seq2seq form: the predicted rationales are fed to a seq2seq model that produces the final answers, reaching an F1 of 65.1%.

Question and answer collection

We want questioners to avoid using exact words in the passage in order to increase lexical diversity. When they type a word that is already present in the passage, we alert them to paraphrase the question if possible.

Questioners should avoid words that appear in the passage as far as possible, which increases lexical diversity.

For the answers, we want answerers to stick to the vocabulary in the passage in order to limit the number of possible answers. We encourage this by automatically copying the highlighted text into the answer box and allowing them to edit copied text in order to generate a natural answer. We found 78% of the answers have at least one edit such as changing a word’s case or adding a punctuation.

For the answers, answerers should stick to the passage's vocabulary to limit the range of possible answers. The highlighted text (i.e., the rationale) is automatically copied into the answer box, and the answerer edits it into a natural answer; 78% of answers have at least one edit, such as changing a word's case or adding punctuation.

Passage collection

Not all passages in these domains are equally good for generating interesting conversations. A passage with just one entity often results in questions that entirely focus on that entity. Therefore, we select passages with multiple entities, events and pronominal references using Stanford CoreNLP (Manning et al., 2014). We truncate long articles to the first few paragraphs that result in around 200 words.

If a passage has only one entity, the conversations generated from it all revolve around that entity, which is clearly not what this dataset wants. The authors therefore analyze passages with Stanford CoreNLP and select those with multiple entities, events, and pronominal references.

Table 2 shows the distribution of domains. We reserve the Science and Reddit domains for out-ofdomain evaluation. For each in-domain dataset, we split the data such that there are 100 passages in the development set, 100 passages in the test set, and the rest in the training set. For each out-of-domain dataset, we just have 100 passages in the test set.

The in-domain data covers Children's stories, Literature, Mid/High school exams, News, and Wikipedia; each domain contributes 100 passages to the development set, 100 to the test set, and the rest to training. The out-of-domain data (Science and Reddit) has only 100 test passages per domain.


Collecting multiple answers

Some questions in CoQA may have multiple valid answers. For example, another answer for Q4 in Figure 2 is A Republican candidate. In order to account for answer variations, we collect three additional answers for all questions in the development and test data.

A question may admit several valid answers, so the dev and test sets include three additional answers per question.

In the previous example, if the original answer was A Republican Candidate, then the following question Which party does he belong to? would not have occurred in the first place. When we show questions from an existing conversation to new answerers, it is likely they will deviate from the original answers which makes the conversation incoherent. It is thus important to bring them to a common ground with the original answer.

For example, Q4 in the figure above: if a new answerer writes A Republican candidate instead of the original answer, the whole conversation is interconnected, so the follow-up questions no longer fit and the conversation becomes incoherent.

We achieve this by turning the answer collection task into a game of predicting original answers. First, we show a question to a new answerer, and when she answers it, we show the original answer and ask her to verify if her answer matches the original. For the next question, we ask her to guess the original answer and verify again. We repeat this process until the conversation is complete. In our pilot experiment, the human F1 score is increased by 5.4% when we use this verification setup.

Just as models are trained against the original answers, human answerers need the same common ground to avoid the incoherence described above: after an answerer gives an answer, she is told whether it matches the original, and the process repeats until the conversation is complete.

Dataset Analysis

What makes the CoQA dataset conversational compared to existing reading comprehension datasets like SQuAD? How does the conversation flow from one turn to the other? What linguistic phenomena do the questions in CoQA exhibit? We answer these questions below.

In the questions:

  1. Coreferential pronouns (he, him, she, it, they) occur much more frequently; SQuAD has almost none.

  2. In SQuAD, what accounts for nearly half of all questions; CoQA's question types are more diverse, with did, was, is, does appearing frequently.

  3. CoQA questions are shorter. See Figure 3.

  4. 33% of the answers are abstractive, which is higher than the authors expected, since extractive answers are easier for annotators to write. Yes/no answers also take a sizable share.

Conversation Flow

A coherent conversation must have smooth transitions between turns.

A good conversation is exploratory, progressively digging deeper into the passage's information.

The authors split each passage evenly into 10 chunks and analyze how the chunk a turn attends to shifts as the conversation progresses.

Linguistic Phenomena

Relationship between a question and its passage:

  • Lexical match: the question shares at least one word with the passage.

  • Paraphrasing: the question shares no words with the passage but restates the rationale in different words, typically involving synonymy, antonymy, hypernymy, hyponymy, and negation.

  • Pragmatics: requires reasoning beyond the text.

Relationship between a question and its conversation history:

  • No coref

  • Explicit coref.

  • Implicit coref.


Lately I have been taking part in the AI Challenger opinion-style reading comprehension competition. The dataset has the following form:

  • Embedding: use pre-trained Chinese word vectors.

  • Encoder: encode the passage, query, and alternatives with a Bi-GRU.

  • Attention: build a similarity matrix with the trilinear form (after masking), then apply BiDAF-style bi-attention flow to obtain the attended passage.

  • Contextual: encode the attended passage with a Bi-GRU to obtain the fusion representation.

  • Match: use attention pooling to turn the fusion and enc_answer into single vectors, then match them with cosine similarity to pick the most similar answer.

The accuracy I can currently reach is 0.687, still 0.1 behind first place… Besides swapping models, there are actually quite a few places to improve:

  • Pre-train my own word vectors on the training set with ELMo or word2vec.

  • Use richer attention variants, as many papers describe, or even add hand-crafted features, such as those mentioned on 苏剑林's blog.

  • The match part also matters a lot: can attention pooling be replaced with something better?

Still, trying out one model after another has to account for speed… RNNs are simply too slow, so I decided to try CNN-based approaches for NLP tasks.

There is plenty of work on using CNNs for reading comprehension; here I focus on these two papers:


paper: Convolutional Sequence to Sequence Learning

This paper targets machine translation. Besides encoding sentences with a CNN, its core idea is to use a CNN in the decoder as well. For reading comprehension, what carries over is the sentence encoding; but for learning's sake, let's look at the decoder too.

For text, a CNN's strength is exploiting local information and extracting local features, which suits text classification. For tasks like machine translation and reading comprehension that need global information, CNNs seem less effective. Moreover, decoding generates words one by one, each conditioned on the previous word, so using an RNN in the decoder feels natural.

Facebook's paper overturns this conventional thinking: it uses CNNs not only to encode global information but also to decode.


Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.

Multi-layer CNNs have a hierarchical representation structure: nearby words interact in lower layers and distant words interact in higher layers (interaction serving, e.g., sense disambiguation).

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared to a linear number O(n) for recurrent neural networks.

The hierarchy provides a shorter path for long-range dependencies: two words n apart need O(n) steps to interact in an RNN but only O(n/k) in stacked CNN layers with kernel width k. This cuts down the non-linear operations and reduces vanishing gradients, so the two words interact more effectively.

Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of nonlinearities applied to the inputs also eases learning.

Every word fed into the CNN passes through the same fixed number of kernels and non-linear operations, whereas with an RNN the first word goes through up to n operations and the last word through only one. The authors argue that a fixed amount of computation per input eases learning.


Model Architecture


  • position embedding

  • convolution block structure

  • Multi-step attention

position encoding

Positional encoding shows up in many places; whenever there is no RNN, PE is used to encode position information. Interestingly, in this paper the authors find experimentally that PE does not seem to matter much.

convolution blocks

The authors use a gating mechanism: GLU, gated linear units,

from the paper Language Modeling with Gated Convolutional Networks,

in which the authors train language models in an unsupervised way and compare CNN language models against LSTMs.


$$h_l=(XW+b)\otimes \sigma(XV+c)$$

The output of each layer is a linear projection X ∗ W + b modulated by the gates σ(X ∗ V + c). Similar to LSTMs, these gates multiply each element of the matrix X ∗W+b

and control the information passed on in the hierarchy.

An LSTM-style gate would instead be the GTU:

$$h_l=\tanh(XW+b)\otimes \sigma(XV+c)$$

Comparing the two, the authors find that GLU works better.
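To make the two gates concrete, here is a minimal NumPy sketch of GLU versus GTU (shapes and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, b, V, c):
    # GLU: linear path (XW + b) modulated by a sigmoid gate sigma(XV + c).
    return (x @ W + b) * sigmoid(x @ V + c)

def gtu(x, W, b, V, c):
    # GTU (LSTM-style): tanh path modulated by the same sigmoid gate.
    return np.tanh(x @ W + b) * sigmoid(x @ V + c)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))        # 5 positions, 8 input channels
W, V = rng.standard_normal((2, 8, 4))  # two projections to 4 output channels
b = np.zeros(4)
c = np.zeros(4)
out_glu = glu(x, W, b, V, c)
out_gtu = gtu(x, W, b, V, c)
```

Note that the GTU output is bounded by the tanh, while GLU's linear path keeps the gradient of the ungated half, which is the usual explanation for GLU training better.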

Residual connections: to build a deeper convolutional network, the authors add residual connections.




For instance, stacking 6 blocks with k = 5 results in an input field of 25 elements, i.e. each output depends on 25 inputs. Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed.


From the figure above: with k=3 and 3 blocks, each position in the third layer depends on 7 input columns. In general the receptive field is k + (k-1) × (blocks-1).
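This receptive-field arithmetic is easy to sanity-check with a tiny helper (my own sketch, assuming stride 1 and no dilation):

```python
def receptive_field(kernel_size: int, blocks: int) -> int:
    # Stacking `blocks` conv layers of width k (stride 1, no dilation):
    # each additional layer widens the field by k - 1 positions.
    return kernel_size + (kernel_size - 1) * (blocks - 1)
```

With k = 5 and 6 blocks this gives 25, matching the paper's example, and k = 3 with 3 blocks gives 7, matching the figure.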


  • ConvS2S uses 1-D convolution: the kernel slides only along the time dimension with a fixed stride of 1. This is because language lacks the scale-invariance of images: uniformly downsampling an image preserves its content, whereas taking every other word changes a sentence's meaning a great deal.

  • In vision, a conv layer usually has multiple filters to capture an image's different patterns; in ConvS2S each layer has a single filter. A sentence enters the filter as [1, n, d], where n is the sentence length; the filter convolves along the n direction, and d, the word-vector dimension, can be seen as channels, analogous to the RGB channels of a color image.

Unlike the common practice in vision, Facebook's design sets only one filter per layer. One possible reason is to simplify the model and speed up convergence; another is that they consider sentence patterns much simpler than image patterns, so stacking one-filter layers captures them all. The former seems more likely: the success of multi-head attention in the Transformer suggests that a sentence does have multiple patterns.

Isn't this passage questionable, though? Isn't the number of filters 2d? The Transformer's multi-head attention mentioned here attends to subsets of a word vector's dimensions; I remember Manning giving an example in cs224d where, after training or word-word interaction, only some dimensions of a word vector changed.

In the paper, the convolution kernel has size $W\in R^{2d\times kd}$.

For encoder networks we ensure that the output of the convolutional layers matches the input length by padding the input at each layer. However, for decoder networks we have to take care that no future information is available to the decoder (Oord et al., 2016a). Specifically, we pad the input by k − 1 elements on both the left and right side by zero vectors, and then remove k elements from the end of the convolution output.

The encoder and decoder networks pad differently, because the decoder must not see future information.

In the encoder, zeros are padded on both sides so that the convolution output keeps the input length.

In the decoder, the padding is arranged so that each output sees only earlier positions (pad by k-1 and then trim the tail of the convolution output), so words are still generated one by one.

Multi-step Attention


$$a_{ij}^l=\dfrac{\exp(d_i^l\cdot z_j^u)}{\sum_{t=1}^m \exp(d_i^l\cdot z_t^u)}$$


In the formula above, l indexes the decoder's convolutional layer, i the time step, and the $z^u$ are the encoder outputs.

This is actually quite close to an RNN decoder:

  • Training uses teacher forcing: the kernel $W_d^l$ convolves over the target-sentence states $h^l$ to give $(W_d^lh_i^l + b_d^l)$, analogous to the hidden state in an RNN decoder; adding the previous word's embedding $g_i$ yields $d_i^l$.

  • This interacts with the source-sentence representation from the encoder, and a softmax gives the attention weights $a_{ij}^l$.

  • The attention vector differs from an RNN decoder's: the input element embeddings $e_j$ are added in.
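The attention weights above are just a softmax of dot products between one decoder state and all encoder outputs; a small NumPy sketch (toy shapes):

```python
import numpy as np

def attention_weights(d_i, z):
    # Softmax over encoder positions: a_ij proportional to exp(d_i . z_j).
    scores = z @ d_i            # (m,) dot products with every z_j^u
    scores -= scores.max()      # numerical stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 4))   # m = 6 encoder outputs z^u, dim 4
d_i = rng.standard_normal(4)      # one decoder state d_i^l
a = attention_weights(d_i, z)     # attention weights over the 6 positions
```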

至于这里为什么要加 $e_j$?

We found adding e_j to be beneficial and it resembles key-value memory networks where the keys are the z_j^u and the values are the z^u_j + e_j (Miller et al., 2016). Encoder outputs z_j^u represent potentially large input contexts and e_j provides point information about a specific input element that is useful when making a prediction. Once c^l_i has been computed, it is simply added to the output of the corresponding decoder layer h^l_i.

$z_j^u$ carries the richer context, while $e_j$ points to the specific input element that is useful for the prediction. Try it and see, I suppose…

On multi-hop attention:

This can be seen as attention with multiple 'hops' (Sukhbaatar et al., 2015) compared to single step attention (Bahdanau et al., 2014; Luong et al., 2015; Zhou et al., 2016; Wu et al., 2016). In particular, the attention of the first layer determines a useful source context which is then fed to the second layer that takes this information into account when computing attention etc. The decoder also has immediate access to the attention history of the k − 1 previous time steps because the conditional inputs $c^{l-1}_{i-k}, \ldots, c^{l-1}_{i}$ are part of $h^{l-1}_{i-k}, \ldots, h^{l-1}_{i}$ which are input to $h^l_i$. This makes it easier for the model to take into account which previous inputs have been attended to already compared to recurrent nets where this information is in the recurrent state and needs to survive several non-linearities. Overall, our attention mechanism considers which words we previously attended to (Yang et al., 2016) and performs multiple attention 'hops' per time step. In Appendix §C, we plot attention scores for a deep decoder and show that at different layers, different portions of the source are attended to.

This is indeed similar to the multi-hop mechanism in memory networks.


Gated Linear Dilated Residual Network (GLDR):

a combination of residual networks (He et al., 2016), dilated convolutions (Yu & Koltun, 2016) and gated linear units (Dauphin et al., 2017).

text understanding with dilated convolution

kernel:$k=[k_{-l},k_{-l+1},…,k_l]$, size=$2l+1$

input: $x=[x_1,x_2,…,x_n]$

dilation: d


$$(k*x)_t=\sum_{i=-l}^l k_i\cdot x_{t+d\cdot i}$$
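A direct NumPy transcription of this definition, with zero padding at the borders (kernel and input are toy values):

```python
import numpy as np

def dilated_conv_at(kernel, x, t, d):
    # One output of a 1-D dilated convolution centred at position t:
    # (k * x)_t = sum_i kernel[i] * x[t + d*i], for i in [-l, l].
    # Out-of-range taps are treated as zero padding.
    l = (len(kernel) - 1) // 2
    total = 0.0
    for i in range(-l, l + 1):
        j = t + d * i
        if 0 <= j < len(x):
            total += kernel[l + i] * x[j]
    return total

x = np.arange(10, dtype=float)   # x = [0, 1, ..., 9]
k = np.array([1.0, 0.0, 2.0])    # size 2l+1 = 3, so l = 1
```

With dilation d = 2 at t = 4 this reads x[2] and x[6] instead of the adjacent x[3] and x[5], which is exactly how the receptive field spreads.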

Why use dilated convolution?

Repeated dilated convolution (Yu & Koltun, 2016) increases the receptive region of ConvNet outputs exponentially with respect to the network depth, which results in drastically shortened computation paths.


The authors compare GLDR against self-attention and RNNs in terms of input sequence length n, network width w, kernel size k, and network depth D.

Model Architecture

They compare against BiDAF and DrQA, replacing the BiLSTM components of both models with GLDR convolutions.

The receptive field of this convolutional network grows exponentially with depth and soon encompasses a long sequence, essentially enabling it to capture similar long-term dependencies as an actual sequential model.

The receptive field grows exponentially with depth, quickly covering long sequences and capturing the same long-range dependencies as a truly sequential model.

Convolutional BiDAF. In our convolutional version of BiDAF, we replaced all bidirectional LSTMs with GLDRs . We have two 5-layer GLDRs in the contextual layer whose weights are un-tied. In the modeling layer, a 17-layer GLDR with dilation 1, 2, 4, 8, 16 in the first 5 residual blocks is used, which results in a reception region of 65 words. A 3-layer GLDR replaces the bidirectional LSTM in the output layer. For simplicity, we use same-padding and kernel size 3 for all convolutions unless specified. The hidden size of all GLDRs is 100 which is the same as the LSTMs in BiDAF.

See the experiments section of the paper for the concrete architecture and hyperparameters.



Combining local convolution with global self-attention for reading comprehension


Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models.

The encoder consists solely of convolution and self-attention; without RNNs, it is simply fast.

The key motivation behind the design of our model is the following: convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.

The paper's main idea: use CNNs to capture the local structure of the text and self-attention to learn the global interaction between every pair of words, coupling in context. Compared with RNNs, attention handles long-range dependencies effectively, at the cost of word-order information. At bottom, it is another way of building a contextualized encoder.

we propose a complementary data augmentation technique to enhance the training data. This technique paraphrases the examples by translating the original sentences from English to another language and then back to English, which not only enhances the number of training instances but also diversifies the phrasing.




  • an embedding layer

  • an embedding encoder layer

  • a context-query attention layer

  • a model encoder layer

  • an output layer.

the combination of convolutions and self-attention is novel, and is significantly better than self-attention alone and gives 2.7 F1 gain in our experiments. The use of convolutions also allows us to take advantage of common regularization methods in ConvNets such as stochastic depth (layer dropout) (Huang et al., 2016), which gives an additional gain of 0.2 F1 in our experiments.

Combining CNNs with self-attention beats self-attention alone; and using CNNs also enables common ConvNet regularizers such as stochastic depth (layer dropout), which brings a small additional gain.

Input embedding layer

obtain the embedding of each word w by concatenating its word embedding and character embedding.

Each word is the concatenation of its word embedding and character embedding. The word embeddings are pre-trained GloVe vectors, kept fixed; only the OOV (out-of-vocabulary) vector is trainable, mapping every word outside the vocabulary.

Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters. The length of each word is either truncated or padded to 16. We take maximum value of each row of this matrix to get a fixed-size vector representation of each word.

Character embeddings: each character is a trainable 200-d vector. Every word is truncated or padded to 16 characters so all words have the same size, and a per-dimension max over the characters gives a fixed-size vector.

So each word's final vector has dimension $300+200=500$.
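A quick NumPy sketch of the embedding assembly; random vectors stand in for GloVe and the learned character embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
word_len, char_dim, glove_dim = 16, 200, 300   # dimensions from the paper

# A word as 16 (padded/truncated) characters, each a trainable 200-d vector.
char_embs = rng.standard_normal((word_len, char_dim))
# Max over the character axis gives one fixed-size 200-d vector per word.
char_vec = char_embs.max(axis=0)
# Pre-trained 300-d word vector (GloVe), kept fixed during training.
word_vec = rng.standard_normal(glove_dim)

final = np.concatenate([word_vec, char_vec])   # 300 + 200 = 500 dims
```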

Embedding encoding layer

The encoder layer is a stack of the following basic building block: [convolution-layer × # + self-attention-layer + feed-forward-layer]


  • convolution: depthwise separable convolutions are used instead of traditional ones, because the authors find "it is memory efficient and has better generalization" (see the original paper for the details). The kernel size is 7, the number of filters is d = 128.

Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block, shown lower-right in Figure 1. For an input x and a given operation f, the output is f(layernorm(x))+x.

Layer normalization is applied around each conv/self-attention/ffn layer.

Why use CNNs:

To capture local k-gram features.

The figure should make the CNN in QANet clearer. The kernels above have sizes [2, embed_size], [3, embed_size], [3, embed_size]; with "SAME" padding, each yields a [1, sequence_len] map, and concatenating them gives the final [filters_num, sequence_len].

In the QANet implementation, kernel_size is set to 7 and num_filters = 128.

Why use self-attention?

The scheme in the figure above is clearly not great: high complexity and mediocre results. Hence self-attention.

Take inner products among the matrix's row vectors and softmax them to get every other word's weight for the word "The" (the weights are proportional to similarity; equating similarity with match looks questionable here, yet it works well in practice, probably owing to how the word vectors are trained).

Then multiply the weights $[w_1,w_2,w_3,w_4,w_5]$ with the corresponding word vectors and sum, producing a contextualized "The".
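A bare-bones NumPy sketch of this unprojected self-attention (toy embeddings, no learned parameters):

```python
import numpy as np

def self_attention(X):
    # Pairwise dot products give each word's weights over all words;
    # each output row is the softmax-weighted sum of the word vectors.
    scores = X @ X.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)             # each row sums to 1
    return w @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))   # 5 words ("The", ...), 4-d embeddings
ctx = self_attention(X)           # contextualized word vectors
```

Real implementations add learned query/key/value projections; this keeps only the similarity-then-weighted-sum core described above.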


Context-Query Attention Layer

Same as in BiDAF. Let's run through the formulas from scratch, without the notes.

context: $C=\{c_1, c_2,\dots,c_n\}$

query: $Q=\{q_1,q_2,\dots,q_m\}$

After embedding:

  • context: [batch, context_n, embed_size]

  • query: [batch, query_m, embed_size]

Matrix multiplication gives the similarity matrix $S\in R^{n\times m}$:

sim_matrix: [batch, context_n, query_m]

The similarity function used here is the trilinear function (Seo et al., 2016). $f(q,c)=W_0[q,c,q\circ c]$.

The similarity need not be a plain matrix product; a feed-forward network can be added on top, since similarity does not necessarily equal match.

Softmax over each row of S gives the weight matrix $\tilde S\in R^{n\times m}$, shape = [batch, context_n, query_m].

Multiplying with the query $Q^T$ ([batch, query_m, embed_size]) then yields the context encoded with query information:

$A = \tilde SQ^T$, shape = [batch, context_n, embed_size]


Empirically, we find that, the DCN attention can provide a little benefit over simply applying context-to-query attention, so we adopt this strategy.

Instead of BiDAF's method, the DCN approach is adopted here, reusing $\tilde S$.

Softmax over each column of S gives the matrix $\overline S$, shape = [batch, context_n, query_m].

Matrix multiplication then gives $B=\tilde S \overline S^T C^T$.

$\tilde S$.shape = [batch, context_n, query_m]

$\overline S^T$.shape = [batch, query_m, context_n]

$C^T$.shape = [batch, context_n, embed_size]

So finally B.shape = [batch, context_n, embed_size]
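The shapes above can be checked with a small NumPy sketch over a single (unbatched) example; a plain dot product stands in for the trilinear similarity:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, m, d = 7, 4, 8                  # context length, query length, dim
C = rng.standard_normal((n, d))    # context
Q = rng.standard_normal((m, d))    # query
S = C @ Q.T                        # similarity matrix; a plain dot product
                                   # stands in for the trilinear function
S_row = softmax(S, axis=1)         # row softmax    -> S tilde, (n, m)
S_col = softmax(S, axis=0)         # column softmax -> S bar,   (n, m)
A = S_row @ Q                      # context-to-query attention, (n, d)
B = S_row @ S_col.T @ C            # DCN-style attention,        (n, d)
```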

Model Encoder Layer

As in BiDAF, the input is $[c,a,c\circ a,c\circ b]$, where a and b are row vectors of the attention matrices A and B. Unlike BiDAF, there is no bi-LSTM here; instead, it is an encoder-style block of [conv + self-attention + ffn], with 2 conv layers per block and 7 blocks in total.

Output layer

$$p^1=softmax(W_1[M_0;M_1]), p^2=softmax(W_2[M_0;M_2])$$

where $W_1, W_2$ are trainable parameter matrices and $M_0, M_1, M_2$ are as shown in the figure.



Where is QANet good, and why?

  • Separable convolutions are not only lighter and faster but also more accurate: replacing them with vanilla CNNs costs 0.7 F1.

  • Removing the CNNs costs 2.7 F1.

  • Removing self-attention costs 1.3 F1.

  • layer normalization

  • residual connections

  • L2 regularization


Paper notes: Pointer Networks and copy mechanism


Pointer Network


We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.

A new architecture is proposed to learn the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in the input sequence.

Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence [1] and Neural Turing Machines [2], because the number of target classes in each step of the output depends on the length of the input, which is variable.

Simply copying output-relevant tokens from the input sequence is hard for seq2seq or Neural Turing Machines, because the number of word classes at each decoding step depends on the input length, which varies.

Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class.


It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

Unlike previous attention mechanisms, which at each decoder step blend the RNN-encoded input's hidden states into a context vector used to generate the current word, Ptr-Net uses attention itself as a pointer that selects a member of the input sequence as the output.

We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems – finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem – using training examples


Ptr-Net can be used to learn these three kinds of geometric problems.

Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on.

Ptr-Net not only improves over seq2seq with attention but also generalizes to variable-size output dictionaries.

From the abstract and introduction, Ptr-Net mainly addresses two problems.

  • First, simple copying is hard for traditional methods, whereas Ptr-Net generates the output sequence directly from the input sequence.

  • Second, it handles a variable output dictionary. A vanilla Seq2Seq model has a fixed-size output dictionary, which is unfriendly when the output contains input words (especially OOV and rare words). On the one hand, embeddings of words rare in training are of low quality and hard to predict at decoding time; on the other hand, even with good embeddings, named entities such as person names all have very similar embeddings, making it hard to reproduce exactly the word the input mentioned. Pointer Networks, and the copy mechanism in the follow-up CopyNet, handle this well: at each time step, the decoder learns to copy key words that appear in the input directly.

Model Architecture

Before introducing Ptr-Net, the authors review the base models: seq2seq and input attention.

sequence-to-sequence Model

In essence, seq2seq maximizes, over the sample space, the probability of the output sequence given the input; MT, QA, and summarization can all be seen as this kind of problem, with the model adapted to the relation between input and output.



$$\theta^* = \arg\max_{\theta}\sum_{P,C^P}\log p(C^P|P;\theta)$$


In this sequence-to-sequence model, the output dictionary size for all symbols $C_i$ is fixed and equal to n, since the outputs are chosen from the input. Thus, we need to train a separate model for each n. This prevents us from learning solutions to problems that have an output dictionary with a size that depends on the input sequence length.

In the seq2seq model the output dictionary has a fixed size, so it cannot handle a dictionary that varies with the input.

Content Based Input Attention

At each decoder step, first compute $e_{ij}$ (how well input position j matches output position i), softmax it into $a_{ij}$, take the weighted sum as the context vector $c_i$, and then softmax over the fixed-size output dictionary to predict the next word.

This model performs significantly better than the sequence-to-sequence model on the convex hull problem, but it is not applicable to problems where the output dictionary size depends on the input.

Nevertheless, a very simple extension (or rather reduction) of the model allows us to do this easily.


A seq2seq model softmaxes over a fixed dictionary and emits the most probable word. Here, however, the output dictionary size depends on the input sequence length, so the authors propose a new, and in fact very simple, model.

$$u_j^i=v^T\tanh(W_1e_j+W_2d_i),\quad j\in(1,\dots,n)$$

Here i is the decoder time step and j an index into the input sequence; $e_j$ is the encoder hidden state and $d_i$ the decoder hidden state at step i. This is essentially standard attention, except that the resulting softmax distribution is applied directly over the input sequence as a pointer, conditioned on the outputs $C_1,\dots,C_{i-1}$ chosen so far.
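A small NumPy sketch of this pointer step (toy dimensions; weight names follow the formula):

```python
import numpy as np

def pointer_distribution(e, d_i, W1, W2, v):
    # u_j = v^T tanh(W1 e_j + W2 d_i), softmaxed over the n input
    # positions, so the output "vocabulary" is the input itself.
    u = np.tanh(e @ W1.T + d_i @ W2.T) @ v   # (n,) scores
    u -= u.max()                             # numerical stability
    p = np.exp(u)
    return p / p.sum()

rng = np.random.default_rng(0)
n, h = 6, 5
e = rng.standard_normal((n, h))          # encoder hidden states e_1..e_n
d_i = rng.standard_normal(h)             # decoder state at step i
W1, W2 = rng.standard_normal((2, h, h))
v = rng.standard_normal(h)
p = pointer_distribution(e, d_i, W1, W2, v)
```

The distribution always has exactly n entries, which is why the model handles variable input lengths for free.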

Dataset Structure



We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation.

As noted earlier, seq2seq struggles with simple copying, yet copying is very common in human dialogue, especially for named entities and long phrases.

The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation.

This is the challenge the seq2seq model must face: new machinery to decide when to copy.

For example:

As can be seen, entity-like words such as Chandralekha may be OOV, other entities, or dates, information the decoder can hardly "reproduce"; CopyNet handles this kind of information better.


  • What to copy: which parts of the input should be copied?

  • Where to paste: where in the output should that information go?

Model Architecture

The authors view CopyNet from two perspectives:

  • From a cognitive perspective, the copying mechanism is related to rote memorization, requiring less understanding but ensuring high literal fidelity.

  • From a modeling perspective, the copying operations are more rigid and symbolic, making it more difficult than soft attention mechanism to integrate into a fully differentiable neural model.

Overall, it is still an encoder-decoder model.

An LSTM converts the source sequence into the hidden states M(emory): $h_1,\dots,h_{T_S}$.


Like a canonical decoder, an RNN reads the encoder's hidden states M; but unlike a traditional decoder, it differs in the following ways:

  • Prediction: COPYNET predicts words based on a mixed probabilistic model of two modes, namely the generate-mode and the copy-mode, where the latter picks words from the source sequence (just as Ptr-Net does above).
  • State Update: the predicted word at time t−1 is used in updating the state at t, but COPYNET uses not only its word-embedding but also its corresponding location-specific hidden state in M (if any).
  • Reading M: in addition to the attentive read to M, COPYNET also has "selective read" to M, which leads to a powerful hybrid of content-based addressing and location-based addressing. When to copy, when to answer from understanding, and how to blend the two modes is the key question.

My own thought: whether to copy or not should itself rest on understanding. But with OOV words or poorly trained embeddings, understanding fails, right? Could a gate mechanism be added? The machine still does not truly understand language; this might be a point worth exploring.


Prediction with Copying and Generation:$s_t\rightarrow y_t$

This part covers the step from the decoder hidden state $s_t$ to the output word $y_t$; in a traditional encoder-decoder, a linear map suffices.

Vocabulary $\mathcal{V}=\{v_1,\dots,v_N\}$; out-of-vocabulary (OOV) words are represented by UNK (which presumably also has its own embedding vector). $X=\{x_1,\dots,x_{T_S}\}$ denotes the unique words of the input sequence; it is X that lets CopyNet output OOV words.


In a nutshell, for the current source sentence X the output vocabulary is $\mathcal{V}\cup \{\text{UNK}\} \cup X$.

Given the decoder's current hidden state $s_t$ and the encoder's hidden-state sequence M:

$$p(y_t|s_t,y_{t-1},c_t,M)=p(y_t,g|s_t,y_{t-1},c_t,M) + p(y_t,c|s_t,y_{t-1},c_t,M)$$

where g denotes the generate mode and c the copy mode.

We know the encoder outputs $h_1,\dots,h_{T_S}$, written M, carry both semantic and positional information. The decoder can read M in two ways:

  • Content-based: attentive read from word embeddings.

  • Location-based: selective read from location-specific hidden units.

The probabilities of the two modes, and their score functions:

$$p(y_t,g|\cdot)=\begin{cases} \dfrac{1}{Z}e^{\psi_g(y_t)}, & y_t\in V\\ 0, & y_t\in X\cap \overline V\\ \dfrac{1}{Z}e^{\psi_g(\text{UNK})}, & y_t\notin V\cup X\end{cases}$$

$$p(y_t,c|\cdot)=\begin{cases}\dfrac{1}{Z}\sum_{j:x_j=y_t}e^{\psi_c(x_j)}, & y_t\in X\\ 0, & \text{otherwise}\end{cases}$$

Adding the two formulas above can be pictured as in the figure below (think of the target word as a 4-way classification),

where $\psi_g(\cdot)$ and $\psi_c(\cdot)$ are the score functions of the generate mode and the copy mode.

Z is the normalization term shared by the two modes, $Z=\sum_{v\in V\cup\{UNK\}}e^{\psi_g(v)}+\sum_{x\in X}e^{\psi_c(x)}$.

Then compute the corresponding score for each class.


$$\psi_g(y_t=v_i)=\nu_i^TW_os_t, v_i\in V\cup UNK$$

  • $W_o\in R^{(N+1)\times d_s}$

  • $\nu_i$ is the one-hot vector of $v_i$; the product picks out the current word's (unnormalized) score.

The generate-mode score $\psi_g(y_t=v_i)$ is the same as in an ordinary encoder-decoder: a fully connected layer followed by softmax.


$$\psi_c(y_t=x_j)=\sigma(h_j^TW_c)s_t,\quad x_j\in X$$

  • $h_j$ is the encoder hidden state; j indexes a position in the input sequence.

  • $W_c\in R^{d_h\times d_s}$ maps $h_j$ into the same semantic space as $s_t$.

  • The authors find a tanh non-linearity works better. Also, since $y_t$ may occur several times in the input, the copy score sums over all input positions whose word equals $y_t$.
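To see how the two score tables combine under one shared normalizer, here is a toy NumPy sketch (the vocabulary, source sentence, and random scores are made up; random values stand in for $\psi_g$ and $\psi_c$):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "UNK"]             # toy V plus UNK
source = ["Chandralekha", "the", "cat"]   # toy source sequence X

psi_g = rng.standard_normal(len(vocab))   # stand-in generate-mode scores
psi_c = rng.standard_normal(len(source))  # stand-in copy-mode scores

# One normalizer Z shared by both modes, as in the formulas above.
Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()
p_gen = np.exp(psi_g) / Z
p_copy = np.exp(psi_c) / Z

# Probability of emitting "the": its generate-mode mass plus the
# copy-mode mass of every source position whose word is "the".
p_the = p_gen[vocab.index("the")] + sum(
    p_copy[j] for j, x in enumerate(source) if x == "the")
```

Note that "Chandralekha" gets probability mass only through the copy mode, which is exactly how CopyNet emits OOV words.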

state update

The previous part computed the output distribution from the decoder hidden state, i.e., $s_t\rightarrow y_t$. How is the current hidden state itself computed? In a traditional encoder-decoder, the update uses the content-based attention vector; in CopyNet, the authors modify the $y_{t-1}\rightarrow s_t$ computation.

First recall the basic attention module: the decoder state updates as $s_t=f(y_{t-1},s_{t-1},c_t)$, where $c_t$ comes from the attention mechanism.

CopyNet's $y_{t-1}$ differs here. It uses not only the word embedding but also the hidden states at specific positions of M; that is, $y_{t-1}$ is represented as $[e(y_{t-1});\zeta(y_{t-1})]$, where $e(y_{t-1})$ is the word embedding, and the extra term $\zeta(y_{t-1})$, called the selective read, exists to copy longer phrases contiguously. Much like attention, it is a weighted sum of the hidden states in M.


$$\zeta(y_{t-1})=\sum_{\tau=1}^{T_S}\rho_{t\tau}h_{\tau},\qquad \rho_{t\tau}=\begin{cases}\dfrac{1}{K}p(x_{\tau},c|s_{t-1},M), & x_{\tau}=y_{t-1}\\ 0, & \text{otherwise}\end{cases}$$


  • When $y_{t-1}$ does not appear in the source sentence, $\zeta(y_{t-1})=0$.

  • Here $K=\sum_{\tau':x_{\tau'}=y_{t-1}}p(x_{\tau'},c|s_{t-1},M)$ is the normalizing sum: again, the current word may occur at several input positions, and each occurrence has a different encoder hidden state and hence a different weight.

  • The paper does not explain p here; my guess is that it matches the copy score computed earlier.

  • Intuitively, $\zeta(y_{t-1})$ is a selective read of M: compute the weights of all input positions matching $y_{t-1}$, then take the weighted sum.
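A minimal sketch of the selective read, with a constant stand-in for $p(x_\tau,c|s_{t-1},M)$ (all names and shapes are illustrative):

```python
import numpy as np

def selective_read(y_prev, source, M, copy_prob):
    # zeta(y_{t-1}): weighted sum of encoder states M over the positions
    # where the source word equals the previous output word. `copy_prob`
    # stands in for p(x_tau, c | s_{t-1}, M).
    w = np.array([copy_prob(tau) if source[tau] == y_prev else 0.0
                  for tau in range(len(source))])
    K = w.sum()
    if K == 0.0:                   # y_{t-1} not in the source: zeta is 0
        return np.zeros(M.shape[1])
    return (w / K) @ M             # rho_{t,tau} = w / K, then weighted sum

rng = np.random.default_rng(0)
source = ["a", "b", "a", "c"]
M = rng.standard_normal((4, 3))    # encoder hidden states h_1..h_4
zeta = selective_read("a", source, M, lambda tau: 0.5)
```

With a constant copy probability, the two occurrences of "a" are averaged, which matches the intuition of reading M at the matching locations.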

Hybrid Addressing of M

Two addressing modes are combined: content-based and location-based addressing.

Location-based addressing:

$$\zeta(y_{t-1}) \xrightarrow{\text{update}} s_t \xrightarrow{\text{predict}} y_t \xrightarrow{\text{sel. read}} \zeta(y_t)$$



$$L=-\dfrac{1}{N}\sum_{k=1}^N\sum_{t=1}^Tlog[p(y_t^{(k)}|y_{<t}^{(k)}, X^{(k)})]$$

N is the batch size and T the target sentence length.

Paper notes: Match-LSTM


In SQuAD the answers do not come from a small set of candidate answers and they have variable lengths. We propose an end-to-end neural architecture for the task.

An end-to-end model proposed for SQuAD-style reading comprehension: SQuAD's answers are not extracted from a candidate set but, like human answers, are spans of varying length.

The architecture is based on match-LSTM, a model we proposed previously for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences.

It is mainly based on Pointer Networks.

Benchmark datasets for reading comprehension:

  • MCTest: A challenge dataset for the open-domain machine comprehension of text.

  • Teaching machines to read and comprehend.

  • The Goldilocks principle: Reading children’s books with explicit memory representations.

  • Towards AI-complete question answering: A set of prerequisite toy tasks.

  • SQuAD: 100,000+ questions for machine comprehension of text.


Traditional solutions to this kind of question answering tasks rely on NLP pipelines that involve multiple steps of linguistic analyses and feature engineering, including syntactic parsing, named entity recognition, question classification, semantic parsing, etc. Recently, with the advances of applying neural network models in NLP, there has been much interest in building end-to-end neural architectures for various NLP tasks, including several pieces of work on machine comprehension.

A traditional QA pipeline involves syntactic parsing, named entity recognition, question classification, semantic parsing, and so on. With the advance of deep learning, end-to-end models have emerged.

End-to-end model architecture:

  • Teaching machines to read and comprehend.

  • The Goldilocks principle: Reading children’s books with explicit memory representations.

  • Attention-based convolutional neural network for machine comprehension

  • Text understanding with the attention sum reader network.

  • Consensus attention-based neural networks for chinese reading comprehension.

However, given the properties of previous machine comprehension datasets, existing end-to-end neural architectures for the task either rely on the candidate answers (Hill et al., 2016; Yin et al., 2016) or assume that the answer is a single token (Hermann et al., 2015; Kadlec et al., 2016; Cui et al., 2016), which make these methods unsuitable for the SQuAD dataset.

Earlier models either select the answer from a candidate set or output a single token; neither suits SQuAD.

The model builds on the authors' earlier match-LSTM for textual entailment (Learning Natural Language Inference with LSTM) and further applies Pointer Net (https://papers.nips.cc/paper/5866-pointer-networks), so that the predicted output is drawn from the input rather than from a fixed vocabulary.

We propose two ways to apply the Ptr-Net model for our task: a sequence model and a boundary model. We also further extend the boundary model with a search mechanism.


Model Architecture


Pointer Network

Pointer Network (Ptr-Net) model : to solve a special kind of problems where we want to generate an output sequence whose tokens must come from the input sequence. Instead of picking an output token from a fixed vocabulary, Ptr-Net uses attention mechanism as a pointer to select a position from the input sequence as an output symbol.

Generate the answer from the input sentences.

The model architecture (similar in spirit to Pointer Network):



  • An LSTM preprocessing layer that preprocesses the passage and the question using LSTMs.

  • A match-LSTM layer that tries to match the passage against the question.

  • An Answer Pointer (Ans-Ptr) layer that uses Ptr-Net to select a set of tokens from the passage as the answer. The two proposed models differ only in this third layer.

LSTM Preprocessing Layer

$$H^p=\overrightarrow {LSTM}(P), H^q=\overrightarrow {LSTM}(Q)$$

A unidirectional LSTM is used directly; the hidden vectors at each step, $H^p\in R^{l\times P}, H^q\in R^{l\times Q}$, contain only left-context information.

Match-LSTM Layer

$$\overrightarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overrightarrow {h^r}_{i-1}+b^p)\otimes e_Q)\in R^{l\times Q}$$

$$\overrightarrow \alpha_i=softmax(w^T\overrightarrow G_i + b\otimes e_Q)\in R^{1\times Q}$$

The resulting attention weight $\overrightarrow{\alpha}_{i,j}$ above indicates the degree of matching between the $i^{th}$ token in the passage and the $j^{th}$ token in the question.

where $W^q,W^p,W^r \in R^{l\times l}$, $b^p,w\in R^l$, $b\in R$.

So $\overrightarrow{\alpha}_{i}$ represents the degree of match between the whole question and the $i$-th passage word, i.e. attention in the usual sense.

Traditional attention multiplies the passage and question matrices, e.g. query times keys in the transformer. A more complex variant, as in dynamic memory networks, subtracts and element-wise multiplies the two vectors to be matched and feeds the result through a two-layer feed-forward network.

The attention score here is computed differently again: $\overrightarrow{h^r_{i-1}}$ carries the information obtained by coupling, through an LSTM, the weighted question with the previous passage word.


$$\overrightarrow z_i=\begin{bmatrix} h_i^p \\ H^q\overrightarrow {\alpha}_i^T \end{bmatrix}$$


Then, LSTM-like, the $R^{l\times 1}$ vector obtained by coupling $\overrightarrow{h_{i-1}^r}$ with the current passage representation $H^p_i$ is repeated Q times to give an $R^{l\times Q}$ matrix, so $\overrightarrow G_i\in R^{l\times Q}$; passing it through a softmax-affine layer yields the attention weights.

To summarize the idea: the attention score is obtained neither by plain matrix multiplication nor by subtracting $h^p_i$ and $H^q$ and feeding a network; instead, the two representations to be matched, $h^p_i$ and $H^q$, pass through a two-layer network, with the term involving $H_i^p$ and $\overrightarrow {h_{i-1}^r}$ repeated Q times. It is still similar to DMN, except that rather than attending only over the current vectors, an LSTM couples in the preceding information.

Finally we obtain the desired output $\overrightarrow{h^r}$ combining attention and the LSTM: $\overrightarrow{h_i^r}=\overrightarrow{LSTM}(\overrightarrow z_i, \overrightarrow{h_{i-1}^r})$.
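One match-LSTM attention step can be sketched in numpy (parameter names follow the equations above; the sketch omits the LSTM cell itself and only produces $\alpha_i$ and the LSTM input $z_i$):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def match_step(Hq, hp_i, hr_prev, Wq, Wp, Wr, bp, w, b):
    """One match-LSTM attention step.
    Hq: [l, Q] question states, hp_i: [l] current passage state,
    hr_prev: [l] previous match-LSTM state. Returns (alpha_i, z_i)."""
    # broadcast the passage / previous-state term over the Q question positions
    G = np.tanh(Wq @ Hq + (Wp @ hp_i + Wr @ hr_prev + bp)[:, None])  # [l, Q]
    alpha = softmax(w @ G + b)                                       # [Q]
    z = np.concatenate([hp_i, Hq @ alpha])                           # [2l], LSTM input
    return alpha, z
```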

The authors also run a backward LSTM in the same way:

$$\overleftarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overleftarrow {h^r}_{i-1}+b^p)\otimes e_Q)$$

$$\overleftarrow \alpha_i=softmax(w^T\overleftarrow G_i + b\otimes e_Q)$$

which likewise yields $\overleftarrow {h_i^r}$.

  • $\overrightarrow {H^r}\in R^{l\times P}$ denotes the hidden states $[\overrightarrow {h^r_1}, \overrightarrow {h^r_2},\dots,\overrightarrow {h^r_P}]$.

  • $\overleftarrow {H^r}\in R^{l\times P}$ denotes the hidden states $[\overleftarrow {h^r_1}, \overleftarrow {h^r_2},\dots,\overleftarrow {h^r_P}]$.

Stacking the two gives the passage representation after matching against the question: $H^r=\begin{bmatrix} \overrightarrow H^r \\ \overleftarrow H^r \end{bmatrix} \in R^{2l\times P}$

Answer Pointer Layer

The Sequence Model

The answer is represented by a sequence of integers $a=(a_1,a_2,…)$ indicating the positions of the selected tokens in the original passage.

Attention is used once more: $\beta_{k,j}$ is the probability that the $k$-th answer token selects the $j$-th passage word, so $\beta_k\in R^{P+1}$.

$$F_k=tanh(V\tilde {H^r}+(W^ah^a_{k-1}+b^a)\otimes e_{P+1})\in R^{l\times (P+1)}$$

$$\beta_k=softmax(v^TF_k+c\otimes e_{P+1}) \in R^{1\times (P+1)}$$

where $\tilde {H^r}\in R^{2l\times (P+1)}$ is $H^r$ extended with a zero vector, $\tilde {H^r}=[H^r, 0]$, and $V\in R^{l\times 2l}$, $W^a\in R^{l\times l}$, $b^a,v\in R^l$, $c\in R$.

So, just as in the match-LSTM layer, each step first applies a fully connected map $W^ah^a_{k-1}+b^a$, repeats it $P+1$ times to get $R^{l\times (P+1)}$, applies the tanh activation, then another fully connected layer, and finally a softmax for the multi-class choice.

$$h_k^a=\overrightarrow{LSTM}(\tilde {H^r}\beta_k^T, h^a_{k-1})$$

Here the product of $\tilde {H^r}$ and the weights $\beta_k$ serves as the LSTM input at step k. A bit magical; it can be viewed as self-attention combined with an LSTM.

Modeling the probability of generating the answer sequence:

$$p(a|H^r)=\prod_k p(a_k|a_1,a_2,…,a_{k-1}, H^r)$$



Loss function:

$$-\sum_{n=1}^N \log p(a_n|P_n,Q_n)$$

The Boundary Model

So the main difference from the sequence model above is that in the boundary model we do not need to add the zero padding to $H^r$, and the probability of generating an answer is simply modeled as:

$$p(a|H^r)=p(a_s|H^r)p(a_e|a_s, H^r)$$

Search mechanism, and bi-directional Ans-Ptr.



SQuAD: Passages in SQuAD come from 536 articles from Wikipedia covering a wide range of topics. Each passage is a single paragraph from a Wikipedia article, and each passage has around 5 questions associated with it. In total, there are 23,215 passages and 107,785 questions. The data has been split into a training set (with 87,599 question-answer pairs), a development set (with 10,570 question-answer pairs) and a hidden test set.


  • dimension l of the hidden layers is set to 150 or 300.

  • Adamax: $\beta_1=0.9, \beta_2=0.999$

  • minibatch size = 30

  • no L2 regularization.


Paper Notes: QA BiDAF



Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query.

A definition of machine comprehension: the interaction between the query and the context.

Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention.

Traditional attention-based approaches.

In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.

The proposed method, BiDAF, uses a multi-stage hierarchical bi-directional attention-flow mechanism to represent the context at different levels of granularity, obtaining a query-aware context representation without early summarization.


Attention mechanisms in previous works typically have one or more of the following characteristics. First, the computed attention weights are often used to extract the most relevant information from the context for answering the question by summarizing the context into a fixed-size vector. Second, in the text domain, they are often temporally dynamic, whereby the attention weights at the current time step are a function of the attended vector at the previous time step. Third, they are usually uni-directional, wherein the query attends on the context paragraph or the image.

A summary of the characteristics of attention in prior work:

  • 1. Attention weights are used to extract the most relevant information from the context for answering the question, with the context summarized into a fixed-size vector.

  • 2. In the text domain they are often temporally dynamic: the attention weights at the current time step depend on the attended vector of the previous step.

  • 3. They are usually uni-directional: the query attends to the context paragraph or the image.

Model Architecture

Compared with traditional applications of attention to MC, BiDAF makes the following changes:

  • First, our attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.

1) The context is not encoded into a fixed-size vector; instead, the attended vector computed at every time step is allowed to flow onward (realized with a biLSTM in the modeling layer), reducing the information loss caused by early summarization.

  • Second, we use a memory-less attention mechanism. That is, while we iteratively compute attention through time as in Bahdanau et al. (2015), the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.

2) Memory-less: at each time step the attention is computed only from the query and the current context paragraph, without directly depending on the attention of the previous step.

We hypothesize that this simplification leads to the division of labor between the attention layer and the modeling layer. It forces the attention layer to focus on learning the attention between the query and the context, and enables the modeling layer to focus on learning the interaction within the query-aware context representation (the output of the attention layer). It also allows the attention at each time step to be unaffected from incorrect attendances at previous time steps.

That is, labor is divided between the attention layer and the modeling layer: the former focuses on the interaction between context and query, while the latter focuses on the interaction among words within the query-aware context representation, i.e. the context weighted by the attention. This keeps the attention at each step unaffected by earlier attention mistakes.

  • Third, we use attention mechanisms in both directions, query-to-context and context-to-query, which provide complimentary information to each other.

3) Attention is computed in both directions, query-to-context (Q2C) and context-to-query (C2Q), which complement each other. On the dev set, removing C2Q and removing Q2C dropped performance by 12 and 10 points respectively, so the C2Q direction is clearly the more important one.


Character Embedding Layer and Word Embedding Layer -> Contextual Embedding Layer -> Attention Flow Layer -> Modeling Layer -> Output Layer

Character Embedding Layer and Word Embedding Layer

  • Character embedding of each word using a CNN; the outputs of the CNN are max-pooled over the entire width to obtain a fixed-size vector for each word.

  • pre-trained word vectors, GloVe

  • The concatenation of the two is passed to a two-layer highway network.

context -> $X\in R^{d\times T}$

query -> $Q\in R^{d\times J}$

Contextual Embedding Layer

model the temporal interactions between words using biLSTM.

context -> $H\in R^{2d\times T}$

query -> $U\in R^{2d\times J}$

The first three layers extract features of the context and the query at different levels of granularity.

Attention Flow Layer

the attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vector at each time step, along with the embeddings from previous layers, are allowed to flow through to the subsequent modeling layer.

Its inputs are H and U; its outputs are the query-aware context vectors G, passed onward together with the contextual embeddings from the previous layer.

This layer contains two attentions, context-to-query and query-to-context. They share a similarity matrix $S\in R^{T\times J}$ (not a plain matrix product, but a computation similar to that in Dynamic Memory Networks):

$$S_{tj}=\alpha(H_{:t},U_{:j})\in R$$

where $\alpha(h,u)=w_{(S)}^T[h;u;h\circ u]$, $w_{(S)}\in R^{6d}$

Context-to-query Attention:

Compute, for each context word, which query words are most relevant to it. So for the $t$-th context word, the weight over every query word is:

$$a_t=softmax(S_{t:})\in R^J$$

The weights are then applied to the query and summed (adding up the weighted query words), giving the attended query for the $t$-th context word:

$$\tilde U_{:t}=\sum_j a_{tj}U_{:j}\in R^{2d}$$

Doing this for every context word yields $\tilde U\in R^{2d\times T}$.

In short: compute the similarity between context and query, turn it into probabilities with softmax, and use them as weights on the query to obtain each context word's attended query.

Query-to-context Attention:

After computing the same similarity matrix S as in C2Q, find which context words are most relevant to some query word; these context words matter most for answering the question.

First take the maximum of each row of the similarity matrix over the query dimension, $max_{col}(S)\in R^T$, then apply softmax to get probabilities:

$$b=softmax(max_{col}(S))\in R^T$$

The weights b give the importance of each context word after comparison with the whole query; then take the weighted sum over the context:

$$\tilde h = \sum_tb_tH_{:t}\in R^{2d}$$

Tiling it T times gives $\tilde H\in R^{2d\times T}$.

Comparing C2Q and Q2C: Q2C seems the more fundamental one, since the answer we ultimately seek lies in the context. The two attentions also differ in what they sum over: when weighting the query we consider every context word separately, whereas when weighting the context we take, for each context word, the maximum relevance over all query words, because a context word should be attended as long as it is related to any query word.
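Both attention directions can be sketched in numpy (a minimal, loop-based version for clarity; shapes follow the text, names are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U, w_s):
    """H: [2d, T] context, U: [2d, J] query, w_s: [6d].
    Returns (U_tilde [2d, T], H_tilde [2d, T])."""
    T, J = H.shape[1], U.shape[1]
    S = np.empty((T, J))                       # similarity S_tj = w_s . [h; u; h*u]
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            S[t, j] = w_s @ np.concatenate([h, u, h * u])
    a = softmax(S, axis=1)                     # C2Q: distribution over query words per t
    U_tilde = U @ a.T                          # [2d, T]
    b = softmax(S.max(axis=1))                 # Q2C: one distribution over context words
    H_tilde = np.tile((H @ b)[:, None], (1, T))
    return U_tilde, H_tilde
```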

Concatenating the three matrices gives G:

$$G_{:t}=\beta (H_{:t},\tilde U_{:t}, \tilde H_{:t})\in R^{d_G}$$

The function $\beta$ can be a multi-layer perceptron. In the authors' experiments:

$$\beta(h,\tilde u,\tilde h)=[h;\tilde u;h\circ \tilde u;h\circ \tilde h]\in R^{8d\times T}$$

Modeling Layer

captures the interaction among the context words conditioned on the query.

A biLSTM is used; each direction has output dimension d, so the final output is $M\in R^{2d\times T}$.

Output Layer

The output layer is application-specific. For QA, it finds the start $p^1$ and end $p^2$ in the paragraph.

Start index:

$$p^1=softmax(w_{(p^1)}^T[G;M])$$

where $w_{(p^1)}\in R^{10d}$

End index: M is passed through another biLSTM to obtain $M^2\in R^{2d\times T}$, and

$$p^2=softmax(w_{(p^2)}^T[G;M^2])$$

where $w_{(p^2)}\in R^{10d}$.


$$L(\theta)=-{1 \over N} \sum^N_i[log(p^1_{y_i^1})+log(p^2_{y_i^2})]$$

$\theta$ includes the parameters:

  • the weights of CNN filters and LSTM cells

  • $w_{(S)}$, $w_{(p^1)}, w_{(p^2)}$

$y_i^1,y_i^2$ are the indices in the context of the start and end positions for sample i.

$p^1,p^2\in R^T$ are probabilities from the softmax. Viewing the ground truth as a one-hot vector $[0,0,\dots,1,0,0,0]$, the cross-entropy for a single sample is:

$$- log(p^1_{y_i^1})-log(p^2_{y_i^2})$$


The answer span $(k; l)$ where $k \le l$ with the maximum value of $p^1_kp^2_l$ is chosen, which can be computed in linear time with dynamic programming.
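The linear-time span search can be sketched in pure Python: keep a running argmax of $p^1$ over positions up to $l$, so each end position is paired with the best start seen so far (an illustrative sketch, not any official implementation):

```python
def best_span(p1, p2):
    """Pick (k, l) with k <= l maximizing p1[k] * p2[l], in one pass."""
    arg_k, best, span = 0, -1.0, (0, 0)
    for l in range(len(p1)):
        if p1[l] > p1[arg_k]:
            arg_k = l                      # best start position seen so far (<= l)
        if p1[arg_k] * p2[l] > best:
            best, span = p1[arg_k] * p2[l], (arg_k, l)
    return span
```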

Paper Notes: Memory Networks

Notes on Memory Networks papers.

  • Memory Network with strong supervision
  • End-to-End Memory Network
  • Dynamic Memory Network

Paper reading 1: Memory Networks, Jason Weston


RNNs compress information into a final state, which limits their capacity for memorization; memory networks were proposed to improve on this.

However, their memory (encoded by hidden states and weights) is typically too small, and is not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors). RNNs are known to have difficulty in performing memorization.

The basic motivation of Memory Networks is the need for long-term memory to store QA knowledge or dialogue context; existing RNNs do not perform that well at long-term memorization.

Memory Networks

four components:

  • I:(input feature map)

Maps the input to a feature vector. This can involve feature engineering such as parsing, coreference, and entity resolution, or an RNN/LSTM/GRU. The unit is usually a sentence, represented as a sparse or dense feature vector.

  • G:(generalization)

Updates the memories with the new input.

  • O:(output feature map)

Produces an output in feature space, given the new input and the current memory state.

  • R:(response)

Decodes the output features into the final response, r = R(o).

1. I component: encode input text into an internal feature representation.

Various features can be chosen, e.g. bag of words, RNN encoder states, etc.

2. G component: generalization, i.e. updating the memories by combining the old memories with the input: $m_i=G(m_i, I(x),m), \forall i$

The simplest memory update is $m_{H(x)}=I(x)$, where $H(x)$ is a slot-selecting (addressing) function: G updates the index of m, storing the new input I(x) into the next free slot $m_n$ without touching existing memories. More sophisticated G functions could update earlier memories, or even all of them.

In QA, the new input is the combination of the question and an old memory, $[I(x), m_i]$.

3.O component: reading from memories and performing inference, calculating what are the relevant memories to perform a good response.





With k=2 supporting memories:

$$o_1 = O_1(x,m)=\arg\max_{i=1,\dots,N}s_O(x,m_i)$$

$$o_2 = O_2(x,m)=\arg\max_{i=1,\dots,N}s_O([x,m_{o_1}],m_i)$$

The output $[q,o_1, o_2]$ is also the input to module R.

$s_O$ is a function that scores the match between the pair of sentences x and $m_i$, i.e. it measures how relevant the memory $m_i$ is to the question x.


$s_O$ measures the relevance between the question q and the current memory m; in its bilinear form,

$$s(x,y)=\Phi_x(x)^TU^TU\Phi_y(y)$$

U are bilinear-regression parameters, trained so that the score $qUU^Tm_{true}$ of a relevant fact is higher than the score $qUU^Tm_{random}$ of an irrelevant one.

4. R component: decode the output features o to obtain the final response r=R(o):

$$r=argmax_{w\in W}s_R([q,m_{o_1},m_{o_2}],w)$$

W is the vocabulary; $s_R$ picks the word most relevant to the output features o.

$s_R$ has the same form as $s_O$.


The Huge Memory Problem

If the memory is huge, e.g. Freebase or Wikipedia:

  • Memories can be stored by entity or topic, so that G need not operate over the entire memory.

  • If memory is full, a forgetting mechanism can replace the least useful memories: H scores each memory and overwrites accordingly.

  • Words can also be hashed, or word embeddings clustered; in short, the input I(x) is put into one or more buckets, and scores are computed only against memories in the same bucket.


The loss function, with k=2 supporting facts selected and a single-word response, is a margin ranking loss:

$$\begin{aligned}L=&\sum_{\bar f\ne m_{o_1}}\max(0,\gamma-s_O(x,m_{o_1})+s_O(x,\bar f))\\+&\sum_{\bar f'\ne m_{o_2}}\max(0,\gamma-s_O([x,m_{o_1}],m_{o_2})+s_O([x,m_{o_1}],\bar f'))\\+&\sum_{\bar r\ne r}\max(0,\gamma-s_R([x,m_{o_1},m_{o_2}],r)+s_R([x,m_{o_1},m_{o_2}],\bar r))\end{aligned}$$

i.e. each term minimizes a hinge loss of the form $L_i = \sum_{j\ne y_i}\max(0,s_j - s_{y_i}+\Delta)$, where $\bar f, \bar f',\bar r$ are negative samples: in term (8), r is the true response while $\bar r$ is another word sampled from the vocabulary.


(6) whether the correct first sentence is picked

(7) given the correct first sentence, whether the correct second sentence is picked

(6)+(7) together decide whether the right context is selected, training the attention parameters

(8) given the correct supporting facts as input, whether the correct answer is picked, training the response parameters

Paper reading 2 End-To-End Memory Networks



The model in that work was not easy to train via backpropagation, and required supervision at each layer of the network.

This paper can be seen as an improved version of the previous Memory Networks paper.

Our model can also be seen as a version of RNNsearch with multiple computational steps (which we term “hops”) per output symbol.

It can also be seen as applying multiple hops to RNNsearch (Neural Machine Translation by Jointly Learning to Align and Translate).

Model architecture

Single layer


  • input: $x_1,…,x_i$

  • query: q

  • answer: a



  • memory vector {$m_i$}: ${x_i}\stackrel A\longrightarrow {m_i}$

  • internal state u: $q\stackrel B \longrightarrow u$

2. Compute attention, i.e. the match between the query representation u and the representation $m_i$ of each input sentence: take the inner product of u with each memory $m_i$, followed by a softmax.


$$p_i = softmax(u^Tm_i)$$

p is a probability vector over the inputs.

3. Obtain the context vector:

  • output vector: ${x_i}\stackrel C\longrightarrow {c_i}$

The response vector from the memory o is then a sum over the transformed inputs ci, weighted by the probability vector from the input:

$$o = \sum_ip_ic_i$$

Unlike the strongly supervised Memory Networks, the output here is a weighted average rather than an argmax.


$$\hat a =softmax(Wu^{k+1})= softmax(W(o^k+u^k))$$

W can be viewed as an inverse embedding, with W.shape = [embed_size, V].
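A single MemN2N hop can be sketched in numpy with bag-of-words inputs (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memn2n_hop(X, q, A, B, C, W):
    """Single-hop end-to-end memory network.
    X: [n_sent, V] multi-hot sentences, q: [V] multi-hot question.
    A, B, C: [V, e] embedding matrices, W: [e, V] prediction matrix."""
    m = X @ A                      # memory vectors m_i
    c = X @ C                      # output vectors c_i
    u = q @ B                      # internal state
    p = softmax(m @ u)             # attention over memories
    o = p @ c                      # response vector o = sum_i p_i c_i
    return softmax((o + u) @ W)    # a_hat: distribution over the vocabulary
```

Multiple hops simply repeat the middle steps with $u^{k+1}=u^k+o^k$.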

5. Decode $\hat a$ to get the natural-language response:

$$\hat a \stackrel C \longrightarrow a$$


A: input embedding matrix

C: output embedding matrix

W: answer prediction matrix

B: question embedding matrix


The memory {$m_i$} here is used directly for the output vectors $c_i$. I do wonder why a separate output embedding C is needed instead of reusing $m_i$; whether such small tricks help is hard to say in advance, since either choice can be justified, and they are likely found by experiment.

Multiple Layers/ Multiple hops

The multi-layer structure (K hops) is simple: it amounts to addressing/attending multiple times, each time focusing on different memories, except that at hop k+1 the query representation combines the previous context vector with the query; the rest is almost unchanged.




  • input components: map the query and sentences into feature space

  • generalization components: update the memory; the memory changes here too, ${m_i}=AX$, with the embedding matrix A varying per layer

  • output components: attention computes the match between memory and query by inner product followed by softmax, then updates the input as $[u_k,o_k]$ via addition/concatenation or an RNN. The difference: the previous paper used argmax, $o_2=O_2(q,m)=\arg\max_{i=1,2,..,N}s_O([q,o_1],m_i)$, i.e. selecting the single best-matching memory $m_i$, while this paper takes a weighted sum over all memories

  • response components: similar to the output components; the previous paper matched against every word in the vocabulary, $r=\arg\max_{w\in W}s_R([q,m_{o_1},m_{o_2}],w)$, while this paper computes $\hat a=softmax(Wu^{k+1})=softmax(W(u^k+o^k))$ and trains the answer prediction matrix W by minimizing cross-entropy

Overall, it is similar to the Memory Network model in [23], except that the hard max operations within each layer have been replaced with a continuous weighting from the softmax.


Each layer has embedding matrices $A^k, C^k$ used to embed the inputs {$x_i$}. To reduce the number of trainable parameters, the authors tried two schemes:

  1. Adjacent
  • the output embedding of one layer is the input embedding of the next, $A^{k+1}=C^k$

  • the output embedding of the last layer serves as the prediction matrix, $W^T=C^K$

  • the question embedding equals the input embedding of the first layer, $B=A^1$

  2. Layer-wise (RNN-like)
  • $A^1=A^2=…=A^k, C^1=C^2=…C^k$

  • $u^{k+1} = Hu^k+o^k$



Dataset source: Towards AI-complete question answering: A set of prerequisite toy tasks

There are 20 QA tasks; each task has $I(I\le 320)$ sentences {$x_i$} and a vocabulary of size V=170, so this is a toy-scale benchmark. Each task has 1000 problems.

Model Details

Sentence representations


1. Bag-of-words (BOW) representation: each sentence is the sum of its word embeddings,

$$m_i=\sum_j Ax_{ij}$$

2. Encoding the position of words within the sentence (taking word order into account):

$$m_i=\sum_jl_j\cdot Ax_{ij}$$

where the weight vector has components $l_{kj}=(1-j/J)-(k/d)(1-2j/J)$.

While reading source code I noticed that many implementations use a position encoder different from the original paper; e.g. in domluna/memn2n the formula is:

$$l_{kj} = 1+4(k- (d+1)/2)(j-(J+1)/2)/d/J$$

The representation of word $x_{ij}$ is originally just the embedded $Ax_{ij}$ (shape=[1, embed_size]); now this vector gets a weight $l_j$, which is not a scalar but a vector, so each dimension of $Ax_{ij}$ is weighted differently.

Setting J=20 and d=50, the concrete difference between the two formulas can be plotted and compared.




It seems related to sentence structure; Peking University has a related paper, A Position Encoding Convolutional Neural Network Based on Dependency Tree for Relation Classification.

Here J is the sentence length and d the dimension of the embedding. This sentence representation is called position encoding (PE): word order affects the memory $m_i$.

Position encoding implementation:


```python
import numpy as np

def position_encoding(sentence_size, embedding_size):
    """Position Encoding described in section 4.1 [1]."""
    encoding = np.ones((embedding_size, sentence_size), dtype=np.float32)
    le = embedding_size + 1
    ls = sentence_size + 1
    for k in range(1, le):
        for j in range(1, ls):
            # Here is different from the paper.
            # Paper: l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)
            # Here:  l_{kj} = 1 + 4(k - (d+1)/2)(j - (J+1)/2) / d / J
            # Plot: https://www.wolframalpha.com/input/?i=1+%2B+4+*+((y+-+(20+%2B+1)+%2F+2)+*+(x+-+(50+%2B+1)+%2F+2))+%2F+(20+*+50)+for+0+%3C+x+%3C+50+and+0+%3C+y+%3C+20
            encoding[k-1, j-1] = (k - (embedding_size+1)/2) * (j - (sentence_size+1)/2)
    encoding = 1 + 4 * encoding / embedding_size / sentence_size
    # Make position encoding of time words identity to avoid modifying them
    encoding[:, -1] = 1.0  # the last (time-word) column has weight 1
    return np.transpose(encoding)  # [sentence_size, embedding_size]
```
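For comparison, the paper's original formula $l_{kj}=(1-j/J)-(k/d)(1-2j/J)$ can be sketched vectorized (my own helper, for contrasting the two weightings):

```python
import numpy as np

def position_encoding_paper(J, d):
    """Original PE weights from the paper: l_kj = (1 - j/J) - (k/d)(1 - 2j/J).
    Returns [J, d] so it lines up with the variant above."""
    j = np.arange(1, J + 1)[:, None]   # word position within the sentence
    k = np.arange(1, d + 1)[None, :]   # embedding dimension
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)
```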

Temporal Encoding



The memories become $m_i=\sum_j l_j\cdot Ax_{ij}+T_A(i)$, where $T_A(i)$ is the ith row of a special matrix $T_A$ that encodes temporal information. So i indexes the temporal information of the i-th sentence?

The output embedding is handled the same way: $c_i=\sum_j l_j\cdot Cx_{ij}+T_C(i)$.


Learning time invariance by injecting random noise

we have found it helpful to add "dummy" memories to regularize $T_A$.

Training Details

1.learning rate decay

2.gradient clip

3.linear start training

4.null padding, zero padding



Paper reading 3 Ask Me Anything: Dynamic Memory Networks for Natural Language Processing


Most tasks in natural language processing can be cast into question answering (QA) problems over language input.

Most NLP tasks can be cast as QA problems, e.g. QA itself, sentiment analysis, or part-of-speech tagging.

Model Architecture


  • Input Module: encode the input text into distributed representations

  • Question Module: encode the question into a distributed representation

  • Episodic Memory Module: use attention to select which parts of the input to focus on, then generate a memory vector representation

  • Answer Module: generate the answer from the final memory vector

Detailed visualization:

Input Module


1. If the input is a single sentence, the input module outputs the RNN hidden states, with $T_C= T_I$, where $T_I$ is the number of words in the sentence.

2. If the input is a list of sentences, an end-of-sentence token is inserted after each sentence, and each sentence's final hidden state serves as its representation; the input module then outputs $T_C$ states, where $T_C$ is the number of sentences in the sequence.


Question Module

The question is likewise encoded with a GRU; at time step t the hidden state is

$$q_t=GRU(L[w_t^Q],q_{t-1})$$

where L is the embedding matrix. The output is the final hidden state, $q=q_{T_Q}$, where $T_Q$ is the number of words in the question.

Episodic Memory Module

It consists of an internal memory, an attention mechanism, and a memory update mechanism. Its inputs are the outputs of the input module and the question module.

Each sentence representation (fact representation c) from the input module is fed into the episodic memory module for reasoning; attention extracts the relevant information from the input module, again with a multi-hop architecture.

1. Need for multiple episodes: iterating gives the model the ability of transitive inference.

2. Attention mechanism: a gating function serves as attention. Whereas attention in the end-to-end MemNN is linear regression, i.e. softmax over inner products, here a two-layer feed-forward network G is used.


$c_t$ is the candidate fact, $m^{i-1}$ the previous memory, and q the question; t indexes the time step within the sentence sequence, and i the episodic iteration.

The authors define a large feature vector $z(c,m,q)$ to capture the similarities between input, memory, and question:

$$z_t^i=[c_t, m^{i-1},q, c_t\circ q,c_t\circ m^{i-1},|c_t-q|,|c_t-m^{i-1}|, c_t^TW^{(b)}q, c_t^TW^{(b)}m^{i-1}]$$

In short, similarity is expressed via inner products and vector differences. This differs slightly from what Richard Socher presents in cs224d lecture 16 on dynamic memory networks, but since the features are hand-crafted, almost any variant can be justified.

The function G, a two-layer feed-forward network, then produces a scalar score:

$$G = \sigma(W^{(2)}tanh(W^{(1)}z_i^t+b^{(1)})+b^{(2)})$$

Plugging $c_t$, $m^{i-1}$, and q into G yields $g_i^t$, the score of candidate fact $c_t$.
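The gate computation above can be sketched in numpy (names and shapes are my own; z follows the feature list given earlier):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dmn_gate(c, m, q, Wb, W1, b1, W2, b2):
    """Gate score g for one candidate fact c, given memory m and question q.
    z is the hand-crafted feature vector; G is a two-layer feed-forward net."""
    z = np.concatenate([
        c, m, q, c * q, c * m, np.abs(c - q), np.abs(c - m),
        [c @ Wb @ q], [c @ Wb @ m],          # bilinear similarity terms
    ])                                       # length 7d + 2 for d-dim inputs
    return sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2)
```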

After the scores of an iteration are computed, the episode $e^i$, which acts like a context vector, is updated.

soft attention

In earlier attention mechanisms, e.g. the context vector in cs224d lecture 10 on machine translation and attention, and likewise in the end-to-end MemNN, attention is a weighted sum of fact representations.

attention based GRU

This paper instead applies a GRU to the fact representations c, with the gate added on top:


So the GRU here should run for $T_C$ steps, presumably?

The context vector of each iteration is the final hidden state of the attention-based GRU run over the input module's outputs:



The purpose of this attention mechanism is to produce the episode $e^i$, a summary of all input information relevant in iteration i, i.e. a context vector compressing the input text into one vector. The end-to-end MemNN used soft attention, a plain weighted sum; here a GRU is used, and the per-step weights act as a gate rather than being multiplied in directly.

3.Memory Update Mechanism

The episode $e^i$ from the previous step and the memory $m^{i-1}$ from the previous iteration are the inputs for updating the memory $m^i$:


$m^0=q$, so the GRU here is presumably a single step.

After $T_M$ iterations, $m=m^{T_M}$ is the output of the episodic memory module and the input of the answer module.

In the end-to-end MemNN the memory update is $u^{k+1}=u^k+o^k$; the analogous form here would be $m^{i}=e^i+m^{i-1}$. Instead the authors use an RNN as a nonlinear map, updating the episodic memory from the episode $e^i$ and the previous memory $m^{i-1}$, with the GRU's initial state containing the question information, $m^0=q$.

4.Criteria for stopping

The episodic memory module needs a signal to stop iterating. A special end-of-passes token can be appended to the input; if the gate selects it, iteration stops. For datasets without this explicit supervision, a maximum number of iterations can be set.

Answer Module

A GRU decoder is used. Its inputs are the question module's output q and the previous hidden state $a_{t-1}$; the initial state is the episodic memory module's output, $a_0=m^{T_M}$.





Cross-entropy is used as the objective. If the dataset has gate supervision, the gates' cross-entropy can be added to the total cost and trained jointly. Training uses plain backpropagation and gradient descent.

Summary: comparison with the previous paper, End-to-End Memory Networks

  • input components: end2end MemNN uses embeddings, while DMN uses a GRU

  • generalization components, i.e. memory update: end2end MemNN adds linearly, $u^{k+1}=u^k+o^k$, where $o^k$ is the memory vector obtained after attention

  • output components: end2end MemNN compares memory and query by inner product, softmaxes for weights, and takes a weighted sum to get the context vector. DMN instead hand-crafts the similarity features, scores them with a two-layer feed-forward network, softmaxes the scores into weights, treats the weights as gates, and uses a GRU to produce the context vector

  • response components: end2end MemNN predicts directly from the final top memory layer, while DMN uses the top memory as the decoder's initial hidden state


Paper reading 4 DMN+

paper:Dynamic Memory Networks for Visual and Textual Question Answering (2016)


DMN+ is proposed, an improved version of DMN, and is also applied to the Visual Question Answering task.

However, it was not shown whether the architecture achieves strong results for question answering when supporting facts are not marked during training or whether it could be applied to other modalities such as images.

This passage describes DMN's weakness: it performs poorly without annotated supporting facts. Yet DMN did not seem to require annotated supporting facts...?

Like the original DMN, this memory network requires that supporting facts are labeled during QA training. End-toend memory networks (Sukhbaatar et al., 2015) do not have this limitation.

Based on an analysis of the DMN, we propose several improvements to its memory and input modules. Together with these changes we introduce a novel input module for images in order to be able to answer visual questions.

This paper modifies DMN's input module and proposes a new architecture suitable for images.

DMN 存在的两个问题:


Using only a word-level GRU makes it hard to memorize information between distant supporting sentences.

Overall, the main contribution of this paper is the application to images; the input-module changes the authors describe mostly reduce computation, and both the bi-RNN and the position encoding of the improved version already appeared in other papers.

Model Architecture

As in DMN, it is divided into an input module, question module, episodic memory module, and answer module.

Input Module

input module for text QA

It has two components: a sentence reader and an input fusion layer.

Sentence reader: position encoding replaces an RNN for encoding a single sentence. The reason is that a GRU/LSTM here is computationally expensive and prone to overfitting (the bAbI vocabulary is tiny, a few dozen words), so the simpler method actually works better.

Input fusion layer: a bi-directional GRU provides context information, incorporating both past and future.

Overall, DMN+ replaces the single GRU with a hierarchical-RNN-like structure: a sentence reader produces each sentence's embedding, and an input fusion layer feeds the sentence embeddings into another GRU to obtain context information, solving long-distance dependencies between sentences.

input module for VQA

1. Local region feature extraction:

Local features come from a pretrained VGG. Each local feature vector passes through a linear layer and a tanh activation to give a feature embedding.

2.Input fusion layer:

Feed the feature embeddings into a bi-GRU.

Without global information, their representational power is quite limited, with simple issues like object scaling or locational variance causing accuracy problems. This emphasizes why the input fusion layer is needed.

Question Module

Same as in DMN: the question is text, encoded with an RNN.

Episodic Module

score mechanism

The output of the input module is:

$$\overleftrightarrow F=[\overleftrightarrow f_1, …,\overleftrightarrow f_N]$$


$$z_i^t=[\overleftrightarrow f_i\circ q,\overleftrightarrow f_i\circ m^{t-1},|\overleftrightarrow f_i-q|,|\overleftrightarrow f_i-m^{t-1}|]$$

This differs slightly from the DMN formulas above: here i indexes the time step in the input module and t the episodic iteration.


$$G = W^{(2)}tanh(W^{(1)}z_i^t+b^{(1)})+b^{(2)}$$

Unlike before, the score $g_i^t$ is obtained with a softmax rather than a sigmoid.


attention mechanism

Soft attention and the attention-based GRU are compared; unlike the DMN paper, a detailed comparison is given here.

Soft attention is the simple weighted sum:

$$c^t=\sum_{i=1}^Ng_i^t\overleftrightarrow f_i$$


Simple attention already seems good enough... $\overleftrightarrow f_i$ already incorporates word-order information, so wouldn't processing $\overleftrightarrow f_i$ with yet another GRU overfit?

attention based GRU

The attention gate $g_i^t$ replaces the update gate $u_i$. We know $u_i$ is computed from the current input and the previous hidden state, whereas the attention gate $g_i^t$ takes the question and the previous memory into account. Since the goal is to update the memory, this is quite sensible. Nice.

$$h_i=g_i^t\circ \tilde h_i+(1-g_i^t)\circ h_{i-1}$$
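One step of the attention-based GRU can be sketched in numpy (the reset-gate and candidate-state parts follow the standard GRU form; names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attn_gru_step(x, h_prev, g, Wr, Ur, br, W, U, bh):
    """One attention-based GRU step: the usual reset gate r is kept, but the
    update gate is replaced by the scalar attention gate g_i^t, so facts
    with g ~ 0 leave the hidden state untouched."""
    r = sigmoid(Wr @ x + Ur @ h_prev + br)            # reset gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev) + bh)  # candidate state
    return g * h_tilde + (1.0 - g) * h_prev           # gate-weighted update
```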

memory update mechanism

In DMN, the memory update came from a GRU over the previous memory and the current context vector. DMN+ instead concatenates the previous memory $m^{t-1}$, the current context $c^t$, and the question q, and passes them through a fully connected layer with a ReLU activation:

$$m_t = ReLU(W^t[m^{t-1},c^t,q]+b)$$


Answer Module