# Paper Notes: CoQA

## Motivation

We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.

CoQA is a conversational reading-comprehension dataset: 127k question-answer pairs obtained from 8k conversations spanning seven different domains.

The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage.

We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

CoQA poses challenges that traditional RC datasets do not, chiefly coreference and pragmatic reasoning.

We ask other people a question to either seek or test their knowledge about a subject. Depending on their answer, we follow up with another question and their answer builds on what has already been discussed. This incremental aspect makes human conversations succinct. An inability to build up and maintain common ground in this way is part of why virtual assistants usually don’t seem like competent conversational partners.

## Introduction

In CoQA, a machine has to understand a text passage and answer a series of questions that appear in a conversation. We develop CoQA with three main goals in mind.

The first concerns the nature of questions in a human conversation. Posing short questions is an effective human conversation strategy, but such questions are a pain in the neck for machines.

The second goal of CoQA is to ensure the naturalness of answers in a conversation. Many existing QA datasets restrict answers to a contiguous span in a given passage, also known as extractive answers (Table 1). Such answers are not always natural, for example, there is no extractive answer for Q4 (How many?) in Figure 1. In CoQA, we propose that the answers can be free-form text (abstractive answers), while the extractive spans act as rationales for the actual answers. Therefore, the answer for Q4 is simply Three while its rationale is spanned across multiple sentences.

The third goal of CoQA is to enable building QA systems that perform robustly across domains. The current QA datasets mainly focus on a single domain which makes it hard to test the generalization ability of existing models.

## Dataset collection

1. It consists of 127k conversation turns collected from 8k conversations over text passages (approximately one conversation per passage). The average conversation length is 15 turns, and each turn consists of a question and an answer.
2. It contains free-form answers. Each answer has an extractive rationale highlighted in the passage.
3. Its text passages are collected from seven diverse domains — five are used for in-domain evaluation and two are used for out-of-domain evaluation.

Almost half of CoQA questions refer back to conversational history using coreferences, and a large portion requires pragmatic reasoning making it challenging for models that rely on lexical cues alone.

The best-performing system, a reading comprehension model that predicts extractive rationales which are further fed into a sequence-to-sequence model that generates final answers, achieves an F1 score of 65.1%. In contrast, humans achieve 88.8% F1, a superiority of 23.7% F1, indicating that there is a lot of headroom for improvement.

The baseline converts an extractive reading-comprehension model into a seq2seq setup that generates the final answer from the predicted rationale, reaching 65.1% F1.

### question and answer collection

We want questioners to avoid using exact words in the passage in order to increase lexical diversity. When they type a word that is already present in the passage, we alert them to paraphrase the question if possible.

Questioners should avoid reusing words that appear in the passage as much as possible, which increases lexical diversity.

For the answers, we want answerers to stick to the vocabulary in the passage in order to limit the number of possible answers. We encourage this by automatically copying the highlighted text into the answer box and allowing them to edit copied text in order to generate a natural answer. We found 78% of the answers have at least one edit such as changing a word’s case or adding a punctuation.

### passage collection

Not all passages in these domains are equally good for generating interesting conversations. A passage with just one entity often results in questions that entirely focus on that entity. Therefore, we select passages with multiple entities, events and pronominal references using Stanford CoreNLP (Manning et al., 2014). We truncate long articles to the first few paragraphs that result in around 200 words.

Table 2 shows the distribution of domains. We reserve the Science and Reddit domains for out-of-domain evaluation. For each in-domain dataset, we split the data such that there are 100 passages in the development set, 100 passages in the test set, and the rest in the training set. For each out-of-domain dataset, we just have 100 passages in the test set.

The in-domain data covers Children's Stories, Literature, Mid/High School exams, News, and Wikipedia; for each in-domain dataset, 100 passages go to the development set, 100 to the test set, and the rest to the training set. The out-of-domain data covers Science and Reddit, each contributing 100 passages to the test set only.


### Collecting multiple answers

Some questions in CoQA may have multiple valid answers. For example, another answer for Q4 in Figure 2 is A Republican candidate. In order to account for answer variations, we collect three additional answers for all questions in the development and test data.

In the previous example, if the original answer was A Republican Candidate, then the following question Which party does he belong to? would not have occurred in the first place. When we show questions from an existing conversation to new answerers, it is likely they will deviate from the original answers which makes the conversation incoherent. It is thus important to bring them to a common ground with the original answer.

We achieve this by turning the answer collection task into a game of predicting original answers. First, we show a question to a new answerer, and when she answers it, we show the original answer and ask her to verify if her answer matches the original. For the next question, we ask her to guess the original answer and verify again. We repeat this process until the conversation is complete. In our pilot experiment, the human F1 score is increased by 5.4% when we use this verification setup.

## Dataset Analysis

What makes the CoQA dataset conversational compared to existing reading comprehension datasets like SQuAD? How does the conversation flow from one turn to the other? What linguistic phenomena do the questions in CoQA exhibit? We answer these questions below.

1. Pronouns (he, him, she, it, they) appear much more frequently, whereas SQuAD has almost none.

2. In SQuAD, what accounts for nearly half of the questions; CoQA's question types are far more varied, with did, was, is, and does appearing frequently.

3. CoQA questions are shorter. See Figure 3.

4. About 33% of the answers are abstractive. Given that extractive answers are clearly easier for annotators to write, this is higher than the authors expected. Yes/no answers also make up a noticeable share.

### Conversation Flow

A coherent conversation must have smooth transitions between turns.

### Linguistic Phenomena

Relationship between a question and its passage:

• Lexical match: the question shares at least one word with the passage.

• Paraphrasing: the question shares no words with the passage but rephrases the rationale, i.e., asks the same thing in other words. This typically involves synonymy, antonymy, hypernymy, hyponymy, and negation.

• Pragmatics: requires reasoning beyond the text.

Relationship between a question and its conversation history:

• No coref

• Explicit coref.

• Implicit coref.

# Paper Notes: CNNs and Natural Language Processing

• Embedding: use pre-trained Chinese word vectors.

• Encoder: encode the passage, query, and alternatives with a Bi-GRU.

• Attention: compute a trilinear similarity matrix, apply the mask, then use BiDAF-style bi-attention flow to obtain the attended passage.

• Contextual: encode the attended passage with a Bi-GRU to obtain the fusion representation.

• Match: use attention pooling to turn the fusion and enc_answer into single vectors, then use cosine similarity to pick the most similar answer (see the sketch after this list).

• The training set can first be pre-trained with ELMo or word2vec to obtain task-specific word vectors.

• The attention layer could use richer formulations, as many papers do, and could even add hand-crafted features, e.g., those mentioned in Su Jianlin's blog.

• Another important piece is the match part: could attention pooling be replaced with something better?
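A rough sketch of the attention-pooling plus cosine-matching step described above, assuming PyTorch tensors; the pooling vector `w` and all shapes are illustrative stand-ins, not the original implementation.

```python
import torch
import torch.nn.functional as F

def attention_pool(seq, w):
    """Collapse a [batch, len, dim] sequence into [batch, dim] with learned attention weights."""
    scores = torch.einsum('bld,d->bl', seq, w)        # [batch, len]
    alpha = F.softmax(scores, dim=-1)                 # attention weights over positions
    return torch.einsum('bl,bld->bd', alpha, seq)     # weighted sum

# hypothetical shapes: fusion = attended passage, enc_answers = encoded alternatives
batch, plen, alen, dim, n_alt = 4, 100, 8, 128, 3
fusion = torch.randn(batch, plen, dim)
enc_answers = torch.randn(batch, n_alt, alen, dim)
w = torch.randn(dim, requires_grad=True)              # learnable pooling vector

passage_vec = attention_pool(fusion, w)                                   # [batch, dim]
answer_vecs = attention_pool(enc_answers.view(-1, alen, dim), w)          # [batch*n_alt, dim]
answer_vecs = answer_vecs.view(batch, n_alt, dim)

# cosine similarity between the pooled passage and each alternative; pick the best one
sims = F.cosine_similarity(passage_vec.unsqueeze(1).expand_as(answer_vecs), answer_vecs, dim=-1)
best = sims.argmax(dim=-1)                            # index of the most similar alternative
```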

# ConvS2S

This Facebook paper breaks with the conventional thinking: it uses CNNs not only to encode global information but also as the decoder.

## Motivation

Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared to a linear number O(n) for recurrent neural networks.

Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of nonlinearities applied to the inputs also eases learning.

## Model Architecture

• position embedding

• convolution block structure

• Multi-step attention

### convolution blocks

$$h_l=(XW+b)\otimes \sigma(XV+c)$$

The output of each layer is a linear projection X ∗ W + b modulated by the gates σ(X ∗ V + c). Similar to LSTMs, these gates multiply each element of the matrix X ∗ W + b and control the information passed on in the hierarchy.

$$h_i^l=tanh(XW+b)\otimes \sigma(XV+c)$$

Residual connection: to build deeper convolutional networks, the authors add residual connections.

$$h_i^l=v(W^l[h_{i-k/2}^{l-1},…,h_{i+k/2}^{l-1}]+b_w^l)+h_i^{l-1}$$

For instance, stacking 6 blocks with k = 5 results in an input field of 25 elements, i.e. each output depends on 25 inputs. Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed.
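A minimal sketch of one gated convolutional block with a residual connection, roughly matching the equations above; channel sizes, the padding choice, and the omitted sqrt(0.5) scaling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """One ConvS2S-style block: conv to 2*d channels, GLU gate, residual add."""
    def __init__(self, d, k):
        super().__init__()
        # 2*d output channels: one half is the linear part XW+b, the other the gate XV+c
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)

    def forward(self, x):              # x: [batch, d, time]
        h = self.conv(x)               # [batch, 2d, time]
        h = F.glu(h, dim=1)            # (XW+b) * sigmoid(XV+c) -> [batch, d, time]
        return h + x                   # residual: h_i^l = v(...) + h_i^{l-1}

x = torch.randn(2, 128, 30)            # batch=2, d=128, 30 time steps
block = GLUConvBlock(d=128, k=5)
print(block(x).shape)                  # torch.Size([2, 128, 30])
# Stacking 6 such blocks with k = 5 gives each output a 25-element input field.
```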

• ConvS2S uses 1-D convolutions: the kernel slides only along the time dimension, with a fixed stride of 1. Language does not have the scale invariance of images: an image can be uniformly downsampled without changing what it depicts, but taking every other word of a sentence changes its meaning a great deal.

• In images, a convolutional layer usually has many filters to capture different patterns, but in ConvS2S each layer has only one filter. A sentence enters the filter as a tensor of shape [1, n, d], where n is the sentence length along which the filter convolves, and d is the word-vector dimension, which can be thought of as channels, analogous to the RGB channels of a color image.

Unlike the common practice in vision, Facebook did not put multiple filters in each layer. One reason is to simplify the model and speed up convergence; another is that they may consider the patterns of a sentence far simpler than those of an image, so stacking one-filter layers can still capture them all. The former is the more likely reason, since multi-head attention works very well in the Transformer, which suggests a sentence does contain multiple patterns.

For encoder networks we ensure that the output of the convolutional layers matches the input length by padding the input at each layer. However, for decoder networks we have to take care that no future information is available to the decoder (Oord et al., 2016a). Specifically, we pad the input by k − 1 elements on both the left and right side by zero vectors, and then remove k elements from the end of the convolution output.
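A small sketch of a causal (decoder-side) 1-D convolution, assuming the usual equivalent form of the trick quoted above: pad only on the left with k-1 zeros so that output position i never sees inputs after i; any extra shift for the targets is left out here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Decoder-side convolution: no output position can look at future time steps."""
    def __init__(self, d, k):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(d, d, kernel_size=k)     # no built-in padding

    def forward(self, x):                              # x: [batch, d, time]
        x = F.pad(x, (self.k - 1, 0))                  # zero-pad k-1 steps on the left only
        return self.conv(x)                            # output length == input length

y = CausalConv1d(d=128, k=5)(torch.randn(2, 128, 30))
print(y.shape)                                         # torch.Size([2, 128, 30])
```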

### Multi-step Attention

$$d_i^l=W_d^lh_i^l+b_d^l+g_i$$

$$a_{ij}^l=\dfrac{exp(d_i^l\cdot z_j^u)}{\sum_{t=1}^m exp(d_i^l\cdot z_t^u)}$$

$$c_i^l=\sum_{j=1}^ma_{ij}^l(z_j^u+e_j)$$

• Training uses teacher forcing: the kernel $W_d^l$ convolves over the target-side states $h^l$, giving $(W_d^lh_i^l + b_d^l)$, analogous to the hidden state of an RNN decoder; adding the embedding of the previous target word $g_i$ yields $d_i^l$.

• This interacts with the encoder outputs for the source sentence, and a softmax produces the attention weights $a_{ij}^l$.

• The resulting attention vector differs from an RNN decoder's: the input element embeddings $e_j$ are added in.

We found adding e_j to be beneficial and it resembles key-value memory networks where the keys are the z_j^u and the values are the z^u_j + e_j (Miller et al., 2016). Encoder outputs z_j^u represent potentially large input contexts and e_j provides point information about a specific input element that is useful when making a prediction. Once c^l_i has been computed, it is simply added to the output of the corresponding decoder layer h^l_i.

$z_j^u$ carries the richer context, while $e_j$ points out the specific input element useful for the prediction. In practice, you only really appreciate it once you try it.

This can be seen as attention with multiple ’hops’ (Sukhbaatar et al., 2015) compared to single step attention (Bahdanau et al., 2014; Luong et al., 2015; Zhou et al., 2016; Wu et al., 2016). In particular, the attention of the first layer determines a useful source context which is then fed to the second layer that takes this information into account when computing attention etc. The decoder also has immediate access to the attention history of the k − 1 previous time steps because the conditional inputs $c^{l-1}_{i-k}, \ldots, c^{l-1}_i$ are part of $h^{l-1}_{i-k}, \ldots, h^{l-1}_i$ which are input to $h^l_i$. This makes it easier for the model to take into account which previous inputs have been attended to already compared to recurrent nets where this information is in the recurrent state and needs to survive several non-linearities. Overall, our attention mechanism considers which words we previously attended to (Yang et al., 2016) and performs multiple attention ’hops’ per time step. In Appendix §C, we plot attention scores for a deep decoder and show that at different layers, different portions of the source are attended to.
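A sketch of one decoder layer's attention step following the three equations above; all tensors are random stand-ins and $W_d^l$ is a plain linear layer here.

```python
import torch
import torch.nn.functional as F

batch, tgt_len, src_len, d = 2, 7, 11, 128
h = torch.randn(batch, tgt_len, d)      # decoder states h_i^l of the current layer
g = torch.randn(batch, tgt_len, d)      # embeddings of the previous target words g_i
z = torch.randn(batch, src_len, d)      # last encoder layer outputs z_j^u
e = torch.randn(batch, src_len, d)      # source input embeddings e_j
W_d = torch.nn.Linear(d, d)

d_state = W_d(h) + g                                   # d_i^l = W_d^l h_i^l + b_d^l + g_i
scores = torch.bmm(d_state, z.transpose(1, 2))         # d_i^l . z_j^u  -> [batch, tgt, src]
a = F.softmax(scores, dim=-1)                          # attention weights a_ij^l
c = torch.bmm(a, z + e)                                # c_i^l = sum_j a_ij^l (z_j^u + e_j)
h_next = h + c                                         # c_i^l is added to the decoder layer output
```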

# FAST READING COMPREHENSION WITH CONVNETS

Gated Linear Dilated Residual Network (GLDR):

a combination of residual networks (He et al., 2016), dilated convolutions (Yu & Koltun, 2016) and gated linear units (Dauphin et al., 2017).

## text understanding with dilated convolution

kernel:$k=[k_{-l},k_{-l+1},…,k_l]$, size=$2l+1$

input: $x=[x_1,x_2,…,x_n]$

dilation: d

$$(k*x)_t=\sum_{i=-l}^l k_i\cdot x_{t + d\cdot i}$$

Repeated dilated convolution (Yu & Koltun, 2016) increases the receptive region of ConvNet outputs exponentially with respect to the network depth, which results in drastically shortened computation paths.
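A small numeric sketch of the dilated-convolution formula above (plain NumPy, zero padding outside the input), showing how the dilation d widens the receptive field without adding parameters.

```python
import numpy as np

def dilated_conv(x, k, d):
    """(k * x)_t = sum_i k_i * x_{t + d*i}, kernel indexed from -l to l."""
    l = (len(k) - 1) // 2
    out = np.zeros_like(x)
    for t in range(len(x)):
        for i in range(-l, l + 1):
            j = t + d * i
            if 0 <= j < len(x):                # positions outside the input count as zero
                out[t] += k[i + l] * x[j]
    return out

x = np.arange(8, dtype=float)
k = np.array([1.0, 1.0, 1.0])                  # kernel width 3, i.e. l = 1
print(dilated_conv(x, k, d=1))                 # ordinary convolution, receptive field 3
print(dilated_conv(x, k, d=2))                 # dilation 2, receptive field 5
# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially.
```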

## Model Architecture

The receptive field of this convolutional network grows exponentially with depth and soon encompasses a long sequence, essentially enabling it to capture similar long-term dependencies as an actual sequential model.

Convolutional BiDAF. In our convolutional version of BiDAF, we replaced all bidirectional LSTMs with GLDRs. We have two 5-layer GLDRs in the contextual layer whose weights are un-tied. In the modeling layer, a 17-layer GLDR with dilation 1, 2, 4, 8, 16 in the first 5 residual blocks is used, which results in a reception region of 65 words. A 3-layer GLDR replaces the bidirectional LSTM in the output layer. For simplicity, we use same-padding and kernel size 3 for all convolutions unless specified. The hidden size of all GLDRs is 100 which is the same as the LSTMs in BiDAF.

# Paper Notes: QANet

paper:

## Motivation

Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models.

The encoder consists solely of convolution and self-attention; with no RNNs, it is simply fast.

The key motivation behind the design of our model is the following: convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.

we propose a complementary data augmentation technique to enhance the training data. This technique paraphrases the examples by translating the original sentences from English to another language and then back to English, which not only enhances the number of training instances but also diversifies the phrasing.

## Model

• an embedding layer

• an embedding encoder layer

• a context-query attention layer

• a model encoder layer

• an output layer.

the combination of convolutions and self-attention is novel, and is significantly better than self-attention alone and gives 2.7 F1 gain in our experiments. The use of convolutions also allows us to take advantage of common regularization methods in ConvNets such as stochastic depth (layer dropout) (Huang et al., 2016), which gives an additional gain of 0.2 F1 in our experiments.

Combining CNNs with self-attention works better than self-attention alone. Using CNNs also allows common ConvNet regularizers such as stochastic depth (layer dropout), which brings a small extra gain.

### Input embedding layer

obtain the embedding of each word w by concatenating its word embedding and character embedding.

Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters. The length of each word is either truncated or padded to 16. We take maximum value of each row of this matrix to get a fixed-size vector representation of each word.

### Embedding encoding layer

The encoder layer is a stack of the following basic building block: [convolution-layer × # + self-attention-layer + feed-forward-layer]

• Convolution: depthwise separable convolutions are used instead of traditional convolutions, because the authors found that it is memory efficient and has better generalization (see the original paper for the details). The kernel size is 7, the number of filters is d = 128.

Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block, shown lower-right in Figure 1. For an input x and a given operation f, the output is f(layernorm(x))+x.
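A sketch of the depthwise separable convolution and the pre-norm residual wrapper f(layernorm(x)) + x described above; kernel size 7 and d = 128 follow the text, everything else is an assumption.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, d=128, k=7):
        super().__init__()
        self.depthwise = nn.Conv1d(d, d, kernel_size=k, padding=k // 2, groups=d)
        self.pointwise = nn.Conv1d(d, d, kernel_size=1)

    def forward(self, x):                       # x: [batch, d, time]
        return self.pointwise(self.depthwise(x))

class ResidualBlock(nn.Module):
    """QANet-style wrapper: output = f(layernorm(x)) + x."""
    def __init__(self, d, f):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.f = f

    def forward(self, x):                       # x: [batch, time, d]
        y = self.norm(x).transpose(1, 2)        # LayerNorm over features, then to [batch, d, time]
        return self.f(y).transpose(1, 2) + x

block = ResidualBlock(128, DepthwiseSeparableConv(128, 7))
print(block(torch.randn(2, 40, 128)).shape)     # torch.Size([2, 40, 128])
```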

### Context-Query Attention Layer

context: $C=\{c_1, c_2,…,c_n\}$

query: $Q=\{q_1,q_2,…,q_m\}$.

• context: [batch, context_n, embed_size]

• query: [batch, query_m, embed_size]

sim_matrix: [batch, context_n, query_m]

The similarity function used here is the trilinear function (Seo et al., 2016). $f(q,c)=W_0[q,c,q\circ c]$.

#### context-to-query

$A = \tilde SQ^T$, shape = [batch, context_n, embed_size]

#### query-to-context

Empirically, we find that, the DCN attention can provide a little benefit over simply applying context-to-query attention, so we adopt this strategy.

$\tilde S$.shape=[batch, context_n, query_m]

$\overline S^T$.shape=[batch, query_m, context_n]

$C^T$.shape=[batch, context_n, embed_size]
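A sketch of the trilinear similarity and the two attention outputs, context-to-query A and the DCN-style query-to-context B, using the shapes listed above; masking and initialization are omitted and the memory-hungry expand is only for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, n, m, d = 2, 50, 10, 128
C = torch.randn(batch, n, d)                 # context
Q = torch.randn(batch, m, d)                 # query
w0 = nn.Linear(3 * d, 1, bias=False)         # trilinear weights W_0

# trilinear similarity f(q, c) = W_0 [q, c, q o c]  ->  S: [batch, n, m]
c_exp = C.unsqueeze(2).expand(batch, n, m, d)
q_exp = Q.unsqueeze(1).expand(batch, n, m, d)
S = w0(torch.cat([q_exp, c_exp, q_exp * c_exp], dim=-1)).squeeze(-1)

S_row = F.softmax(S, dim=2)                  # normalized over query words  (S tilde)
S_col = F.softmax(S, dim=1)                  # normalized over context words (S bar)

A = torch.bmm(S_row, Q)                                        # context-to-query, [batch, n, d]
B = torch.bmm(torch.bmm(S_row, S_col.transpose(1, 2)), C)      # query-to-context, [batch, n, d]
```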

### Output layer

$$p^1=softmax(W_1[M_0;M_1]), p^2=softmax(W_2[M_0;M_2])$$

$$L(\theta)=-\dfrac{1}{N}\sum_i^N[log(p^1_{y^1})+log(p^2_{y^2})]$$

### What makes QANet good, and why?

• Separable convolutions not only have fewer parameters and run faster, they also work better: replacing them with conventional CNNs drops F1 by 0.7.

• Removing the CNNs drops F1 by 2.7.

• Removing self-attention drops F1 by 1.3.

• layer normalization

• residual connections

• L2 regularization

# Paper Notes: Pointer Networks and the Copy Mechanism

paper:

## Pointer Network

### Motivation

We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.

Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence [1] and Neural Turing Machines [2], because the number of target classes in each step of the output depends on the length of the input, which is variable.

Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class.

It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems – finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem – using training examples alone.

Ptr-Net can be used to learn approximate solutions to these three geometric problems.

Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on.

Ptr-Net not only improves over seq2seq with attention, but also generalizes to variable-size output dictionaries.

• First, simple copying is hard to achieve with traditional approaches, whereas Ptr-Net generates the output sequence directly from the input sequence.

• Second, it handles a variable output dictionary. A vanilla Seq2Seq model has a fixed-size output dictionary, which is unfriendly whenever the output contains input words (especially OOV and rare words). On the one hand, the embeddings of words rarely seen in training are of poor quality and hard to predict at decoding time; on the other hand, even with good embeddings, named entities such as person names all have similar embeddings, so it is hard to reproduce exactly the word mentioned in the input. The Pointer Network, and the copy mechanism of the follow-up CopyNet, handle this well: at each decoder time step, the model learns to copy key words that appear directly in the input.

### Model Architecture

#### sequence-to-sequence Model

$$p(C^P|P;\theta)=\prod_{i=1}^{m(P)}p_{\theta}(C_i|C_1,…,C_{i-1},P;\theta)$$

$$\theta^* = argmax_{\theta}\sum_{P,C^P}log p(C^P|P;\theta)$$

In this sequence-to-sequence model, the output dictionary size for all symbols $C_i$ is fixed and equal to n, since the outputs are chosen from the input. Thus, we need to train a separate model for each n. This prevents us from learning solutions to problems that have an output dictionary with a size that depends on the input sequence length.

#### Content Based Input Attention

This model performs significantly better than the sequence-to-sequence model on the convex hull problem, but it is not applicable to problems where the output dictionary size depends on the input.

Nevertheless, a very simple extension (or rather reduction) of the model allows us to do this easily.

#### Ptr-Net

A seq2seq model picks each output word with a softmax over a fixed dictionary and takes the most probable word to build the output sequence. Here, however, the output dictionary size depends on the length of the input sequence, so the authors propose a new, and actually very simple, model.

$$u_j^i=v^Ttanh(W_1e_j+W_2d_i), \quad j\in(1,…,n)$$

$$p(C_i|C_1,…,C_{i-1},P)=softmax(u^i)$$

Here i indexes the decoder time step and j indexes positions in the input sequence, so $e_j$ is the encoder hidden vector for position j and $d_i$ is the decoder hidden vector at step i. This is essentially standard attention, except that the softmax probabilities are used directly as a pointer distribution over the input sequence.
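A sketch of the pointer attention defined by the two equations above: the softmax over $u^i$ is itself the output distribution, so the "vocabulary" is just the set of input positions; all tensors are random stand-ins.

```python
import torch
import torch.nn.functional as F

n, d = 6, 64
e = torch.randn(n, d)                  # encoder states e_j, one per input position
d_i = torch.randn(d)                   # decoder state at step i
W1, W2, v = torch.randn(d, d), torch.randn(d, d), torch.randn(d)

u = torch.tanh(e @ W1 + d_i @ W2) @ v           # u_j^i = v^T tanh(W1 e_j + W2 d_i), shape [n]
p = F.softmax(u, dim=0)                         # p(C_i | C_1..C_{i-1}, P) over the n positions
pointer = p.argmax().item()                     # index of the selected input element
```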

## CopyNet

### Motivation

We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation.

The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation.

For example:

• What to copy: which parts of the input should be copied?

• Where to paste: where in the output should that information be placed?

### Model Architecture

• From a cognitive perspective, the copying mechanism is related to rote memorization, requiring less understanding but ensuring high literal fidelity.

• From a modeling perspective, the copying operations are more rigid and symbolic, making it more difficult than soft attention mechanism to integrate into a fully differentiable neural model.

Encoder:

An LSTM converts the source sequence into the hidden states $h_1,…,h_{T_S}$, which form the memory M.

Decoder:

• Prediction: COPYNET predicts words based on a mixed probabilistic model of two modes, namely the generate-mode and the copy-mode, where the latter picks words from the source sequence. The next word is thus predicted by a mixture of the two modes; the copy-mode, like Ptr-Net above, takes words from the source sentence.
• State Update: the predicted word at time t−1 is used in updating the state at t, but COPYNET uses not only its word-embedding but also its corresponding location-specific hidden state in M (if any). In other words, the hidden state at step t depends not only on the embedding of the word predicted at step t−1, but also on that word's location-specific hidden state in M.
• Reading M: in addition to the attentive read to M, COPYNET also has a “selective read” of M, which leads to a powerful hybrid of content-based addressing and location-based addressing. When to copy, when to rely on understanding, and how to mix the two modes is the key question.

#### Prediction with Copying and Generation:$s_t\rightarrow y_t$

$$p(y_t|s_t,y_{t-1},c_t,M)=p(y_t,g|s_t,y_{t-1},c_t,M) + p(y_t,c|s_t,y_{t-1},c_t,M)$$

• Content-based: attentive read from word embeddings.

• Location-based: selective read from location-specific hidden units.

$$p(y_t,g|\cdot)=\begin{cases} \dfrac{1}{Z}e^{\psi_g(y_t)}, & y_t\in V \\ 0, & y_t\in X \cap \overline V \\ \dfrac{1}{Z}e^{\psi_g(UNK)}, & y_t\notin V\cup X \end{cases}$$

$$p(y_t,c|\cdot)=\begin{cases}\dfrac{1}{Z}\sum_{j:x_j=y_t}e^{\psi_c(x_j)}, & y_t\in X \\ 0, & \text{otherwise}\end{cases}$$

Z is the normalization term shared by the two modes: $Z=\sum_{v\in V\cup\{UNK\}}e^{\psi_g(v)}+\sum_{x\in X}e^{\psi_c(x)}$.

Generate-Mode:

$$\psi_g(y_t=v_i)=\nu_i^TW_os_t, v_i\in V\cup UNK$$

• $W_o\in R^{(N+1)\times d_s}$

• $\nu_i$ is the one-hot vector for $v_i$; the result is the score of the current word.

The generate-mode score $\psi_g(y_t=v_i)$ is the same as in an ordinary encoder-decoder: a fully connected layer followed by a softmax.

copy-mode:

$$\psi_c(y_t=x_j)=\sigma(h_j^TW_c)s_t,\quad x_j\in X$$

• $h_j$ is the encoder hidden state; j indexes a position in the input sequence.

• $W_c\in R^{d_h\times d_s}$ maps $h_j$ into the same semantic space as $s_t$.

• The authors found a tanh non-linearity works better here. Also, since the word $y_t$ may appear several times in the input, the scores of all source positions equal to $y_t$ have to be summed.
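A sketch of how the generate-mode and copy-mode scores share one normalization Z, following the equations above; vocabulary ids, dimensions, and the toy source sentence are stand-ins.

```python
import torch
import torch.nn.functional as F

V, d_s, d_h, T = 10, 32, 32, 5              # vocab size (incl. UNK), decoder/encoder dims, source length
s_t = torch.randn(d_s)                       # decoder state at step t
h = torch.randn(T, d_h)                      # encoder hidden states of the source words
src_ids = torch.tensor([3, 7, 12, 3, 9])     # source word ids; 12 is outside the generate vocabulary
W_o = torch.randn(V, d_s)
W_c = torch.randn(d_h, d_s)

psi_g = W_o @ s_t                            # generate scores over the vocabulary, [V]
psi_c = torch.tanh(h @ W_c) @ s_t            # copy scores over source positions,  [T]

# one shared softmax over vocabulary entries and source positions (the shared Z)
probs = F.softmax(torch.cat([psi_g, psi_c]), dim=0)
p_gen, p_copy = probs[:V], probs[V:]

# total probability of emitting word 3: generate it, plus copy it from every position where it occurs
word = 3
p_word = p_gen[word] + p_copy[src_ids == word].sum()
```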

#### state update

$$c_t=\sum_{\tau=1}^{T_S}\alpha_{t\tau}h_{\tau}$$

$$\alpha_{t\tau}=\dfrac{e^{\eta(s_{t-1},h_{\tau})}}{\sum_{\tau’}e^{\eta(s_{t-1},h_{\tau’})}}$$

CopyNet's treatment of $y_{t-1}$ differs here: it uses not only the word embedding but also the hidden state at the corresponding location in M. Put differently, the representation of $y_{t−1}$ contains both parts, $[e(y_{t−1});\zeta(y_{t−1})]$: $e(y_{t−1})$ is the word embedding, and the extra term $\zeta(y_{t−1})$ is called the selective read, which is what allows longer phrases to be copied contiguously. Its form is much like attention: a weighted sum of the hidden states in M.

$$\zeta(y_{t-1})=\sum_{\tau=1}^{T_S}\rho_{t\tau}h_{\tau}$$

$$\rho_{t\tau}=\begin{cases}\dfrac{1}{K}p(x_{\tau},c|s_{t-1},M), & x_{\tau}=y_{t-1} \\ 0, & \text{otherwise} \end{cases}$$

• When $y_{t-1}$ does not appear in the source sentence, $\zeta(y_{t-1})=0$.

• Here $K=\sum_{\tau’:x_{\tau’}=y_{t-1}}p(x_{\tau’},c|s_{t-1},M)$ is the normalizing sum. Again, the current word may occur at several positions in the input, and each occurrence has its own encoder hidden state, so each gets its own weight.

• The paper does not explain this p; I guess it is the same as the copy score computed earlier?

• Intuitively, $\zeta(y_{t-1})$ can be seen as a selective read of M: compute the weights of all input positions equal to $y_{t-1}$, then take the weighted sum, which is exactly $\zeta(y_{t-1})$.

#### Hybrid Addressing of M

$$\zeta(y_{t-1}) \xrightarrow{\text{update}} s_t \xrightarrow{\text{predict}} y_t \xrightarrow{\text{sel. read}} \zeta(y_t)$$

### Learning

$$L=-\dfrac{1}{N}\sum_{k=1}^N\sum_{t=1}^Tlog[p(y_t^{(k)}|y_{<t}^{(k)}, X^{(k)})]$$

N is the batch size and T is the length of the target sentence.

# Paper Notes: Match-LSTM

## Motivation

In SQuAD, the answers do not come from a small set of candidate answers and they have variable lengths. We propose an end-to-end neural architecture for the task.

The architecture is based on match-LSTM, a model we proposed previously for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences.

• MCTest: A challenge dataset for the open-domain machine comprehension of text.

• Teaching machines to read and comprehend.

• The Goldilocks principle: Reading children’s books with explicit memory representations.

• Towards AI-complete question answering: A set of prerequisite toy tasks.

• SQuAD: 100,000+ questions for machine comprehension of text.

Traditional solutions to this kind of question answering tasks rely on NLP pipelines that involve multiple steps of linguistic analyses and feature engineering, including syntactic parsing, named entity recognition, question classification, semantic parsing, etc. Recently, with the advances of applying neural network models in NLP, there has been much interest in building end-to-end neural architectures for various NLP tasks, including several pieces of work on machine comprehension.

End-to-end model architecture:

• Teaching machines to read and comprehend.

• The Goldilocks principle: Reading children’s books with explicit memory representations.

• Attention-based convolutional neural network for machine comprehension

• Text understanding with the attention sum reader network.

• Consensus attention-based neural networks for chinese reading comprehension.

However, given the properties of previous machine comprehension datasets, existing end-to-end neural architectures for the task either rely on the candidate answers (Hill et al., 2016; Yin et al., 2016) or assume that the answer is a single token (Hermann et al., 2015; Kadlec et al., 2016; Cui et al., 2016), which make these methods unsuitable for the SQuAD dataset.

We propose two ways to apply the Ptr-Net model for our task: a sequence model and a boundary model. We also further extend the boundary model with a search mechanism.

## Model Architecture

### Pointer Network

Pointer Network (Ptr-Net) model : to solve a special kind of problems where we want to generate an output sequence whose tokens must come from the input sequence. Instead of picking an output token from a fixed vocabulary, Ptr-Net uses attention mechanism as a pointer to select a position from the input sequence as an output symbol.

### MATCH-LSTM AND ANSWER POINTER

• An LSTM preprocessing layer that preprocesses the passage and the question using LSTMs.

• A match-LSTM layer that tries to match the LSTM-encoded passage against the question.

• An Answer Pointer (Ans-Ptr) layer that uses Ptr-Net to select a set of tokens from the passage as the answer. The difference between the two proposed models lies only in this third layer.

#### LSTM preprocessing Layer

$$H^p=\overrightarrow {LSTM}(P), H^q=\overrightarrow {LSTM}(Q)$$

#### Match-LSTM Layer

$$\overrightarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overrightarrow {h^r}_{i-1}+b^p)\otimes e_Q)\in R^{l\times Q}$$

$$\overrightarrow \alpha_i=softmax(w^T\overrightarrow G_i + b\otimes e_Q)\in R^{1\times Q}$$

The resulting attention weight $\overrightarrow \alpha_{i,j}$ above indicates the degree of matching between the $i^{th}$ token in the passage and the $j^{th}$ token in the question.

$$\overrightarrow z_i=\begin{bmatrix} h_i^p \\ H^q\overrightarrow {\alpha_i}^T \end{bmatrix}$$

$$h^r=\overrightarrow{LSTM}(\overrightarrow{z_i},\overrightarrow{h^r_{i-1}})$$

$$\overleftarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overleftarrow {h^r}_{i-1}+b^p)\otimes e_Q)$$

$$\overleftarrow \alpha_i=softmax(w^T\overleftarrow G_i + b\otimes e_Q)$$

• $\overrightarrow {H^r}\in R^{l\times P}$ denotes the hidden states $[\overrightarrow {h^r_1}, \overrightarrow {h^r_2},…,\overrightarrow {h^r_P}]$.

• $\overleftarrow {H^r}\in R^{l\times P}$ denotes the hidden states $[\overleftarrow {h^r_1}, \overleftarrow {h^r_2},…,\overleftarrow {h^r_P}]$.
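A sketch of one forward match-LSTM step per the equations above (single example, no batching); l is the hidden size, P and Q the passage and question lengths, and all weights are random stand-ins.

```python
import torch
import torch.nn.functional as F

l, P, Q = 64, 30, 12
H_p = torch.randn(l, P)                 # passage encodings, one column per token
H_q = torch.randn(l, Q)                 # question encodings
W_q, W_p, W_r = torch.randn(l, l), torch.randn(l, l), torch.randn(l, l)
b_p, w, b = torch.randn(l, 1), torch.randn(l, 1), torch.randn(1)
cell = torch.nn.LSTMCell(2 * l, l)
h_r, c_r = torch.zeros(1, l), torch.zeros(1, l)

for i in range(P):
    h_p_i = H_p[:, i:i + 1]                                              # [l, 1]
    # the (... ⊗ e_Q) replication is just broadcasting the [l, 1] term over the Q columns
    G_i = torch.tanh(W_q @ H_q + (W_p @ h_p_i + W_r @ h_r.t() + b_p))    # [l, Q]
    alpha_i = F.softmax(w.t() @ G_i + b, dim=-1)                         # [1, Q] attention over question
    z_i = torch.cat([h_p_i, H_q @ alpha_i.t()], dim=0).t()               # [1, 2l]
    h_r, c_r = cell(z_i, (h_r, c_r))                                     # match-LSTM state update
```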

### Answer Pointer Layer

#### The Sequence Model

The answer is represented by a sequence of integers $a=(a_1,a_2,…)$ indicating the positions of the selected tokens in the original passage.

$$F_k=tanh(V\tilde {H^r}+(W^ah^a_{k-1}+b^a)\otimes e_{P+1})\in R^{l\times (P+1)}$$

$$\beta_k=softmax(v^TF_k+c\otimes e_{P+1}) \in R^{1\times (P+1)}$$

$$h_k^a=\overrightarrow{LSTM}(\tilde {H^r}\beta_k^T, h^a_{k-1})$$

$$p(a|H^r)=\prod_k p(a_k|a_1,a_2,…,a_{k-1}, H^r)$$

$$p(a_k=j|a_1,a_2,…,a_{k-1})=\beta_{k,j}$$

$$-\sum_{n=1}^N logp(a_n|P_n,Q_n)$$

#### The Boundary Model

So the main difference from the sequence model above is that in the boundary model we do not need to add the zero padding to Hr, and the probability of generating an answer is simply modeled as:

$$p(a|H^r)=p(a_s|H^r)p(a_e|a_s, H^r)$$

Further extensions: a search mechanism and a bi-directional Ans-Ptr.

### Training

#### Dataset

SQuAD: Passages in SQuAD come from 536 articles from Wikipedia covering a wide range of topics. Each passage is a single paragraph from a Wikipedia article, and each passage has around 5 questions associated with it. In total, there are 23,215 passages and 107,785 questions. The data has been split into a training set (with 87,599 question-answer pairs), a development set (with 10,570 question-answer pairs) and a hidden test set.

#### configuration

• dimension l of the hidden layers is set to 150 or 300.

• Adamax: $\beta_1=0.9, \beta_2=0.999$

• minibatch size = 30

• no L2 regularization.

# Paper Notes: BiDAF

paper:

### Motivation

Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query.

Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention.

In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.

### Introduction

Attention mechanisms in previous works typically have one or more of the following characteristics. First, the computed attention weights are often used to extract the most relevant information from the context for answering the question by summarizing the context into a fixed-size vector. Second, in the text domain, they are often temporally dynamic, whereby the attention weights at the current time step are a function of the attended vector at the previous time step. Third, they are usually uni-directional, wherein the query attends on the context paragraph or the image.

• 1. The attention weights are used to extract the most relevant information from the context for answering the question, summarizing the context into a fixed-size vector.

• 2. In the text domain, attention is temporally dynamic: the attention weights at the current time step depend on the attended vector at the previous time step.

• 3. Attention is usually uni-directional: the query attends over the context paragraph or the image.

### Model Architecture

Compared with the traditional ways of applying attention to MC, BiDAF makes the following changes:

• First, our attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.

1) The context is not encoded into a fixed-size vector. Instead, the attended vector computed at every time step is allowed to flow onward (realized in the modeling layer via a BiLSTM), which reduces the information loss caused by early summarization.

• Second, we use a memory-less attention mechanism. That is, while we iteratively compute attention through time as in Bahdanau et al. (2015), the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.

2) Memory-less: at each time step, the attention is computed only from the query and the current context paragraph, and does not directly depend on the attention at the previous time step.

We hypothesize that this simplification leads to the division of labor between the attention layer and the modeling layer. It forces the attention layer to focus on learning the attention between the query and the context, and enables the modeling layer to focus on learning the interaction within the query-aware context representation (the output of the attention layer). It also allows the attention at each time step to be unaffected from incorrect attendances at previous time steps.

• Third, we use attention mechanisms in both directions, query-to-context and context-to-query, which provide complimentary information to each other.

Character Embedding Layer and Word Embedding Layer -> Contextual Embedding Layer -> Attention Flow Layer -> Modeling Layer -> Output Layer

#### Character Embedding Layer and Word Embedding Layer

• Character embedding of each word using a CNN. The outputs of the CNN are max-pooled over the entire width to obtain a fixed-size vector for each word.

• pre-trained word vectors, GloVe

• The concatenation of the two is passed to a two-layer highway network.

context -> $X\in R^{d\times T}$

query -> $Q\in R^{d\times J}$

#### contextual embedding layer

model the temporal interactions between words using biLSTM.

context -> $H\in R^{2d\times T}$

query -> $U\in R^{2d\times J}$

#### attention flow layer

the attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vector at each time step, along with the embeddings from previous layers, are allowed to flow through to the subsequent modeling layer.

$$S_{tj}=\alpha(H_{:t},U_{:j})\in R$$

Context-to-query Attention:

$$a_t=softmax(S_{t:})\in R^J$$

$$\tilde U_{:t}=\sum_j a_{tj}U_{:j}\in R^{2d}$$

Query-to-context Attention:

$$b=softmax(max_{col}(S))\in R^T$$

$$\tilde h = \sum_tb_tH_{:t}\in R^{2d}$$

$$G_{:t}=\beta (H_{:t},\tilde U_{:t}, \tilde H_{:t})\in R^{d_G}$$

The function $\beta$ can be a multi-layer perceptron; in the authors' experiments:

$$\beta(h,\tilde u,\tilde h)=[h;\tilde u;h\circ \tilde u;h\circ \tilde h]\in R^{8d\times T}$$

#### Modeling Layer

captures the interaction among the context words conditioned on the query.

#### Output Layer

$$p^1=softmax(w_{(p^1)}^T[G;M])$$

$$p^2=softmax(w_{(p^2)}^T[G;M^2])$$

### Training

$$L(\theta)=-{1 \over N} \sum^N_i[log(p^1_{y_i^1})+log(p^2_{y_i^2})]$$

The parameters $\theta$ include:

• the weights of CNN filters and LSTM cells

• $w_{S}$,$w_{p^1},w_{p^2}$

$y_i^1,y_i^2$ denote the indices in the context of the answer's start and end positions for example i.

$p^1,p^2\in R^T$ are the probabilities obtained from the softmax. Viewing the ground truth as a one-hot vector [0,0,…,1,0,0,0], the cross-entropy for a single example is:

$$- log(p^1_{y_i^1})-log(p^2_{y_i^2})$$

### Test

The answer span $(k; l)$ where $k \le l$ with the maximum value of $p^1_kp^2_l$ is chosen, which can be computed in linear time with dynamic programming.
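A sketch of the linear-time search over spans $(k, l)$ with $k \le l$ maximizing $p^1_kp^2_l$: keep a running best start while scanning the end position.

```python
import numpy as np

def best_span(p1, p2):
    """Return (k, l), k <= l, maximizing p1[k] * p2[l] in a single pass."""
    best_k, best, best_score = 0, (0, 0), -1.0
    for l in range(len(p2)):
        if p1[l] > p1[best_k]:
            best_k = l                         # best start position seen so far
        score = p1[best_k] * p2[l]
        if score > best_score:
            best_score, best = score, (best_k, l)
    return best

p1 = np.array([0.1, 0.6, 0.1, 0.2])            # start probabilities
p2 = np.array([0.2, 0.1, 0.5, 0.2])            # end probabilities
print(best_span(p1, p2))                       # (1, 2)
```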

# Paper Notes: Memory Networks

Notes on papers related to Memory Networks.

• Memory Network with strong supervision
• End-to-End Memory Network
• Dynamic Memory Network

## Paper reading 1: Memory Networks, Jason Weston

### Motivation

RNNs compress information into a final state, which limits how much they can remember. Memory Networks were proposed to improve on exactly this.

However, their memory (encoded by hidden states and weights) is typically too small, and is not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors). RNNs are known to have difficulty in performing memorization.

The basic motivation for Memory Networks is the need for a long-term memory to hold QA knowledge or the context of a conversation, and existing RNNs do not perform that well as long-term memory.

### Memory Networks

#### four components:

• I:(input feature map)

• G:(generalization)

• O:(output feature map)

• R:(response)

#### Detailed walkthrough

1. I component: encode the input text into an internal feature representation.

2. G component: generalization updates the memories by combining the old memories with the input: $m_i=G(m_i, I(x),m), \forall i$

3. O component: reading from memories and performing inference, calculating which memories are relevant to produce a good response.

$$o_1=O_1(q,m)=argmax_{i=1,2,..,N}s_O(q,m_i)$$

$$o_2=O_2(q,m)=argmax_{i=1,2,..,N}s_O([q,o_1],m_i)$$

output: $[q,o_1, o_2]$, which is also the input to module R.

$s_O$ is a function that scores the match between the pair of sentences x and $m_i$, i.e., how relevant the memory $m_i$ is to the question x.

$$s_O=qUU^Tm$$

$s_O$ measures how relevant the current memory m is to the question q.

U: the bilinear regression parameters; the score $qUU^Tm_{true}$ of a relevant fact should be higher than the score $qUU^Tm_{random}$ of an irrelevant one.

4. R component: decode the output feature o to obtain the final response: r = R(o)

$$r=argmax_{w\in W}s_R([q,m_{o_1},m_{o_2}],w)$$

W is the vocabulary; $s_R$ scores each word against the output features o to pick the most relevant word.

$s_R$ has the same form as $s_O$.

$$s(x,y)=xUU^Ty$$

#### The huge-memory problem

• Memories can be stored by entity or topic, so that G does not have to operate over the entire memory.

• If the memory is full, a forgetting mechanism can replace the least useful memory: a function H scores each memory and the lowest-scoring one is overwritten.

• Words can also be hashed, or word embeddings clustered; the idea is to place the input I(x) into one or more buckets and only score the memories in the same bucket.

### Loss function

minimize: $L_i = \sum_{j\ne y_i}max(0,s_j - s_{y_i}+\Delta)$

QA example:

(6) Did the model pick the correct first supporting sentence?

(7) Having picked the correct first sentence, can it pick the correct second one?

(6)+(7) together ask whether the correct supporting context is selected, and are used to train the attention parameters.

(8) With the correct supporting facts as input, can it pick the correct answer? This trains the response parameters.

## Paper reading 2 End-To-End Memory Networks

### motivation

The model in that work was not easy to train via backpropagation, and required supervision at each layer of the network.

Our model can also be seen as a version of RNNsearch with multiple computational steps (which we term “hops”) per output symbol.

### Model architecture

#### Single layer

• input: $x_1,…,x_i$

• query: q

1. Map the input and the query into the feature space:

• memory vector {$m_i$}: ${x_i}\stackrel A\longrightarrow {m_i}$

• internal state u: $q\stackrel B \longrightarrow u$

2. Compute attention, i.e., how well the query representation u matches each input sentence representation $m_i$: compute the match between u and each memory $m_i$ by taking the inner product followed by a softmax.

$$p_i=softmax(u^Tm_i)$$

p is a probability vector over the inputs.

3. Obtain the context vector:

• output vector: ${x_i}\stackrel C\longrightarrow {c_i}$

The response vector from the memory o is then a sum over the transformed inputs ci, weighted by the probability vector from the input:

$$o = \sum_ip_ic_i$$

4. Predict the final answer, usually a single word:

$$\hat a =softmax(Wu^{k+1})= softmax(W(o^k+u^k))$$

W can be viewed as a reverse embedding, with W.shape = [embed_size, V].

5. Decode $\hat a$ to obtain the natural-language response:

$$\hat a \stackrel C \longrightarrow a$$

A: input embedding matrix

C: output embedding matrix

W: answer prediction matrix

B: question embedding matrix
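A sketch of one memory hop following steps 1-4 above, with bag-of-words sentence representations; all sizes are toy stand-ins.

```python
import torch
import torch.nn.functional as F

V, d, n_sent = 30, 20, 4                           # vocab size, embedding dim, number of sentences
A = torch.randn(V, d)                              # input embedding matrix
B = torch.randn(V, d)                              # question embedding matrix
C = torch.randn(V, d)                              # output embedding matrix
W = torch.randn(d, V)                              # answer prediction matrix

x_bow = torch.randint(0, 2, (n_sent, V)).float()   # bag-of-words vectors of the input sentences
q_bow = torch.randint(0, 2, (1, V)).float()        # bag-of-words vector of the question

m = x_bow @ A                                      # memory vectors m_i        [n_sent, d]
c = x_bow @ C                                      # output vectors c_i        [n_sent, d]
u = q_bow @ B                                      # internal state u          [1, d]

p = F.softmax(u @ m.t(), dim=-1)                   # p_i = softmax(u^T m_i)    [1, n_sent]
o = p @ c                                          # o = sum_i p_i c_i         [1, d]
a_hat = F.softmax((o + u) @ W, dim=-1)             # answer distribution       [1, V]
```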

#### Multiple Layers/ Multiple hops

$$u^{k+1}=u^k+o^k$$

### Understanding it by comparison with the previous paper

• Input components: map the query and the sentences into the feature space.

• Generalization components: update the memory. Here the memory also changes, ${m_i}=AX$, but the embedding matrix A varies from layer to layer.

• Output components: attention computes the match between memory and query by an inner product followed by a softmax, and then updates the input, i.e., $[u_k,o_k]$, by addition, concatenation, or an RNN. The difference is that the previous paper takes an argmax, $o_2=O_2(q,m)=argmax_{i=1,2,..,N}s_O([q,o_1],m_i)$, i.e., selects the single best-matching memory $m_i$, whereas this paper takes a weighted sum over all memories.

• Response components: similar to the output components. The previous paper matches against every word in the vocabulary and picks the most similar one, $r=argmax_{w\in W}s_R([q,m_{o_1},m_{o_2}],w)$, whereas this paper computes $\hat a=softmax(Wu^{k+1})=softmax(W(u^k+o^k))$ and learns the answer prediction matrix W by minimizing the cross-entropy loss.

Overall, it is similar to the Memory Network model in [23], except that the hard max operations within each layer have been replaced with a continuous weighting from the softmax.

### Some technical details

• The output embedding matrix of one layer is the input embedding matrix of the next layer, i.e., $A^{k+1}=C^k$

• The output embedding of the last layer can be used as the prediction embedding matrix, i.e., $W^T=C^k$

• The question embedding matrix equals the input embedding matrix of the first layer, $B=A^1$

1. Layer-wise (RNN-like)
• $A^1=A^2=…=A^k, C^1=C^2=…=C^k$

• $u^{k+1} = Hu^k+o^k$

### Experiments

#### Modle details

##### Sentence representations

1. Bag-of-words (BOW) representation

$$m_i=\sum_jAx_{ij}$$

$$c_i=\sum_jCx_{ij}$$

$$u=\sum_jBq_j$$

2. Position encoding: encode the position of words within the sentence, i.e., take word order into account.

$$m_i=\sum_jl_j\cdot Ax_{ij}$$

Here i indexes the i-th sentence and j indexes the j-th word within that sentence.

$$l_{kj}=(1-j/J)-(k/d)(1-2j/J)$$

$$l_{kj} = 1+4(k- (d+1)/2)(j-(J+1)/2)/d/J$$


Position-encoding implementation (see the sketch below):
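A small NumPy sketch of the position-encoding weights following the second formula above; sizes are arbitrary.

```python
import numpy as np

def position_encoding(J, d):
    """l[k, j] = 1 + 4 * (k - (d+1)/2) * (j - (J+1)/2) / (d * J), with k = 1..d, j = 1..J."""
    k = np.arange(1, d + 1).reshape(-1, 1)         # embedding dimension index
    j = np.arange(1, J + 1).reshape(1, -1)         # word position within the sentence
    return 1 + 4 * (k - (d + 1) / 2) * (j - (J + 1) / 2) / (d * J)

l = position_encoding(J=6, d=4)                    # [d, J] weights
print(l.shape)                                     # (4, 6)
# m_i = sum_j l[:, j] * (A x_ij): element-wise reweighting before the bag-of-words sum
```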

##### Temporal Encoding

$$m_i=\sum_jAx_{ij}+T_A(i)$$

$$c_i=\sum_jCx_{ij}+T_C(i)$$

##### Learning time invariance by injecting random noise

we have found it helpful to add “dummy” memories to regularize TA.

#### Training Details

1.learning rate decay

3.linear start training

### Full implementation

https://github.com/PanXiebit/text-classification/blob/master/06-memory%20networks/memn2n_model.py

## Paper reading 3 Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

### Motivation

Most tasks in natural language processing can be cast into question answering (QA) problems over language input.

### Model Architecture

• Input Module: encode the input text into distributed representations.

• Question Module: encode the question into a distributed representation.

• Episodic Memory Module: use attention to choose which parts of the input to focus on, then produce a memory vector representation.

• Answer Module: generate the answer from the final memory vector.


#### Input Module

1. If the input is a single sentence, the input module outputs the hidden states computed by the RNN, with $T_C = T_I$, where $T_I$ is the number of words in the sentence.

2. If the input is a list of sentences, an end-of-sentence token is inserted after each sentence, and each sentence's final hidden state serves as that sentence's representation; the input module then outputs $T_C$ vectors, where $T_C$ equals the number of sentences in the sequence.

#### Question Module

$$q_t=GRU(L[w_t^Q],q_{t-1})$$

L denotes the embedding matrix.

$$q=q_{T_Q}$$

$T_Q$ is the number of words in the question.

#### Episodic Memory Module

1. Need for multiple episodes: iterating over the memory gives the model the ability to perform transitive inference.

2. Attention mechanism: a gating function is used as the attention mechanism. Whereas the end-to-end MemNN computes attention by an inner product followed by a softmax, here a two-layer feed-forward network G is used.

$$g_t^i=G(c_t,m^{i-1},q)$$

$c_t$ is the candidate fact, $m^{i-1}$ the previous memory, and q the question; t indexes the time step within the sentence sequence, and i indexes the episodic iteration.

$$z_t^i=[c_t, m^{i-1},q, c_t\circ q,c_t\circ m^{i-1},|c_t-q|,|c_t-m^{i-1}|, c_t^TW^{(b)}q, c_t^TW^{(b)}m^{i-1}]$$

$$G = \sigma(W^{(2)}tanh(W^{(1)}z_t^i+b^{(1)})+b^{(2)})$$
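A sketch of the two-layer gate G over the feature vector $z_t^i$ defined above; the hidden size and all weights are stand-ins.

```python
import torch

d, hidden = 32, 64
W_b = torch.randn(d, d)
W1, b1 = torch.randn(hidden, 7 * d + 2), torch.randn(hidden)
W2, b2 = torch.randn(1, hidden), torch.randn(1)

def gate(c_t, m_prev, q):
    """g_t^i = G(c_t, m^{i-1}, q): a scalar attention gate for the candidate fact c_t."""
    z = torch.cat([c_t, m_prev, q,
                   c_t * q, c_t * m_prev,
                   (c_t - q).abs(), (c_t - m_prev).abs(),
                   (c_t @ W_b @ q).reshape(1), (c_t @ W_b @ m_prev).reshape(1)])   # [7d + 2]
    return torch.sigmoid(W2 @ torch.tanh(W1 @ z + b1) + b2)

g = gate(torch.randn(d), torch.randn(d), torch.randn(d))
print(g)        # a single value in (0, 1)
```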