Paper Notes - QA BiDAF

Papers:
- BiDAF: Bidirectional Attention Flow for Machine Comprehension
- Match-LSTM: Machine Comprehension Using Match-LSTM and Answer Pointer

Motivation

Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query.
The definition of machine comprehension: answering a query about a given context requires modeling the interaction between the query and the context.

Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention.
How previous methods typically use the attention mechanism.

In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.
The method proposed in this paper, BiDAF, is a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.

Introduction

Attention mechanisms in previous works typically have one or more of the following characteristics. First, the computed attention weights are often used to extract the most relevant information from the context for answering the question by summarizing the context into a fixed-size vector. Second, in the text domain, they are often temporally dynamic, whereby the attention weights at the current time step are a function of the attended vector at the previous time step. Third, they are usually uni-directional, wherein the query attends on the context paragraph or the image.

A summary of the characteristics of attention in previous work:
- 1. The attention weights are used to extract the most relevant information from the context for answering the question, with the context summarized into a fixed-size vector.
- 2. In the text domain, attention is often temporally dynamic: the attention weights at the current time step are a function of the attended vector at the previous time step.
- 3. Attention is usually uni-directional: the query attends on the context paragraph or the image.

Model Architecture

Compared with previous applications of attention to MC, BiDAF makes the following improvements:
- >First, our attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.
1) The context paragraph is not encoded into a fixed-size vector. Instead, the attended vector computed at each time step is allowed to flow through to the modeling layer (where a biLSTM is applied), which reduces the information loss caused by early summarization.

  • Second, we use a memory-less attention mechanism. That is, while we iteratively compute attention through time as in Bahdanau et al. (2015), the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.
    2) Memory-less: at each time step, attention is computed only from the query and the context paragraph at the current time step, and does not directly depend on the attention at the previous time step.
    We hypothesize that this simplification leads to the division of labor between the attention layer and the modeling layer. It forces the attention layer to focus on learning the attention between the query and the context, and enables the modeling layer to focus on learning the interaction within the query-aware context representation (the output of the attention layer). It also allows the attention at each time step to be unaffected from incorrect attendances at previous time steps.
    This creates a division of labor between the attention layer and the modeling layer: the former focuses on the interaction between the context and the query, while the latter focuses on the interaction among the words of the query-aware context representation, i.e., the context weighted by the attention weights. It also keeps the attention at each time step unaffected by incorrect attendances at previous time steps.

  • Third, we use attention mechanisms in both directions, query-to-context and context-to-query, which provide complimentary information to each other.
    The attention is computed in both directions, query-to-context (Q2C) and context-to-query (C2Q), which provide complementary information to each other. In the ablation study on the dev set, removing C2Q and removing Q2C hurt performance by 12 and 10 points respectively, so the attention in the C2Q direction is the more important of the two.

The paper proposes a 6-layer architecture: Character Embedding Layer and Word Embedding Layer -> Contextual Embedding Layer -> Attention Flow Layer -> Modeling Layer -> Output Layer.

Character Embedding Layer and Word Embedding Layer

  • character embedding of each word using a CNN. The outputs of the CNN are max-pooled over the entire width to obtain a fixed-size vector for each word.
  • pre-trained word vectors, GloVe
  • the concatenation of the two is passed to a two-layer highway network (see the sketch below).

context -> \(X\in R^{d\times T}\)
query -> \(Q\in R^{d\times J}\)
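
A minimal PyTorch sketch of this layer. Module and parameter names such as `CharWordEmbedding` and `num_filters` are my own, and the CNN kernel size is illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """Two-layer highway network: x <- g * relu(W x) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * F.relu(transform(x)) + (1 - g) * x
        return x

class CharWordEmbedding(nn.Module):
    """Char-CNN (max-pooled over the word width) + frozen GloVe, then a highway network."""
    def __init__(self, char_vocab_size, char_dim, num_filters, glove_weights):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, num_filters, kernel_size=5, padding=2)
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)  # GloVe matrix
        self.highway = Highway(num_filters + glove_weights.size(1))

    def forward(self, char_ids, word_ids):
        # char_ids: (batch, T, word_len), word_ids: (batch, T)
        B, T, W = char_ids.shape
        c = self.char_emb(char_ids).view(B * T, W, -1).transpose(1, 2)  # (B*T, char_dim, W)
        c = F.relu(self.char_cnn(c)).max(dim=2).values.view(B, T, -1)   # max-pool over the word
        w = self.word_emb(word_ids)                                     # (B, T, glove_dim)
        return self.highway(torch.cat([c, w], dim=-1))                  # (B, T, d)
```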

Contextual Embedding Layer

model the temporal interactions between words using biLSTM.
context -> \(H\in R^{2d\times T}\)
query -> \(U\in R^{2d\times J}\)

The first three layers extract features of the context and the query at different levels of granularity.
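
A small sketch of the contextual embedding layer; sharing one biLSTM between the context and the query is an assumption here, and the dimensions are placeholders:

```python
import torch
import torch.nn as nn

d, T, J = 100, 20, 8
X = torch.randn(1, T, d)    # context embeddings from the previous layer
Q = torch.randn(1, J, d)    # query embeddings from the previous layer

# Bidirectional LSTM: each direction outputs d dims, so H and U are 2d-dimensional per word.
contextual = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
H, _ = contextual(X)        # (1, T, 2d)
U, _ = contextual(Q)        # (1, J, 2d)
```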

Attention Flow Layer

the attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vector at each time step, along with the embeddings from previous layers, are allowed to flow through to the subsequent modeling layer.

The inputs are H and U; the outputs are the query-aware context representation G, together with the contextual embeddings from the previous layer, which are passed on to the modeling layer.

This layer contains two attentions, Context-to-query Attention and Query-to-context Attention. They share the similarity matrix \(S\in R^{T\times J}\) (not a plain matrix product, but a computation similar to the one in Dynamic Memory Networks):
\[S_{tj}=\alpha(H_{:t},U_{:j})\in R\] where \(\alpha(h,u)=w_{(S)}^T[h;u;h\circ u]\), \(w_{(S)}\in R^{6d}\).
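
A numpy sketch of how S is computed; H, U and w_(S) are random stand-ins with the shapes from the note, and the nested loop is written for clarity, not speed:

```python
import numpy as np

def similarity_matrix(H, U, w_S):
    """S[t, j] = w_S^T [h; u; h*u] with h = H[:, t], u = U[:, j]."""
    T, J = H.shape[1], U.shape[1]
    S = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            S[t, j] = w_S @ np.concatenate([h, u, h * u])
    return S

d, T, J = 4, 20, 8
H = np.random.randn(2 * d, T)     # context from the contextual layer
U = np.random.randn(2 * d, J)     # query from the contextual layer
w_S = np.random.randn(6 * d)      # trainable weight vector w_(S)
S = similarity_matrix(H, U, w_S)  # (T, J)
```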

Context-to-query Attention:
C2Q computes, for each context word, which query words are most relevant to it. The weight of each query word with respect to the t-th context word is \[a_t=softmax(S_{t:})\in R^J\] These weights are then applied to the query and summed (a weighted sum over every query word), giving the attended query vector for the t-th context word:
\[\tilde U_{:t}=\sum_j a_{tj}U_{:j}\in R^{2d}\] Doing this for every context word yields \(\tilde U\in R^{2d\times T}\).

In short: compute the similarity between the context and the query, turn it into probabilities with a softmax, and use them as weights over the query, which gives an attended query vector for every context word.
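
A numpy sketch of C2Q, reusing the shapes above (U and S are random stand-ins here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, T, J = 4, 20, 8
U = np.random.randn(2 * d, J)   # query from the contextual layer
S = np.random.randn(T, J)       # similarity matrix from above

# a[t] = softmax(S[t, :]) is the distribution over query words for context word t,
# and U_tilde[:, t] = sum_j a[t, j] * U[:, j] is its attended query vector.
a = softmax(S, axis=1)          # (T, J)
U_tilde = U @ a.T               # (2d, T)
```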

Query-to-context Attention:
Using the same similarity matrix S as in C2Q, Q2C finds the context words that are most similar to some query word; these context words are important for answering the question.
First take, for each context word, the maximum similarity over the query words, \(max_{col}(S)\in R^T\), and then apply a softmax:
\[b=softmax(max_{col}(S))\in R^T\]
The weights b measure the importance of each context word with respect to the whole query; they are used for a weighted sum over the context: \[\tilde h = \sum_tb_tH_{:t}\in R^{2d}\] Tiling \(\tilde h\) T times gives \(\tilde H\in R^{2d\times T}\).
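
And a matching sketch of Q2C (again with random stand-ins for H and S):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, T, J = 4, 20, 8
H = np.random.randn(2 * d, T)   # context from the contextual layer
S = np.random.randn(T, J)       # similarity matrix from above

# Max over the query words for each context word, softmax over the T context words,
# weighted sum of the context columns, then tile T times.
b = softmax(S.max(axis=1))                    # (T,)
h_tilde = H @ b                               # (2d,)
H_tilde = np.tile(h_tilde[:, None], (1, T))   # (2d, T)
```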

Comparing C2Q and Q2C, Q2C is intuitively important because the answer we are looking for is ultimately content in the context (although the ablation above shows that removing C2Q hurts more). The two attentions are also computed differently: the weighted sum over the query is done separately for every context word, whereas for the weighted sum over the context we only keep, for each context word, its maximum similarity to any query word, because a context word should be attended as long as it is related to any word in the query.

Combining the three matrices gives G: \[G_{:t}=\beta (H_{:t},\tilde U_{:t}, \tilde H_{:t})\in R^{d_G}\]

The function \(\beta\) could be a multi-layer perceptron. In the authors' experiments: \[\beta(h,\tilde u,\tilde h)=[h;\tilde u;h\circ \tilde u;h\circ \tilde h]\in R^{8d}\] so \(d_G=8d\) and \(G\in R^{8d\times T}\).
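
The concatenation itself, as a numpy sketch with the paper's choice of β (inputs are random stand-ins):

```python
import numpy as np

d, T = 4, 20
H = np.random.randn(2 * d, T)        # contextual embeddings of the context
U_tilde = np.random.randn(2 * d, T)  # C2Q output
H_tilde = np.random.randn(2 * d, T)  # Q2C output (tiled)

# beta(h, u~, h~) = [h; u~; h*u~; h*h~] applied column-wise  ->  G in R^{8d x T}
G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)
assert G.shape == (8 * d, T)
```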

Modeling Layer

captures the interaction among the context words conditioned on the query.

A biLSTM is used; each direction outputs d dimensions, so the final output is \(M\in R^{2d\times T}\).
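
A sketch of the modeling layer (the paper stacks two bi-LSTM layers; the hyperparameters here are placeholders):

```python
import torch
import torch.nn as nn

d, T = 100, 20
G = torch.randn(1, T, 8 * d)   # attention flow output, batch-first

modeling = nn.LSTM(input_size=8 * d, hidden_size=d, num_layers=2,
                   bidirectional=True, batch_first=True)
M, _ = modeling(G)             # (1, T, 2d)
```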

Output Layer

The output layer is application-specific. For QA it predicts the start position p1 and the end position p2 of the answer span within the paragraph.
Start index: \[p^1=softmax(w_{(p^1)}^T[G;M])\] where \(w_{(p^1)}\in R^{10d}\).

End index: M is passed through another biLSTM to obtain \(M^2\in R^{2d\times T}\), and then \[p^2=softmax(w_{(p^2)}^T[G;M^2])\] where \(w_{(p^2)}\in R^{10d}\).
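
A sketch of both output heads; the trainable vectors are written as bias-free `nn.Linear(10d, 1)` layers, and names such as `span_lstm` are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, T = 100, 20
G = torch.randn(T, 8 * d)      # from the attention flow layer
M = torch.randn(T, 2 * d)      # from the modeling layer

w_p1 = nn.Linear(10 * d, 1, bias=False)            # plays the role of w_(p^1) in R^{10d}
w_p2 = nn.Linear(10 * d, 1, bias=False)            # plays the role of w_(p^2) in R^{10d}
span_lstm = nn.LSTM(2 * d, d, bidirectional=True)  # the extra biLSTM producing M^2

p1 = F.softmax(w_p1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=0)    # (T,) start distribution
M2, _ = span_lstm(M.unsqueeze(1))                  # (T, 1, 2d)
M2 = M2.squeeze(1)                                 # (T, 2d)
p2 = F.softmax(w_p2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=0)   # (T,) end distribution
```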

Training

The training loss:
\[L(\theta)=-{1 \over N} \sum^N_i[log(p^1_{y_i^1})+log(p^2_{y_i^2})]\]

\(\theta\) includes the following parameters:
- the weights of the CNN filters and LSTM cells
- \(w_{(S)}\), \(w_{(p^1)}\), \(w_{(p^2)}\)

\(y_i^1,y_i^2\) are the indices of the gold start and end positions in the context for the i-th example.

\(p^1,p^2\in R^T\) are the probabilities produced by the softmax. Viewing the ground truth as a one-hot vector [0,0,...,1,0,0,0], the cross-entropy for a single example is \[- log(p^1_{y_i^1})-log(p^2_{y_i^2})\]
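
In code, this objective is what a standard cross-entropy over the start and end distributions gives (a sketch with made-up logits and gold indices):

```python
import torch
import torch.nn.functional as F

N, T = 2, 20
start_logits = torch.randn(N, T)     # pre-softmax scores for p^1
end_logits = torch.randn(N, T)       # pre-softmax scores for p^2
y1 = torch.tensor([3, 7])            # gold start indices y_i^1
y2 = torch.tensor([5, 9])            # gold end indices y_i^2

# F.cross_entropy = softmax + negative log-likelihood, averaged over the batch,
# so the sum of the two terms equals -1/N * sum_i [log p^1_{y_i^1} + log p^2_{y_i^2}].
loss = F.cross_entropy(start_logits, y1) + F.cross_entropy(end_logits, y2)
```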

Test

The answer span \((k; l)\) where \(k \le l\) with the maximum value of \(p^1_kp^2_l\) is chosen, which can be computed in linear time with dynamic programming.
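
A small sketch of that search: for each end position, keep a running argmax of \(p^1\) over earlier positions, which yields the best span in O(T):

```python
import numpy as np

def best_span(p1, p2):
    """Return (k, l) with k <= l maximizing p1[k] * p2[l], in linear time."""
    best_k = 0                 # best start index seen so far (running argmax of p1)
    best_score, best = -1.0, (0, 0)
    for l in range(len(p1)):
        if p1[l] > p1[best_k]:
            best_k = l
        if p1[best_k] * p2[l] > best_score:
            best_score, best = p1[best_k] * p2[l], (best_k, l)
    return best

p1 = np.array([0.1, 0.6, 0.2, 0.1])
p2 = np.array([0.1, 0.1, 0.5, 0.3])
print(best_span(p1, p2))       # (1, 2)
```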