# Paper Notes - QA BiDAF

paper: Bidirectional Attention Flow for Machine Comprehension (Seo et al., ICLR 2017)

### Motivation

Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query.

Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention.

In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.

### Introduction

Attention mechanisms in previous works typically have one or more of the following characteristics. First, the computed attention weights are often used to extract the most relevant information from the context for answering the question by summarizing the context into a fixed-size vector. Second, in the text domain, they are often temporally dynamic, whereby the attention weights at the current time step are a function of the attended vector at the previous time step. Third, they are usually uni-directional, wherein the query attends on the context paragraph or the image.

• 1. The attention weights are used to extract the most relevant information from the context for answering the question, with the context summarized into a fixed-size vector.

• 2. In the text domain, the attention is temporally dynamic: the attention weights at the current time step depend on the attended vector at the previous time step.

• 3. They are usually uni-directional: the query attends to the context paragraph or the image.

### Model Architecture

Compared with traditional ways of applying attention to MC tasks, BiDAF makes the following improvements:

• First, our attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.

1) The context is not summarized into a fixed-size vector. Instead, the attended vector computed at each time step is allowed to flow through to the subsequent modeling layer (realized there with a biLSTM), which reduces the information loss caused by early summarization.

• Second, we use a memory-less attention mechanism. That is, while we iteratively compute attention through time as in Bahdanau et al. (2015), the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.

2) Memory-less: at each time step, the attention is computed only from the query and the context paragraph at the current time step, and does not directly depend on the attention at the previous time step.

We hypothesize that this simplification leads to the division of labor between the attention layer and the modeling layer. It forces the attention layer to focus on learning the attention between the query and the context, and enables the modeling layer to focus on learning the interaction within the query-aware context representation (the output of the attention layer). It also allows the attention at each time step to be unaffected from incorrect attendances at previous time steps.

• Third, we use attention mechanisms in both directions, query-to-context and context-to-query, which provide complementary information to each other.

Character Embedding Layer and Word Embedding Layer -> Contextual Embedding Layer -> Attention Flow Layer -> Modeling Layer -> Output Layer

#### Character Embedding Layer and Word Embedding Layer

• character embedding of each word using a CNN. The outputs of the CNN are max-pooled over the entire width to obtain a fixed-size vector for each word.

• pre-trained word vectors, GloVe

• the concatenation of the two above is passed to a two-layer highway network.

context -> $X\in R^{d\times T}$

query -> $Q\in R^{d\times J}$
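A minimal PyTorch sketch of this layer, with hypothetical sizes (16-dimensional character embeddings, 100 CNN filters, GloVe vectors) and hypothetical class/parameter names; it only illustrates the char-CNN + GloVe + two-layer highway pipeline described above, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordRepresentation(nn.Module):
    """Char-CNN + pre-trained GloVe + 2-layer highway network (illustrative sketch)."""

    def __init__(self, char_vocab_size, glove_matrix, char_dim=16, n_filters=100):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # 1-D CNN over the characters of each word; max-pooling over the width
        # gives a fixed-size vector per word.
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel_size=5, padding=2)
        self.word_emb = nn.Embedding.from_pretrained(glove_matrix, freeze=True)  # GloVe
        self.d = n_filters + glove_matrix.size(1)            # the "d" of X in R^{d x T}
        self.highway = nn.ModuleList([nn.Linear(self.d, 2 * self.d) for _ in range(2)])

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T); char_ids: (B, T, L) with L = max word length
        B, T, L = char_ids.shape
        c = self.char_emb(char_ids).view(B * T, L, -1).transpose(1, 2)   # (B*T, char_dim, L)
        c = F.relu(self.char_cnn(c)).max(dim=2).values.view(B, T, -1)    # (B, T, n_filters)
        w = self.word_emb(word_ids)                                      # (B, T, glove_dim)
        x = torch.cat([c, w], dim=-1)                                    # (B, T, d)
        for layer in self.highway:                                       # two highway layers
            h, g = layer(x).chunk(2, dim=-1)
            gate = torch.sigmoid(g)
            x = gate * F.relu(h) + (1 - gate) * x
        return x.transpose(1, 2)                                         # (B, d, T), i.e. X
```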

#### Contextual Embedding Layer

Model the temporal interactions between words using a biLSTM.

context -> $H\in R^{2d\times T}$

query -> $U\in R^{2d\times J}$
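A sketch of the contextual layer under the same assumptions (hypothetical sizes, batch-first tensors): a bidirectional LSTM whose concatenated forward and backward states give 2d features per position.

```python
import torch
import torch.nn as nn

B, d, T, J = 4, 200, 30, 10                    # hypothetical batch size and dimensions
X = torch.randn(B, d, T)                       # context output of the embedding layers
Q = torch.randn(B, d, J)                       # query output of the embedding layers

# Bidirectional LSTM over the word representations; concatenating both directions
# gives 2d-dimensional features per position.
contextual_lstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
H, _ = contextual_lstm(X.transpose(1, 2))      # (B, T, 2d), i.e. H in R^{2d x T}
U, _ = contextual_lstm(Q.transpose(1, 2))      # (B, J, 2d), i.e. U in R^{2d x J}
```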

#### Attention Flow Layer

The attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vector at each time step, along with the embeddings from previous layers, is allowed to flow through to the subsequent modeling layer.

$$S_{tj}=\alpha(H_{:t},U_{:j})\in R$$

where $S\in R^{T\times J}$ and, in the paper, $\alpha(h,u)=w_{(S)}^T[h;u;h\circ u]$ with a trainable $w_{(S)}\in R^{6d}$.

Context-to-query Attention:

$$a_t=softmax(S_{t:})\in R^J$$

$$\tilde U_{:t}=\sum_j a_{tj}U_{:j}\in R^{2d}$$

Query-to-context Attention:

$$b=softmax(max_{col}(S))\in R^T$$

$$\tilde h = \sum_tb_tH_{:t}\in R^{2d}$$

$\tilde h$ is tiled $T$ times across the columns to give $\tilde H\in R^{2d\times T}$.

$$G_{:t}=\beta (H_{:t},\tilde U_{:t}, \tilde H_{:t})\in R^{d_G}$$

The function $\beta$ can be a multi-layer perceptron. In the authors' experiments:

$$\beta(h,\tilde u,\tilde h)=[h;\tilde u;h\circ \tilde u;h\circ \tilde h]\in R^{8d}$$

so that $d_G=8d$ and $G\in R^{8d\times T}$.
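The sketch below (hypothetical sizes, batch-first tensors) illustrates the whole layer: the similarity matrix $S$, context-to-query attention, query-to-context attention with tiling, and $\beta$ as the concatenation above. It is an illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, d, T, J = 4, 200, 30, 10
H = torch.randn(B, T, 2 * d)                   # contextual context embeddings
U = torch.randn(B, J, 2 * d)                   # contextual query embeddings
w_S = nn.Linear(6 * d, 1, bias=False)          # trainable w_(S)

# Similarity matrix S_{tj} = w_(S)^T [h; u; h o u]
h = H.unsqueeze(2).expand(B, T, J, 2 * d)
u = U.unsqueeze(1).expand(B, T, J, 2 * d)
S = w_S(torch.cat([h, u, h * u], dim=-1)).squeeze(-1)        # (B, T, J)

# Context-to-query: each context word attends over all query words.
a = F.softmax(S, dim=2)                                      # (B, T, J)
U_tilde = torch.bmm(a, U)                                    # (B, T, 2d)

# Query-to-context: attend to the context words most relevant to some query word,
# then tile the resulting 2d vector across all T positions.
b = F.softmax(S.max(dim=2).values, dim=1)                    # (B, T)
h_tilde = torch.bmm(b.unsqueeze(1), H)                       # (B, 1, 2d)
H_tilde = h_tilde.expand(B, T, 2 * d)

# beta: G_{:t} = [h; u~; h o u~; h o h~]
G = torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)  # (B, T, 8d)
```

Note that nothing here carries state across time steps, which is exactly the memory-less property discussed earlier.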

#### Modeling Layer

Two stacked biLSTM layers are applied to $G$, producing $M\in R^{2d\times T}$, which captures the interaction among the context words conditioned on the query.

#### Output Layer

$$p^1=softmax(w_{(p^1)}^T[G;M])$$

$$p^2=softmax(w_{(p^2)}^T[G;M^2])$$

where $M^2\in R^{2d\times T}$ is obtained by passing $M$ through another biLSTM and is used only for predicting the end index.
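A sketch covering both the modeling layer and the output layer (hypothetical sizes and module names): two stacked biLSTM layers over $G$ produce $M$, one more biLSTM produces $M^2$, and two linear projections followed by a softmax give the start and end distributions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, d, T = 4, 200, 30
G = torch.randn(B, T, 8 * d)                   # output of the attention flow layer

# Modeling layer: two stacked biLSTM layers over G give M in R^{2d x T}.
modeling_lstm = nn.LSTM(8 * d, d, num_layers=2, bidirectional=True, batch_first=True)
M, _ = modeling_lstm(G)                        # (B, T, 2d)

# M is passed through one more biLSTM to obtain M^2 for the end index.
end_lstm = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
M2, _ = end_lstm(M)                            # (B, T, 2d)

w_p1 = nn.Linear(10 * d, 1, bias=False)        # w_(p1)
w_p2 = nn.Linear(10 * d, 1, bias=False)        # w_(p2)

p1 = F.softmax(w_p1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=1)    # (B, T) start dist.
p2 = F.softmax(w_p2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=1)   # (B, T) end dist.
```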

### Training

$$L(\theta)=-{1 \over N} \sum^N_i[log(p^1_{y_i^1})+log(p^2_{y_i^2})]$$

$\theta$ includes the following parameters:

• the weights of CNN filters and LSTM cells

• $w_{(S)}$, $w_{(p^1)}$, $w_{(p^2)}$

$y_i^1, y_i^2$ denote the indices of the start and end positions of the answer in the context for the $i$-th example.

$p^1, p^2 \in R^T$ are the probabilities obtained from the softmax. The ground truth can be viewed as a one-hot vector $[0,0,\dots,1,0,0,0]$, so the cross entropy for a single example is:

$$- log(p^1_{y_i^1})-log(p^2_{y_i^2})$$
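A sketch of this loss, assuming `p1`/`p2` are the (B, T) probability tensors from the output layer and `y1`/`y2` the gold start/end indices:

```python
import torch

def bidaf_loss(p1, p2, y1, y2, eps=1e-12):
    """Mean of -log p1[y1] - log p2[y2] over the batch.

    p1, p2: (B, T) softmax probabilities; y1, y2: (B,) gold start/end indices.
    """
    nll_start = -torch.log(p1.gather(1, y1.unsqueeze(1)).squeeze(1) + eps)
    nll_end = -torch.log(p2.gather(1, y2.unsqueeze(1)).squeeze(1) + eps)
    return (nll_start + nll_end).mean()
```

In practice one would work directly with log-probabilities for numerical stability; the `eps` term is only a guard for the illustration.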

### Test

The answer span $(k; l)$ where $k \le l$ with the maximum value of $p^1_kp^2_l$ is chosen, which can be computed in linear time with dynamic programming.
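A sketch of that search: keep a running maximum of $p^1$ over positions up to the current end index, which gives the linear-time dynamic program mentioned above.

```python
import torch

def best_span(p1, p2):
    """Return (k, l, score) with k <= l maximizing p1[k] * p2[l], in O(T).

    p1, p2: 1-D tensors of length T (start and end distributions for one example).
    """
    best_score, best_k, best_l = -1.0, 0, 0
    run_k, run_p1 = 0, p1[0].item()            # best start probability seen so far
    for l in range(p2.size(0)):
        if p1[l].item() > run_p1:
            run_k, run_p1 = l, p1[l].item()
        score = run_p1 * p2[l].item()
        if score > best_score:
            best_score, best_k, best_l = score, run_k, l
    return best_k, best_l, best_score
```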

Xie Pan

2018-08-07

2021-06-29