Paper Notes: Match-LSTM

Motivation

In SQuAD, the answers do not come from a small set of candidate answers and they have variable lengths. We propose an end-to-end neural architecture for the task.
An end-to-end model proposed for reading-comprehension tasks like SQuAD, where answers are not picked from a candidate set but, much like human answers, are spans of varying length.

The architecture is based on match-LSTM, a model we proposed previously for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences.
The architecture mainly builds on Pointer Networks, on top of the authors' earlier match-LSTM.

Benchmark datasets for reading comprehension:
- MCTest: A challenge dataset for the open-domain machine comprehension of text.
- Teaching machines to read and comprehend.
- The Goldilocks principle: Reading children’s books with explicit memory representations.
- Towards AI-complete question answering: A set of prerequisite toy tasks.
- SQuAD: 100,000+ questions for machine comprehension of text.

SQuAD

Traditional solutions to this kind of question answering tasks rely on NLP pipelines that involve multiple steps of linguistic analyses and feature engineering, including syntactic parsing, named entity recognition, question classification, semantic parsing, etc. Recently, with the advances of applying neural network models in NLP, there has been much interest in building end-to-end neural architectures for various NLP tasks, including several pieces of work on machine comprehension.
A traditional pipeline for this kind of question answering involves multiple steps of linguistic analysis and feature engineering: syntactic parsing, named entity recognition, question classification, semantic parsing, and so on. With the advances of deep learning, end-to-end models have started to appear.

End-to-end model architectures:
- Teaching machines to read and comprehend.
- The Goldilocks principle: Reading children’s books with explicit memory representations.
- Attention-based convolutional neural network for machine comprehension
- Text understanding with the attention sum reader network.
- Consensus attention-based neural networks for chinese reading comprehension.

However, given the properties of previous machine comprehension datasets, existing end-to-end neural architectures for the task either rely on the candidate answers (Hill et al., 2016; Yin et al., 2016) or assume that the answer is a single token (Hermann et al., 2015; Kadlec et al., 2016; Cui et al., 2016), which make these methods unsuitable for the SQuAD dataset.
Previous models either pick the answer from a set of candidates or assume the answer is a single token; neither fits SQuAD.

The model builds on the match-LSTM the authors previously proposed for textual entailment (Learning natural language inference with LSTM), and additionally applies Pointer Net (https://papers.nips.cc/paper/5866-pointer-networks), so that the prediction is drawn from the input sequence rather than from a fixed vocabulary.

We propose two ways to apply the Ptr-Net model for our task: a sequence model and a boundary model. We also further extend the boundary model with a search mechanism.
These are the two models proposed by the authors.

Model Architecture

Match-LSTM

Pointer Network

Pointer Network (Ptr-Net) model : to solve a special kind of problems where we want to generate an output sequence whose tokens must come from the input sequence. Instead of picking an output token from a fixed vocabulary, Ptr-Net uses attention mechanism as a pointer to select a position from the input sequence as an output symbol.
The answer is generated from the input sentences (the passage) rather than from a fixed vocabulary.
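
Below is a minimal sketch of the Ptr-Net idea, assuming an already-encoded input sequence `enc` and a decoder state `dec` (names, shapes, and the scoring parameters are illustrative, not the paper's): attention scores over the input positions are normalized with a softmax, and the selected position itself is the output symbol, so no fixed output vocabulary is needed.

```python
import torch
import torch.nn.functional as F

def pointer_step(enc, dec, W_enc, W_dec, v):
    """One pointer step: score every input position against the decoder state.

    enc:   (n, d)  encoder states of the input sequence
    dec:   (d,)    current decoder state
    W_enc: (h, d), W_dec: (h, d), v: (h,)  attention parameters
    """
    scores = v @ torch.tanh(W_enc @ enc.T + (W_dec @ dec)[:, None])  # (n,) one score per position
    probs = F.softmax(scores, dim=0)          # distribution over input positions
    return probs, int(torch.argmax(probs))    # the argmax position is the output token
```

Repeating this step, feeding the chosen position's encoding back into the decoder, produces an output sequence whose tokens are all positions of the input.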

Models similar to Pointer Networks:
- Incorporating copying mechanism in sequence-to-sequence learning.
- Text understanding with the attention sum reader network.

MATCH-LSTM AND ANSWER POINTER

The model consists of three main parts:

  • An LSTM preprocessing layer that preprocesses the passage and the question using LSTMs.
  • A match-LSTM layer that tries to match the passage against the question.
  • An Answer Pointer (Ans-Ptr) layer that uses Ptr-Net to select a set of tokens from the passage as the answer. The two proposed models differ only in this third layer.

LSTM Preprocessing Layer

\[H^p=\overrightarrow {LSTM}(P), H^q=\overrightarrow {LSTM}(Q)\]

A one-directional LSTM is used directly, so each hidden state in \(H^p\in R^{l\times P}, H^q\in R^{l\times Q}\) only contains left-context information.
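
A minimal PyTorch sketch of this preprocessing step, assuming word embeddings of dimension d and hidden size l; sharing one LSTM between the passage and the question is an assumption of the sketch, and all names are illustrative:

```python
import torch
import torch.nn as nn

d, l = 300, 150                               # embedding size and hidden size (illustrative)
lstm = nn.LSTM(input_size=d, hidden_size=l)   # one-directional LSTM

P_emb = torch.randn(20, 1, d)                 # passage embeddings:  (P, batch=1, d)
Q_emb = torch.randn(8, 1, d)                  # question embeddings: (Q, batch=1, d)

H_p, _ = lstm(P_emb)                          # (P, 1, l): each state sees only the left context
H_q, _ = lstm(Q_emb)                          # (Q, 1, l)
```

Dropping the batch dimension and transposing gives the \(l\times P\) and \(l\times Q\) matrices used in these notes.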

Match-LSTM Layer

\[\overrightarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overrightarrow {h^r}_{i-1}+b^p)\otimes e_Q)\in R^{l\times Q}\] \[\overrightarrow \alpha_i=softmax(w^T\overrightarrow G_i + b\otimes e_Q)\in R^{1\times Q}\]

the resulting attention weight \(\overrightarrow α_{i,j}\) above indicates the degree of matching between the \(i^{th}\) token in the passage with the \(j^{th}\) token in the question.

where \(W^q,W^p,W^r \in R^{l\times l}, b^p,w\in R^l, b\in R\).

So \(\overrightarrow \alpha_{i}\) measures the degree of matching between the whole question and the \(i^{th}\) passage token, i.e. attention in the usual sense.

Conventional attention simply multiplies the passage and question matrices, like the query-key products in the Transformer. A more elaborate variant, as in dynamic memory networks, subtracts and element-wise multiplies the two vectors to be matched and then feeds the result through a two-layer feed-forward network.

The attention score here is computed differently again: \(\overrightarrow{h^r_{i-1}}\) is the information obtained by using an LSTM to couple the weighted question with the previous passage token.

where: \[\overrightarrow z_i=\begin{bmatrix} h_i^p \\ H^q\overrightarrow {\alpha_i}^T \\ \end{bmatrix}\in R^{2l} \] \[\overrightarrow {h^r_i}=\overrightarrow{LSTM}(\overrightarrow{z_i},\overrightarrow{h^r_{i-1}})\]

Then, much as in an LSTM, the \(R^{l\times 1}\) vector obtained by coupling \(\overrightarrow{h_{i-1}^r}\) with the current passage representation \(H^p_i\) is repeated Q times via \(\otimes e_Q\), giving an \(R^{l\times Q}\) matrix, so \(\overrightarrow G_i\in R^{l\times Q}\); an affine layer followed by a softmax then yields the attention weights.

Putting it together: the attention score here is not obtained by a plain matrix product, nor by subtracting the vectors \(h^p_i, H^q\) and passing the result through a network. It is still similar in spirit: the two representations to be matched, \(h^p_i\) and \(H^q\), go through a two-layer network, with the term built from the current \(H_i^p\) and \(\overrightarrow {h_{i-1}^r}\) repeated Q times. In that sense it resembles DMN, except that rather than attending over the current vectors alone, an LSTM is used to carry over information from earlier steps.

The result is the desired output \(\overrightarrow {h^r}\), which combines attention with an LSTM.

The authors also run the same procedure with a reverse LSTM:
\[\overleftarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overleftarrow {h^r}_{i-1}+b^p)\otimes e_Q)\] \[\overleftarrow \alpha_i=softmax(w^T\overleftarrow G_i + b\otimes e_Q)\]

which likewise yields \(\overleftarrow {h_i^r}\).

  • \(\overrightarrow {H^r}\in R^{l\times P}\) denotes the hidden states \([\overrightarrow {h^r_1}, \overrightarrow {h^r_2},...,\overrightarrow {h^r_P}]\).
  • \(\overleftarrow {H^r}\in R^{l\times P}\) denotes the hidden states \([\overleftarrow {h^r_1}, \overleftarrow {h^r_2},...,\overleftarrow {h^r_P}]\).

Stacking the two gives the question-aware passage representation: \(H^r=\begin{bmatrix} \overrightarrow {H^r} \\ \overleftarrow {H^r} \end{bmatrix} \in R^{2l\times P}\)
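
A minimal sketch of the match-LSTM layer following the equations above. The tensor layout uses rows for tokens (the transpose of the \(l\times P\) convention in these notes); module and variable names are illustrative, and reusing one set of parameters for both directions is a simplification of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchLSTM(nn.Module):
    def __init__(self, l):
        super().__init__()
        self.W_q = nn.Linear(l, l, bias=False)
        self.W_p = nn.Linear(l, l, bias=False)
        self.W_r = nn.Linear(l, l, bias=True)   # its bias plays the role of b^p
        self.w = nn.Linear(l, 1, bias=True)     # w^T G_i + b
        self.cell = nn.LSTMCell(2 * l, l)       # input is z_i = [h^p_i ; H^q alpha_i^T]

    def forward(self, H_p, H_q):
        # H_p: (P, l) passage encodings, H_q: (Q, l) question encodings
        P, l = H_p.shape
        WqHq = self.W_q(H_q)                    # (Q, l), computed once for all i
        h_r = H_p.new_zeros(1, l)
        c_r = H_p.new_zeros(1, l)
        outputs = []
        for i in range(P):
            # G_i = tanh(W^q H^q + (W^p h^p_i + W^r h^r_{i-1} + b^p) broadcast over Q)
            G_i = torch.tanh(WqHq + self.W_p(H_p[i]) + self.W_r(h_r.squeeze(0)))
            alpha_i = F.softmax(self.w(G_i).squeeze(-1), dim=0)        # (Q,)
            # z_i = [h^p_i ; attention-weighted question H^q^T alpha_i]
            z_i = torch.cat([H_p[i], H_q.t() @ alpha_i]).unsqueeze(0)  # (1, 2l)
            h_r, c_r = self.cell(z_i, (h_r, c_r))
            outputs.append(h_r.squeeze(0))
        return torch.stack(outputs, dim=1)      # (l, P): one column per passage token

match = MatchLSTM(l=150)
H_p = torch.randn(20, 150)                      # stand-ins for the preprocessing-layer outputs
H_q = torch.randn(8, 150)
Hr_fwd = match(H_p, H_q)                        # forward direction
Hr_bwd = match(torch.flip(H_p, dims=[0]), H_q)  # same module run right-to-left
Hr_bwd = torch.flip(Hr_bwd, dims=[1])           # re-align to original token order
H_r = torch.cat([Hr_fwd, Hr_bwd], dim=0)        # (2l, P) question-aware passage
```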

Answer Pointer Layer

The Sequence Model

The answer is represented by a sequence of integers \(a=(a_1,a_2,...)\) indicating the positions of the selected tokens in the original passage.

Attention is used once more: \(\beta_{k,j}\) is the probability that the \(k^{th}\) answer token selects the \(j^{th}\) passage token, so \(\beta_k\in R^{1\times (P+1)}\).

\[F_k=tanh(V\tilde {H^r}+(W^ah^a_{k-1}+b^a)\otimes e_{P+1})\in R^{l\times (P+1)}\] \[\beta_k=softmax(v^TF_k+c\otimes e_{P+1}) \in R^{1\times (P+1)}\]

where \(\tilde {H^r}\in R^{2l\times (P+1)}\) is \(H^r\) concatenated with a zero vector, \(\tilde {H^r}=[H^r, 0]\), and \(V\in R^{l\times 2l}, W^a\in R^{l\times l}, b^a,v\in R^l, c\in R\).

So, as in the match-LSTM layer, the previous answer state is mapped through an affine layer, \(W^ah^a_{k-1}+b^a\), and repeated P+1 times to give an \(R^{l\times (P+1)}\) matrix that is added to \(V\tilde {H^r}\); after the tanh activation, another affine layer followed by a softmax turns this into a distribution over the P+1 positions.

\[h_k^a=\overrightarrow{LSTM}(\tilde {H^r}\beta_k^T, h^a_{k-1})\]

Here \(\tilde {H^r}\beta_k^T\), the \(\beta_k\)-weighted combination of \(\tilde {H^r}\), is fed to the LSTM as the input at step k. It feels somewhat ad hoc, but can roughly be viewed as attention over the passage combined with an LSTM.
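
A minimal sketch of one Ans-Ptr decoding step in the sequence model, assuming \(\tilde {H^r}\) is the \(2l\times (P+1)\) matrix \([H^r, 0]\); layer and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnsPtrStep(nn.Module):
    def __init__(self, l):
        super().__init__()
        self.V = nn.Linear(2 * l, l, bias=False)
        self.W_a = nn.Linear(l, l, bias=True)    # its bias plays the role of b^a
        self.v = nn.Linear(l, 1, bias=True)      # v^T F_k + c
        self.cell = nn.LSTMCell(2 * l, l)

    def forward(self, H_tilde, h_a, c_a):
        # F_k = tanh(V H~ + (W^a h^a_{k-1} + b^a) broadcast over the P+1 positions)
        F_k = torch.tanh(self.V(H_tilde.t()) + self.W_a(h_a))      # (P+1, l)
        beta_k = F.softmax(self.v(F_k).squeeze(-1), dim=0)         # (P+1,)
        # next decoder state: the LSTM input is the weighted combination H~ beta_k^T
        inp = (H_tilde @ beta_k).unsqueeze(0)                      # (1, 2l)
        h_a, c_a = self.cell(inp, (h_a.unsqueeze(0), c_a.unsqueeze(0)))
        return beta_k, h_a.squeeze(0), c_a.squeeze(0)
```

Each call returns both \(\beta_k\) (the answer distribution at step k) and the updated decoder state used for the next step.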

The probability of generating the answer sequence is modeled as:
\[p(a|H^r)=\prod_k p(a_k|a_1,a_2,...,a_{k-1}, H^r)\]

where: \[p(a_k=j|a_1,a_2,...,a_{k-1},H^r)=\beta_{k,j}\]

The loss function to minimize: \[-\sum_{n=1}^N \log p(a_n|P_n,Q_n)\]
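
As a tiny sketch of how this objective is evaluated for one training example, assuming `betas` holds the \(\beta_k\) distributions produced above and `answer` is the gold sequence of positions (ending with the stop index in the sequence model); both names are illustrative:

```python
import torch

def answer_nll(betas, answer):
    # -log p(a | H^r) = -sum_k log beta_{k, a_k}
    return -sum(torch.log(b[j]) for b, j in zip(betas, answer))
```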

The Boundary Model

So the main difference from the sequence model above is that in the boundary model we do not need to add the zero padding to \(H^r\), and the probability of generating an answer is simply modeled as: \[p(a|H^r)=p(a_s|H^r)p(a_e|a_s, H^r)\]

The boundary model is further extended with a search mechanism, which limits the span length and searches for the most probable span globally, and with a bi-directional Ans-Ptr.
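
A minimal sketch of boundary-model decoding with the search mechanism, assuming `beta_s` and `beta_e` are the start and end distributions over the P passage positions and `max_len` caps the searched span length (the exact cap is a hyperparameter; this function is illustrative, not the authors' code):

```python
import torch

def best_span(beta_s, beta_e, max_len=15):
    """Return the span (s, e) maximizing p(a) = p(a_s) * p(a_e | a_s)."""
    P = beta_s.shape[0]
    best, best_p = (0, 0), 0.0
    for s in range(P):
        for e in range(s, min(s + max_len, P)):
            p = (beta_s[s] * beta_e[e]).item()
            if p > best_p:
                best, best_p = (s, e), p
    return best, best_p
```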

Training

Dataset

SQuAD: Passages in SQuAD come from 536 articles from Wikipedia covering a wide range of topics. Each passage is a single paragraph from a Wikipedia article, and each passage has around 5 questions associated with it. In total, there are 23,215 passages and 107,785 questions. The data has been split into a training set (with 87,599 question-answer pairs), a development set (with 10,570 question-answer pairs) and a hidden test set.

Configuration

  • dimension l of the hidden layers is set to 150 or 300.
  • Adamax: \(\beta_1=0.9, \beta_2=0.999\) (a minimal optimizer setup is sketched after this list)
  • minibatch size = 30
  • no L2 regularization.
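
A minimal optimizer setup matching the configuration above; the PyTorch names are illustrative (the paper does not tie itself to a framework), and `model` below is only a placeholder for the full match-LSTM + Ans-Ptr network.

```python
import torch

model = torch.nn.Linear(300, 300)   # placeholder for the full network
optimizer = torch.optim.Adamax(model.parameters(), betas=(0.9, 0.999), weight_decay=0.0)
batch_size = 30                     # minibatch size; hidden dimension l is 150 or 300
```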

Result