# Paper Notes: Match-LSTM

## Motivation

In SQuAD, the answers do not come from a small set of candidates, and they have variable lengths. The paper proposes an end-to-end neural architecture for this task.

The architecture is based on match-LSTM, a model previously proposed for textual entailment, and Pointer Net (Ptr-Net), a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to come from the input sequence.

Related machine comprehension datasets:

• MCTest: A challenge dataset for the open-domain machine comprehension of text.

• Teaching machines to read and comprehend.

• The Goldilocks principle: Reading children’s books with explicit memory representations.

• Towards AI-complete question answering: A set of prerequisite toy tasks.

• SQuAD: 100,000+ questions for machine comprehension of text.

Traditional solutions to this kind of question answering task rely on NLP pipelines that involve multiple steps of linguistic analysis and feature engineering, including syntactic parsing, named entity recognition, question classification, semantic parsing, etc. Recently, with the advances of applying neural network models in NLP, there has been much interest in building end-to-end neural architectures for various NLP tasks, including several pieces of work on machine comprehension.

Prior end-to-end neural architectures for machine comprehension:

• Teaching machines to read and comprehend.

• The Goldilocks principle: Reading children’s books with explicit memory representations.

• Attention-based convolutional neural network for machine comprehension.

• Text understanding with the attention sum reader network.

• Consensus attention-based neural networks for chinese reading comprehension.

However, given the properties of previous machine comprehension datasets, existing end-to-end neural architectures for the task either rely on the candidate answers (Hill et al., 2016; Yin et al., 2016) or assume that the answer is a single token (Hermann et al., 2015; Kadlec et al., 2016; Cui et al., 2016), which make these methods unsuitable for the SQuAD dataset.

We propose two ways to apply the Ptr-Net model for our task: a sequence model and a boundary model. We also further extend the boundary model with a search mechanism.

## Model Architecture

### Pointer Network

The Pointer Network (Ptr-Net) model solves a special kind of problem where we want to generate an output sequence whose tokens must come from the input sequence. Instead of picking an output token from a fixed vocabulary, Ptr-Net uses an attention mechanism as a pointer to select a position from the input sequence as the output symbol.
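As a rough illustration (not code from the paper), a single pointer step can be sketched as follows; `enc_states`, `dec_state`, and the weights `W_enc`, `W_dec`, `v` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def pointer_step(enc_states, dec_state, W_enc, W_dec, v):
    """One Ptr-Net step: the attention distribution over input positions
    is itself the output distribution (no fixed output vocabulary)."""
    # enc_states: (seq_len, hidden), dec_state: (hidden,)
    scores = torch.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v   # (seq_len,)
    probs = F.softmax(scores, dim=0)            # attention weights over input positions
    return probs, int(torch.argmax(probs))      # pointer: index of the selected input token

# toy usage with random tensors
h = 8
probs, idx = pointer_step(torch.randn(5, h), torch.randn(h),
                          torch.randn(h, h), torch.randn(h, h), torch.randn(h))
```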

### Match-LSTM and Answer Pointer

The proposed model consists of three layers:

• An LSTM preprocessing layer that encodes the passage and the question with LSTMs.

• A match-LSTM layer that matches the LSTM-encoded passage against the question.

• An Answer Pointer (Ans-Ptr) layer that uses Ptr-Net to select a set of tokens from the passage as the answer. The two variants of the model (sequence and boundary) differ only in this third layer.

#### LSTM Preprocessing Layer

$$H^p=\overrightarrow {LSTM}(P)\in R^{l\times P}, \quad H^q=\overrightarrow {LSTM}(Q)\in R^{l\times Q}$$
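A minimal PyTorch sketch of this layer (illustrative only; sizes and whether the passage and question share LSTM parameters are assumptions, not details given in the note):

```python
import torch
import torch.nn as nn

d, l = 300, 150                       # embedding size and hidden size (illustrative)
lstm = nn.LSTM(input_size=d, hidden_size=l, batch_first=True)  # one-directional LSTM

P = torch.randn(1, 120, d)            # passage word embeddings: (batch, P, d)
Q = torch.randn(1, 20, d)             # question word embeddings: (batch, Q, d)

H_p, _ = lstm(P)                      # H^p: (1, P, l) hidden states over the passage
H_q, _ = lstm(Q)                      # H^q: (1, Q, l) hidden states over the question
```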

#### Match-LSTM Layer

$$\overrightarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overrightarrow {h^r}_{i-1}+b^p)\otimes e_Q)\in R^{l\times Q}$$

$$\overrightarrow \alpha_i=softmax(w^T\overrightarrow G_i + b\otimes e_Q)\in R^{1\times Q}$$

The resulting attention weight $\overrightarrow \alpha_{i,j}$ indicates the degree of matching between the $i^{th}$ token in the passage and the $j^{th}$ token in the question.

$$\overrightarrow z_i=\begin{bmatrix} h_i^p \\ H^q\overrightarrow {\alpha_i}^T \end{bmatrix}$$

$$\overrightarrow {h^r_i}=\overrightarrow{LSTM}(\overrightarrow{z_i},\overrightarrow{h^r_{i-1}})$$

A match-LSTM in the reverse direction is built analogously, processing the passage from right to left (so the recurrent state comes from position $i+1$):

$$\overleftarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overleftarrow {h^r}_{i+1}+b^p)\otimes e_Q)$$

$$\overleftarrow \alpha_i=softmax(w^T\overleftarrow G_i + b\otimes e_Q)$$

• $\overrightarrow {H^r}\in R^{l\times P}$ denotes the forward hidden states $[\overrightarrow {h^r_1}, \overrightarrow {h^r_2},…,\overrightarrow {h^r_P}]$.

• $\overleftarrow {H^r}\in R^{l\times P}$ denotes the backward hidden states $[\overleftarrow {h^r_1}, \overleftarrow {h^r_2},…,\overleftarrow {h^r_P}]$.

The two are concatenated to form the final passage representation $H^r=\begin{bmatrix} \overrightarrow {H^r} \\ \overleftarrow {H^r} \end{bmatrix}\in R^{2l\times P}$.
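A sketch of one forward match-LSTM step following the equations above; the weight tensors and the `nn.LSTMCell` (hidden size $l$, input size $2l$) are illustrative stand-ins, and batching is omitted:

```python
import torch
import torch.nn.functional as F

def match_lstm_step(H_q, h_p_i, state_prev, W_q, W_p, W_r, b_p, w, b, lstm_cell):
    # H_q: (l, Q) question encodings; h_p_i: (l,) passage encoding at position i
    h_r_prev, c_prev = state_prev                     # previous match-LSTM hidden/cell state, each (l,)
    G_i = torch.tanh(W_q @ H_q + (W_p @ h_p_i + W_r @ h_r_prev + b_p).unsqueeze(1))  # (l, Q)
    alpha_i = F.softmax(w @ G_i + b, dim=0)           # (Q,) attention over question tokens
    z_i = torch.cat([h_p_i, H_q @ alpha_i])           # (2l,) passage token + attended question
    h_i, c_i = lstm_cell(z_i.unsqueeze(0), (h_r_prev.unsqueeze(0), c_prev.unsqueeze(0)))
    return alpha_i, (h_i.squeeze(0), c_i.squeeze(0))  # attention weights and new state
```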

#### Answer Pointer Layer

##### The Sequence Model

The answer is represented by a sequence of integers $a=(a_1,a_2,…)$ indicating the positions of the selected tokens in the original passage. Because answers have variable lengths, a special position $P+1$ is appended to signal the end of the answer, and $\tilde {H^r}\in R^{2l\times (P+1)}$ denotes $H^r$ extended with a zero vector at this extra position.

$$F_k=tanh(V\tilde {H^r}+(W^ah^a_{k-1}+b^a)\otimes e_{P+1})\in R^{l\times (P+1)}$$

$$\beta_k=softmax(v^TF_k+c\otimes e_{P+1}) \in R^{1\times (P+1)}$$

$$h_k^a=\overrightarrow{LSTM}(\tilde {H^r}\beta_k^T, h^a_{k-1})$$

$$p(a|H^r)=\prod_k p(a_k|a_1,a_2,…,a_{k-1}, H^r)$$

$$p(a_k=j|a_1,a_2,…,a_{k-1},H^r)=\beta_{k,j}$$

The model is trained by minimizing the negative log-likelihood over the $N$ training examples:

$$-\sum_{n=1}^N \log p(a_n|P_n,Q_n)$$
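One decoding step of the sequence model can be sketched as follows (illustrative names; `H_r_tilde` is $\tilde{H^r}$ with the extra zero column at position $P+1$, and the `nn.LSTMCell` has input size $2l$ and hidden size $l$):

```python
import torch
import torch.nn.functional as F

def ans_ptr_step(H_r_tilde, state_prev, V, W_a, b_a, v, c, lstm_cell):
    # H_r_tilde: (2l, P+1); state_prev: previous answer-LSTM hidden/cell state, each (l,)
    h_a_prev, c_prev = state_prev
    F_k = torch.tanh(V @ H_r_tilde + (W_a @ h_a_prev + b_a).unsqueeze(1))  # (l, P+1)
    beta_k = F.softmax(v @ F_k + c, dim=0)          # (P+1,) distribution over positions
    inp = (H_r_tilde @ beta_k).unsqueeze(0)         # (1, 2l) attention-weighted summary
    h_a_k, c_k = lstm_cell(inp, (h_a_prev.unsqueeze(0), c_prev.unsqueeze(0)))
    return beta_k, (h_a_k.squeeze(0), c_k.squeeze(0))
```

Decoding stops once the special end position $P+1$ receives the highest probability.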

##### The Boundary Model

The boundary model predicts only the start index $a_s$ and the end index $a_e$ of the answer span. The main difference from the sequence model above is that we no longer need to add the zero padding to $H^r$, and the probability of generating an answer is simply modeled as:

$$p(a|H^r)=p(a_s|H^r)p(a_e|a_s, H^r)$$

The boundary model is further extended with a search mechanism, which limits the length of the predicted span and searches for the span with the highest probability $p(a_s)\,p(a_e|a_s)$, and with a bi-directional Ans-Ptr.
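For the boundary model, a simple decoding sketch with the search idea might look like the following; `max_len` is a hypothetical span-length limit, and the conditioning of the end distribution on the chosen start is simplified away here for illustration:

```python
def search_span(p_start, p_end, max_len=15):
    """Return the (start, end) pair maximizing p_start[s] * p_end[e] with e - s < max_len."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```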

### Training

#### Dataset

SQuAD: Passages in SQuAD come from 536 articles from Wikipedia covering a wide range of topics. Each passage is a single paragraph from a Wikipedia article, and each passage has around 5 questions associated with it. In total, there are 23,215 passages and 107,785 questions. The data has been split into a training set (with 87,599 question-answer pairs), a development set (with 10,570 question-answer pairs), and a hidden test set.

#### Configuration

• The dimension $l$ of the hidden layers is set to 150 or 300.

• Adamax optimizer: $\beta_1=0.9, \beta_2=0.999$

• minibatch size = 30

• no L2 regularization.
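A hypothetical PyTorch configuration mirroring these settings (the model below is a stand-in for the full match-LSTM + Ans-Ptr network):

```python
import torch
import torch.nn as nn

l = 150                                            # hidden dimension (150 or 300)
model = nn.LSTM(input_size=300, hidden_size=l)     # placeholder for the real network
optimizer = torch.optim.Adamax(model.parameters(),
                               betas=(0.9, 0.999), weight_decay=0.0)  # no L2 regularization
batch_size = 30                                    # minibatch size
```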

Xie Pan

2018-08-14

2021-06-29