Paper Notes: Multi-Cast Attention Networks

paper: Multi-Cast Attention Networks for Retrieval-based Question Answering and Response Prediction

Motivation

Our approach performs a series of soft attention operations, each time casting a scalar feature upon the inner word embeddings. The key idea is to provide a real-valued hint (feature) to a subsequent encoder layer and is targeted at improving the representation learning process.
Before the encoder layer, the document and query first interact with each other; the resulting attention weights are applied to the document and query, and the contextual/encoder layer then encodes representations that incorporate this information. The purpose is to provide hints (features) to subsequent layers and improve representation learning.

The key idea of attention is to extract only the most relevant information that is useful for prediction. In the context of textual data, attention learns to weight words and sub-phrases within documents based on how important they are. In the same vein, co-attention mechanisms [5, 28, 50, 54] are a form of attention mechanisms that learn joint pairwise attentions, with respect to both document and query.
The key idea of attention is to extract only the most relevant information that is useful for prediction. For text, attention learns to weight words and sub-phrases within a document according to their importance. In the same vein, co-attention mechanisms are a form of attention that learns joint pairwise attention over both the document and the query.

Attention is traditionally used and commonly imagined as a feature extractor. Its behavior can be thought of as a dynamic form of pooling as it learns to select and compose different words to form the final document representation.
Traditional attention can be viewed as a feature extractor. Its behavior can be thought of as a dynamic form of pooling, since it learns to select and compose different words to form the final document representation.

This paper re-imagines attention as a form of feature augmentation method. Attention is casted with the purpose of not compositional learning or pooling but to provide hints for subsequent layers. To the best of our knowledge, this is a new way to exploit attention in neural ranking models.
This paper re-imagines attention as a form of feature augmentation: attention is cast not for compositional learning or pooling, but to provide hints (features) to subsequent layers. This is a new way to exploit attention in neural ranking models.

Regardless of whether the new way of using attention proposed here is the most effective, the paper's many explanations of attention are refreshing and show a very thorough understanding.

An obvious drawback which applies to many existing models is that they are generally restricted to one attention variant. In the case where one or more attention calls are used (e.g., co-attention and intra-attention, etc.), concatenation is generally used to fuse representations [20, 28]. Unfortunately, this incurs cost in subsequent layers by doubling the representation size per call.
Many papers are restricted to a single attention variant, which is clearly limiting. Some do use multiple attention mechanisms, e.g. co-attention plus intra-attention, and simply concatenate the resulting representations, but this doubles the dimensionality of the subsequent modeling layer per call. (That is exactly what I did in the AI Challenger competition... and it did not seem that bad.)

The rationale for desiring more than one attention call is intuitive. In [20, 28], Co-Attention and Intra-Attention are both used because each provides a different view of the document pair, learning high quality representations that could be used for prediction. Hence, this can significantly improve performance.
Intuitively, using more than one attention call makes sense: co-attention and intra-attention each provide a different view of the document pair, which helps learn high-quality representations for prediction.

On the various attention variants:

  • https://zhuanlan.zhihu.com/p/35041012
  • [50] co-attention: Dynamic Coattention Networks for Question Answering
  • [5] [Attentive Pooling Networks](https://arxiv.org/abs/1602.03609)
  • [28] [Inter-Weighted Alignment Network for Sentence Pair Modeling](https://aclanthology.info/pdf/D/D17/D17-1122.pdf)
  • [54] [Attentive Interactive Neural Networks for Answer Selection in Community Question Answering](https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14611)
  • intra-attention: Attention Is All You Need

Moreover, Co-Attention also comes in different flavors and can either be used with extractive max-mean pooling [5, 54] or alignment-based pooling [3, 20, 28]. Each co-attention type produces different document representations. In max-pooling, signals are extracted based on a word’s largest contribution to the other text sequence. Mean-pooling calculates its contribution to the overall sentence. Alignment-pooling is another flavor of co-attention, which aligns semantically similar sub-phrases together.
Co-attention can be used with extractive max- or mean-pooling, or with alignment-based pooling, and each co-attention type produces a different document representation. With max-pooling, signals are extracted based on a word's largest contribution to the other text sequence; mean-pooling measures its contribution to the overall sentence; alignment-based pooling is another flavor of co-attention that aligns semantically similar sub-phrases. Different pooling operations therefore provide different views of the sentence pair.

  • [3] [Enhanced LSTM for Natural Language Inference](https://arxiv.org/abs/1609.06038)
  • [20] [A Decomposable Attention Model for Natural Language Inference](https://arxiv.org/abs/1606.01933)

Our approach is targeted at serving two important purposes: (1) It removes the need for architectural engineering of this component by enabling attention to be called for an arbitrary k times with hardly any consequence and (2) concurrently it improves performance by modeling multiple views via multiple attention calls. As such, our method is in similar spirit to multi-headed attention, albeit efficient. To this end, we introduce Multi-Cast Attention Networks (MCAN), a new deep learning architecture for a potpourri of tasks in the question answering and conversation modeling domains.
Two contributions:
(1) It removes the need for architectural engineering of this component, allowing attention to be cast an arbitrary k times with hardly any added cost.
(2) At the same time, it improves performance by modeling multiple views through multiple attention casts, in a spirit similar to multi-headed attention.

In our approach, attention is casted, in contrast to most other works that use it as a pooling operation. We cast co-attention multiple times, each time returning a compressed scalar feature that is re-attached to the original word representations. The key intuition is that compression enables scalable casting of multiple attention calls, aiming to provide subsequent layers with a hint of not only global knowledge but also cross sentence knowledge.
In contrast to most other work, which uses attention as a pooling operation, here attention is cast: co-attention is cast multiple times, and each cast returns a compressed scalar feature that is re-attached to the original word representation. The compression function makes casting many attention calls scalable, and aims to give subsequent layers hints of both global knowledge and cross-sentence knowledge.

Model Architecture: Multi-cast Attention Networks

Figure 1: Illustration of our proposed Multi-Cast Attention Networks (Best viewed in color). MCAN is a wide multi-headed attention architecture that utilizes compression functions and attention as features.

The model input is a document/query sentence pair.

Input Encoder

embedding layer

Each word is mapped into a d-dimensional vector space: \(w\in R^d\)

Highway Encoder:

Highway networks make it possible to optimize networks of arbitrary depth. This is achieved through a gating mechanism that controls the flow of information through the network: the network provides paths along which information can pass without loss, called information highways. In other words, highway networks mainly address the difficulty of training deep networks where gradient flow is obstructed as depth increases.

highway encoders can be interpreted as data-driven word filters. As such, we can imagine them to parametrically learn which words have an inclination to be important and not important to the task at hand. For example, filtering stop words and words that usually do not contribute much to the prediction. Similar to recurrent models that are gated in nature, this highway encoder layer controls how much information (of each word) is flowed to the subsequent layers.
In this model, every word vector passes through a highway encoder layer. Highway networks are gated nonlinear transform layers that control the information flow to subsequent layers. Many works replace the raw word vectors with a trained projection layer, which saves computation and reduces the number of trainable parameters; this paper extends that projection layer into a highway encoder. It can be interpreted as a data-driven word filter that parametrically learns which words are important to the task at hand and which are not, e.g. filtering out stop words and words that usually contribute little to the prediction. Similar to gated recurrent models, the highway encoder layer controls how much information from each word flows to the next layer.

\[y=H(x,W_H)\cdot T(x,W_T) + (1-T(x,W_T))\cdot x\]

where \(W_H, W_T\in R^{r\times d}\) are learnable parameters. H(.) and T(.) are affine transforms followed by ReLU and sigmoid respectively; the gate T(.) controls how much information flows to the next layer.
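A minimal numpy sketch of one highway encoder layer following the equation above (this is my own illustration, with the simplifying assumption r = d so the carry path is dimension-compatible; it is not code from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x) * T(x) + (1 - T(x)) * x.

    x: (d,) word vector; W_H, W_T: (d, d) so the carry connection is valid.
    """
    h = relu(W_H @ x + b_H)       # transform H(x, W_H)
    t = sigmoid(W_T @ x + b_T)    # gate T(x, W_T)
    return h * t + (1.0 - t) * x  # gated mix of transform and carry

# toy usage on a single 8-dimensional word vector
d = 8
rng = np.random.default_rng(0)
w = rng.normal(size=d)
y = highway_layer(w, rng.normal(size=(d, d)), np.zeros(d),
                  rng.normal(size=(d, d)), np.zeros(d))
print(y.shape)  # (8,)
```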

co-attention

Co-Attention [50] is a pairwise attention mechanism that enables attending to text sequence pairs jointly. In this section, we introduce four variants of attention, i.e., (1) max-pooling, (2) mean-pooling, (3) alignment-pooling, and finally (4) intra-attention (or self attention).
Co-attention is a pairwise attention mechanism that attends to a pair of text sequences jointly. The authors introduce four attention variants.

1. Affinity/similarity matrix

\[s_{ij}=F(q_i)^TF(d_j)\] where F(.) is a multi-layer perceptron. Other common choices for computing the similarity matrix include: \[s_{ij}=q_i^TMd_j,\quad s_{ij}=F([q_i;d_j])\]

as well as the form used in BiDAF and QANet: \(s_{ij}=F([q_i;d_j;q_i\circ d_j])\).
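A small numpy sketch of building the affinity matrix s for a query/document pair. Here F(.) is simplified to a single shared linear layer with ReLU (the paper uses a multi-layer perceptron), and all shapes are illustrative assumptions:

```python
import numpy as np

def affinity_matrix(Q, D, W_F):
    """s[i, j] = F(q_i)^T F(d_j), with F as a shared linear + ReLU projection.

    Q: (l_q, d) query word vectors, D: (l_d, d) document word vectors.
    Returns s with shape (l_q, l_d).
    """
    FQ = np.maximum(0.0, Q @ W_F)  # F(q_i) for every query position
    FD = np.maximum(0.0, D @ W_F)  # F(d_j) for every document position
    return FQ @ FD.T               # all pairwise dot products

rng = np.random.default_rng(0)
Q, D = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
s = affinity_matrix(Q, D, rng.normal(size=(8, 8)))
print(s.shape)  # (5, 7)
```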

2. Extractive pooling

max-pooling

Attends to the word that matches best after interacting with the other sequence: \[q'=soft(max_{col}(s))^Tq,\quad d'=soft(max_{row}(s))^Td\]

soft(.) is the softmax function; \(q', d'\) are the co-attentive representations of q and d respectively.

mean-pooling

Attends to the whole of the other sentence by averaging: \[q'=soft(mean_{col}(s))^Tq,\quad d'=soft(mean_{row}(s))^Td\]

each pooling operator has different impacts and can be intuitively understood as follows: max-pooling selects each word based on its maximum importance of all words in the other text. Mean-pooling is a more wholesome comparison, paying attention to a word based on its overall influence on the other text. This is usually dataset-dependent, regarded as a hyperparameter and is tuned to see which performs best on the held out set.
Different pooling operations have different effects and capture different information: max-pooling selects each word based on its maximum importance over all words in the other text, while mean-pooling attends to each word based on its overall influence on the other text.
Which works better is dataset- and task-dependent; it can be treated as a hyperparameter and tuned on the held-out set.
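A hedged numpy sketch of both extractive pooling variants over the affinity matrix s (the axis convention below, rows = query words and columns = document words, is my own assumption):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extractive_coattention(s, Q, D, mode="max"):
    """Extractive co-attention pooling over the affinity matrix s (shape l_q x l_d).

    mode="max":  weight each word by its maximum affinity to the other sequence;
    mode="mean": weight each word by its average affinity to the other sequence.
    Returns single co-attentive vectors q' and d'.
    """
    reduce = np.max if mode == "max" else np.mean
    a_q = softmax(reduce(s, axis=1))   # one weight per query word
    a_d = softmax(reduce(s, axis=0))   # one weight per document word
    return a_q @ Q, a_d @ D            # q' = soft(...)^T q,  d' = soft(...)^T d

rng = np.random.default_rng(0)
Q, D = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
s = rng.normal(size=(5, 7))
q_prime, d_prime = extractive_coattention(s, Q, D, mode="max")
print(q_prime.shape, d_prime.shape)   # (8,) (8,)
```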

3. Alignment-Pooling

\[d_i':=\sum^{l_q}_{j=1}\dfrac{exp(s_{ij})}{\sum_{k=1}^{l_q}exp(s_{ik})}q_j\]

where \(d_i'\) is the soft alignment of \(d_i\) with q. Intuitively, \(d_i'\) is a weighted sum over \(\{q_j\}^{l_q}_{j=1}\), with the weights determined by \(d_i\).

\[q_i':=\sum^{l_d}_{j=1}\dfrac{exp(s_{ij})}{\sum_{k=1}^{l_d}exp(s_{ik})}d_j\]

Likewise, \(q_i'\) is the soft alignment of \(q_i\) with d; that is, \(q_i'\) is a weighted sum over \(\{d_j\}^{l_d}_{j=1}\), with the weights determined by \(q_i\).

4. Intra-Attention

\[x_i':=\sum^{l}_{j=1}\dfrac{exp(s_{ij})}{\sum_{k=1}^{l}exp(s_{ik})}x_j\]

This is the self-attention mechanism. Compared with Attention Is All You Need, the main difference is probably just how the similarity matrix is computed.
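The sketch below illustrates both alignment-pooling and intra-attention as the same softmax-weighted-sum operation (shape conventions for s are my own assumption):

```python
import numpy as np

def align(logits, values):
    """Row-wise softmax over the logits, then a weighted sum of the values."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ values

rng = np.random.default_rng(0)
l_q, l_d, dim = 5, 7, 8
Q, D = rng.normal(size=(l_q, dim)), rng.normal(size=(l_d, dim))
s = rng.normal(size=(l_q, l_d))          # affinity matrix, s[i, j] ~ F(q_i)^T F(d_j)

Q_aligned = align(s, D)                  # q_i': for each query word, a weighted sum over d
D_aligned = align(s.T, Q)                # d_i': for each document word, a weighted sum over q

# Intra-attention is the same operation applied to a single sequence,
# with the affinity matrix built from that sequence against itself.
s_intra = rng.normal(size=(l_q, l_q))
Q_intra = align(s_intra, Q)              # x_i': weighted sum over the same sequence
print(Q_aligned.shape, D_aligned.shape, Q_intra.shape)  # (5, 8) (7, 8) (5, 8)
```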

Multi-Cast Attention

Casted Attention

Let \(x\) denote q or d, and let \(\overline x\) denote the sequence representation after co-attention or soft-attention alignment, i.e. \(q', d'\).

\[f_c=F_c([\overline x; x])\] \[f_m=F_c(\overline x \circ x)\] \[f_s=F_c(\overline x-x)\]

where \(\circ\) is the Hadamard product and \(F_c(.)\) is a compression function that compresses the feature vector to a scalar.

Intuitively, what is achieved here is that we are modeling the influence of co-attention by comparing representations before and after co-attention. For soft-attention alignment, a critical note here is that x and \(\overline x\) (though of equal lengths) have ‘exchanged’ semantics. In other words, in the case of q, \(\overline q\) actually contains the aligned representation of d.

Compression Function

The rationale for compression is simple and intuitive - we do not want to bloat subsequent layers with a high dimensional vector which consequently incurs parameter costs in subsequent layers. We investigate the usage of three compression functions, which are capable of reducing a n dimensional vector to a scalar.
This section defines the compression functions used for \(F_c(.)\). We do not want to bloat subsequent layers with high-dimensional vectors, which would incur extra parameter costs; the paper therefore investigates three compression functions, each reducing an n-dimensional vector to a scalar.

  • Sum
    The Sum (SM) function is a non-parametric function that sums the whole vector and outputs a scalar: \[F(x)=\sum_i^nx_i,\; x_i\in x\]

  • Neural networks \[F(x)=ReLU(W_cx+b)\] where \(W_c\in R^{n\times 1}\)

  • Factorization Machines: a factorization machine is a general-purpose machine learning technique that takes a real-valued feature vector \(x\in R^n\) and returns a scalar output.

FM is an expressive model that captures pairwise interactions between features using factorized parameters; k is the number of factors of the FM model. The standard second-order form is reproduced below.
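For reference, the standard second-order factorization machine scoring function (the usual FM formulation, written here from general knowledge rather than copied from the paper) is: \[F(x)=w_0+\sum_{i=1}^n w_i x_i+\sum_{i=1}^n\sum_{j=i+1}^n \langle v_i,v_j\rangle x_i x_j\] where \(w_0\in R\), \(w\in R^n\), and each \(v_i\in R^k\) holds the factorized parameters for the pairwise interactions.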

Multi-Cast

The key idea behind the architecture is to cast attention k times, each cast augmenting the original word vectors with real-valued attention hints. For each query-document pair, co-attention with max-pooling, co-attention with mean-pooling, and co-attention with alignment-pooling are applied; in addition, intra-attention is applied to the query and the document separately. Each attention cast produces three scalars per word, which are concatenated to the word vector, so the final casted feature vector is \(z\in R^{12}\).

Thus, for each word \(w_{i}\), the new representation becomes \(\bar{w_{i}}=[w_{i};z_{i}]\).
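A self-contained numpy sketch of assembling z for one word, using Sum compression for all casts and random vectors as stand-ins for the four attentive views (both are simplifying assumptions on my part):

```python
import numpy as np

def cast(x, x_bar):
    """Three casted scalars for one word and one attention view,
    using Sum compression of [x_bar; x], x_bar * x and x_bar - x."""
    return np.array([np.concatenate([x_bar, x]).sum(),
                     (x_bar * x).sum(),
                     (x_bar - x).sum()])

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)                           # original word vector w_i
views = [rng.normal(size=d) for _ in range(4)]   # stand-ins for the max / mean / align / intra views

z = np.concatenate([cast(x, v) for v in views])  # 4 casts * 3 scalars = 12 features
w_bar = np.concatenate([x, z])                   # augmented representation [w_i ; z_i]
print(z.shape, w_bar.shape)                      # (12,) (20,)
```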

Long Short-Term Memory Encoder

Next, the word representations with casted attention, \(\bar{w_{1}},\bar{w_{2}},...,\bar{w_{l}}\), are passed into a sequence encoder layer; a standard long short-term memory (LSTM) encoder is used.

As such, the key idea behind casting attention as features right before this layer is that it provides the LSTM encoder with hints that provide information such as (1) longterm and global sentence knowledge and (2) knowledge between sentence pairs (document and query).
The LSTM shares weights between document and query. The LSTM encoder learns representations of sequential dependencies through gated nonlinear transformations; casting attention as features right before this layer therefore gives the LSTM encoder hints that carry (1) long-term and global sentence knowledge and (2) knowledge shared between the sentence pair (document and query).

  • Pooling Operation
    Finally, a pooling function is applied to each sentence's hidden states \(h_{1},...h_{l}\) to convert the sequence into a fixed-dimensional representation: \[h=MeanMax[h_1,...,h_l]\]

So the resulting final representations of q and d are each a fixed [1, hidden_size]-style vector?
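Presumably MeanMax here means concatenating the mean- and max-pooled hidden states, so the fixed representation is actually twice the LSTM hidden size; a minimal sketch:

```python
import numpy as np

def mean_max_pool(H):
    """Concatenate mean- and max-pooling over the time axis.

    H: (l, h) hidden states of one sequence -> returns a (2*h,) vector.
    """
    return np.concatenate([H.mean(axis=0), H.max(axis=0)])

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 16))      # 10 time steps, hidden size 16
print(mean_max_pool(H).shape)      # (32,)
```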

Prediction Layer and Optimization

\[y_{out} = H_2(H_1([x_q; x_d ; x_q \circ x_d ; x_q − x_d ]))\]

where \(H_1,H_2\) are highway network layers with ReLU activations. The output is then passed to a final linear softmax layer:

\[y_{pred} = softmax(W_F · y_{out} + b_F )\]

where \(W_F\in R^{h\times 2}, b_F\in R^2\).
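A rough numpy sketch of the prediction layer; to keep it short the two highway layers \(H_1, H_2\) are simplified to plain ReLU projections, and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(x_q, x_d, W1, W2, W_F, b_F):
    """Match features -> two ReLU layers (stand-ins for the highway layers) -> softmax."""
    feats = np.concatenate([x_q, x_d, x_q * x_d, x_q - x_d])  # [x_q; x_d; x_q o x_d; x_q - x_d]
    y_out = np.maximum(0.0, W2 @ np.maximum(0.0, W1 @ feats))
    return softmax(W_F.T @ y_out + b_F)                       # distribution over 2 classes

rng = np.random.default_rng(0)
h = 16
x_q, x_d = rng.normal(size=h), rng.normal(size=h)
W1, W2 = rng.normal(size=(h, 4 * h)), rng.normal(size=(h, h))
W_F, b_F = rng.normal(size=(h, 2)), np.zeros(2)
print(predict(x_q, x_d, W1, W2, W_F, b_F))  # two probabilities summing to 1
```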

Training uses the multi-class cross-entropy loss with L2 regularization.