Paper Notes: Multi-Cast Attention Networks

paper: [Multi-Cast Attention Networks for Retrieval-based Question Answering and Response Prediction](https://arxiv.org/abs/1806.00778)

Motivation

Our approach performs a series of soft attention operations, each time casting a scalar feature upon the inner word embeddings. The key idea is to provide a real-valued hint (feature) to a subsequent encoder layer, targeted at improving the representation learning process.

The key idea of attention is to extract only the most relevant information that is useful for prediction. In the context of textual data, attention learns to weight words and sub-phrases within documents based on how important they are. In the same vein, co-attention mechanisms [5, 28, 50, 54] are a form of attention mechanisms that learn joint pairwise attentions, with respect to both document and query.


Attention is traditionally used and commonly imagined as a feature extractor. Its behavior can be thought of as a dynamic form of pooling, as it learns to select and compose different words to form the final document representation.

This paper re-imagines attention as a form of feature augmentation method. Attention is cast not for the purpose of compositional learning or pooling, but to provide hints for subsequent layers. To the best of our knowledge, this is a new way to exploit attention in neural ranking models.

An obvious drawback which applies to many existing models is that they are generally restricted to one attention variant. In the case where one or more attention calls are used (e.g., co-attention and intra-attention, etc.), concatenation is generally used to fuse representations [20, 28]. Unfortunately, this incurs cost in subsequent layers by doubling the representation size per call.

The rationale for desiring more than one attention call is intuitive. In [20, 28], Co-Attention and Intra-Attention are both used because each provides a different view of the document pair, learning high quality representations that could be used for prediction. Hence, this can significantly improve performance.

• [28] [Inter-Weighted Alignment Network for Sentence Pair Modeling](https://aclanthology.info/pdf/D/D17/D17-1122.pdf)

Moreover, Co-Attention also comes in different flavors and can either be used with extractive max-mean pooling [5, 54] or alignment-based pooling [3, 20, 28]. Each co-attention type produces different document representations. In max-pooling, signals are extracted based on a word’s largest contribution to the other text sequence. Mean-pooling calculates its contribution to the overall sentence. Alignment-pooling is another flavor of co-attention, which aligns semantically similar sub-phrases together.

In short, each flavor of co-attention pooling provides a different view of the sentence pair.

• [20] [A Decomposable Attention Model for Natural Language Inference](https://arxiv.org/abs/1606.01933)

Our approach is targeted at serving two important purposes: (1) It removes the need for architectural engineering of this component by enabling attention to be called for an arbitrary k times with hardly any consequence and (2) concurrently it improves performance by modeling multiple views via multiple attention calls. As such, our method is in similar spirit to multi-headed attention, albeit efficient. To this end, we introduce Multi-Cast Attention Networks (MCAN), a new deep learning architecture for a potpourri of tasks in the question answering and conversation modeling domains.


In our approach, attention is cast, in contrast to most other works that use it as a pooling operation. We cast co-attention multiple times, each time returning a compressed scalar feature that is re-attached to the original word representations. The key intuition is that compression enables scalable casting of multiple attention calls, aiming to provide subsequent layers with a hint of not only global knowledge but also cross-sentence knowledge.

Model Architecture: Multi-Cast Attention Networks

Figure 1: Illustration of our proposed Multi-Cast Attention Networks (Best viewed in color). MCAN is a wide multi-headed attention architecture that utilizes compression functions and attention as features.

Input Encoder

Highway Encoder

Highway networks make it possible to optimize networks of arbitrary depth. This is achieved through a gating mechanism that controls the flow of information through the network. With this mechanism, the network can provide paths along which information passes without loss; such paths are called information highways. In other words, the main problem highway networks solve is that training becomes difficult as depth grows and gradient flow back through the network is impeded.

Highway encoders can be interpreted as data-driven word filters. As such, we can imagine them parametrically learning which words have an inclination to be important or unimportant to the task at hand, e.g., filtering stop words and words that usually do not contribute much to the prediction. Similar to recurrent models that are gated in nature, this highway encoder layer controls how much information (of each word) flows to the subsequent layers.

$$y=H(x,W_H)\cdot T(x,W_T) + (1-T(x,W_T))\cdot x$$
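As a concrete illustration, here is a minimal PyTorch sketch of a single highway layer implementing the equation above (assuming, as is common, ReLU for $H$ and a sigmoid gate for $T$; names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighwayEncoder(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)  # transform H(x, W_H)
        self.t = nn.Linear(dim, dim)  # gate T(x, W_T)

    def forward(self, x):
        h = F.relu(self.h(x))         # candidate transformation
        t = torch.sigmoid(self.t(x))  # gate in [0, 1]: how much to transform
        return h * t + x * (1 - t)    # blend transformed and untouched input
```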

Co-Attention

Co-Attention [50] is a pairwise attention mechanism that enables attending to text sequence pairs jointly. This section introduces four attention variants: (1) max-pooling, (2) mean-pooling, (3) alignment-pooling, and (4) intra-attention (or self-attention).

1. Affinity/Similarity Matrix

$$s_{ij}=F(q_i)^T F(d_j)$$

$$s_{ij}=q_i^T M d_j, \quad s_{ij}=F([q_i; d_j])$$
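A quick PyTorch sketch of these three scoring variants (shapes and the learnable parameters M and F_mlp are illustrative assumptions):

```python
import torch
import torch.nn as nn

l_q, l_d, dim = 5, 7, 300
q, d = torch.randn(l_q, dim), torch.randn(l_d, dim)

# (1) dot product after a shared projection F (identity here for brevity)
s_dot = q @ d.t()                                   # (l_q, l_d)

# (2) bilinear: s_ij = q_i^T M d_j
M = torch.randn(dim, dim)
s_bilinear = q @ M @ d.t()                          # (l_q, l_d)

# (3) MLP over concatenation: s_ij = F([q_i; d_j])
F_mlp = nn.Linear(2 * dim, 1)
pairs = torch.cat([q.unsqueeze(1).expand(-1, l_d, -1),
                   d.unsqueeze(0).expand(l_q, -1, -1)], dim=-1)  # (l_q, l_d, 2*dim)
s_concat = F_mlp(pairs).squeeze(-1)                 # (l_q, l_d)
```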

2. Extractive Pooling

max-pooling

$$q'=\text{soft}(\max_{col}(s))^T q, \quad d'=\text{soft}(\max_{row}(s))^T d$$

soft(·) denotes the softmax function; $q'$ and $d'$ are the co-attentive representations of q and d respectively.

mean-pooling

$$q'=\text{soft}(\text{mean}_{col}(s))^T q, \quad d'=\text{soft}(\text{mean}_{row}(s))^T d$$

Each pooling operator has a different impact and can be intuitively understood as follows: max-pooling selects each word based on its maximum importance over all words in the other text, while mean-pooling performs a more wholesome comparison, attending to a word based on its overall influence on the other text. The choice is usually dataset-dependent, treated as a hyperparameter, and tuned to see which performs best on the held-out set.
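The following sketch computes both extractive variants, assuming a plain dot-product affinity matrix (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def extractive_coattention(q, d, pooling="max"):
    """q: (l_q, dim), d: (l_d, dim); returns co-attentive vectors q', d'."""
    s = q @ d.t()                                    # affinity matrix, (l_q, l_d)
    if pooling == "max":
        q_scores, d_scores = s.max(dim=1).values, s.max(dim=0).values
    else:                                            # mean-pooling
        q_scores, d_scores = s.mean(dim=1), s.mean(dim=0)
    q_prime = F.softmax(q_scores, dim=0) @ q         # (dim,) weighted sum of q words
    d_prime = F.softmax(d_scores, dim=0) @ d         # (dim,) weighted sum of d words
    return q_prime, d_prime
```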

3. Alignment-Pooling

$$d_i' := \sum_{j=1}^{l_q} \dfrac{\exp(s_{ij})}{\sum_{k=1}^{l_q} \exp(s_{ik})} q_j$$

$$q_i' := \sum_{j=1}^{l_d} \dfrac{\exp(s_{ij})}{\sum_{k=1}^{l_d} \exp(s_{ik})} d_j$$

$q_i'$ is the soft alignment of $q_i$ with d. That is, $q_i'$ is a weighted sum of $\{d_j\}_{j=1}^{l_d}$, with weights given by $q_i$'s attention over the words of d.
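In code, both aligned sequences fall out of one softmax over each axis of the affinity matrix (dot-product affinity assumed again):

```python
import torch
import torch.nn.functional as F

def alignment_pooling(q, d):
    """q: (l_q, dim), d: (l_d, dim); returns q', d' of the same shapes."""
    s = q @ d.t()                          # affinity matrix, (l_q, l_d)
    q_prime = F.softmax(s, dim=1) @ d      # q'_i: weighted sum over d, (l_q, dim)
    d_prime = F.softmax(s, dim=0).t() @ q  # d'_i: weighted sum over q, (l_d, dim)
    return q_prime, d_prime
```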

4. Intra-Attention

$$x_i' := \sum_{j=1}^{l} \dfrac{\exp(s_{ij})}{\sum_{k=1}^{l} \exp(s_{ik})} x_j$$
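Intra-attention is the same alignment applied to a sequence against itself (again with an assumed dot-product affinity):

```python
import torch
import torch.nn.functional as F

def intra_attention(x):
    """x: (l, dim); each x'_i is a weighted sum over the same sequence."""
    s = x @ x.t()                    # self-affinity matrix, (l, l)
    return F.softmax(s, dim=1) @ x   # (l, dim)
```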

Multi-Cast Attention

Casted Attention

$$f_c = F_c([\overline{x}; x])$$

$$f_m = F_m(\overline{x} \circ x)$$

$$f_s = F_s(\overline{x} - x)$$

Intuitively, what is achieved here is that we are modeling the influence of co-attention by comparing representations before and after co-attention. For soft-attention alignment, a critical note here is that x and $\overline x$ (though of equal lengths) have ‘exchanged’ semantics. In other words, in the case of q, $\overline q$ actually contains the aligned representation of d.
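A sketch of one cast with the simple sum compression, so each attention call contributes three scalar features per word (shapes are illustrative):

```python
import torch

def cast_features(x, x_bar):
    """x, x_bar: (l, dim) word representations before/after an attention call.
    Returns (l, 3): one concat, multiply, and subtract feature per word."""
    f_c = torch.cat([x_bar, x], dim=-1).sum(dim=-1)  # F_c([x_bar; x]), sum compression
    f_m = (x_bar * x).sum(dim=-1)                    # F_m(x_bar o x)
    f_s = (x_bar - x).sum(dim=-1)                    # F_s(x_bar - x)
    return torch.stack([f_c, f_m, f_s], dim=-1)      # re-attached to the word embeddings
```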

Compression Function

The rationale for compression is simple and intuitive: we do not want to bloat subsequent layers with high-dimensional vectors, which would incur parameter costs in those layers. We investigate the usage of three compression functions, each capable of reducing an n-dimensional vector to a scalar.

• Sum

The Sum (SM) function is non-parameterized; it sums the whole vector and outputs a scalar:

$$F(x)=\sum_{i=1}^{n} x_i, \quad x_i \in x$$

• Neural networks

$$F(x)=\text{ReLU}(W_c x + b)$$

• Factorization Machines

Factorization Machines (FM) are expressive models that capture pairwise interactions between features using factorized parameters; k is the number of factors of the FM model (the standard FM scoring function is reproduced below).
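The note above omits the formula; for reference, the standard degree-2 FM function, which maps an n-dimensional input to a scalar, is:

$$F(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

where $w_0$ is a global bias, $w_i$ are linear weights, and $v_i \in \mathbb{R}^k$ are the factor vectors whose inner products model pairwise feature interactions.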

Long Short-Term Memory Encoder

As such, the key idea behind casting attention as features right before this layer is that it provides the LSTM encoder with hints carrying information such as (1) long-term and global sentence knowledge and (2) knowledge between sentence pairs (document and query).

The LSTM shares weights between document and query. Its key idea is to learn representations that model sequential dependencies, using nonlinear transformations as gating functions.
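A minimal sketch of this stage, assuming the cast features (e.g., the (l, 3) features from above) are simply concatenated to each word vector before a single LSTM shared by both sequences (hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

dim, n_feats, hidden = 300, 3, 200
shared_lstm = nn.LSTM(dim + n_feats, hidden, batch_first=True)

def encode(x, feats):
    """x: (batch, l, dim) word vectors; feats: (batch, l, n_feats) cast features."""
    h, _ = shared_lstm(torch.cat([x, feats], dim=-1))  # same weights for q and d
    return h                                           # (batch, l, hidden)
```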

• Pooling Operation

$$h = \text{MeanMax}([h_1, \ldots, h_l])$$
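MeanMax simply concatenates mean- and max-pooled hidden states into one sentence vector:

```python
import torch

def mean_max(h):
    """h: (l, d) LSTM hidden states; returns a (2d,) sentence vector."""
    return torch.cat([h.mean(dim=0), h.max(dim=0).values])
```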

Prediction Layer and Optimization

$$y_{out} = H_2(H_1([x_q; x_d; x_q \circ x_d; x_q - x_d]))$$

$$y_{pred} = \text{softmax}(W_F \cdot y_{out} + b_F)$$
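Putting the two equations together, here is a sketch of the prediction layer, assuming $H_1$ and $H_2$ are highway layers and reusing the HighwayEncoder class from the Input Encoder section above:

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    def __init__(self, d, n_classes):
        super().__init__()
        self.h1 = HighwayEncoder(4 * d)         # H1 over the matching vector
        self.h2 = HighwayEncoder(4 * d)         # H2 stacked on top
        self.out = nn.Linear(4 * d, n_classes)  # W_F, b_F

    def forward(self, xq, xd):
        # [x_q; x_d; x_q o x_d; x_q - x_d]
        v = torch.cat([xq, xd, xq * xd, xq - xd], dim=-1)
        return torch.softmax(self.out(self.h2(self.h1(v))), dim=-1)
```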