Text Classification Series 5: Hierarchical Attention Networks

Hierarchical Attention Networks for Document Classification

paper reading

Main idea:

the Hierarchical Attention Network (HAN) that is designed to capture two basic insights about document structure. First, since documents have a hierarchical structure (words form sentences, sentences form a document), we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation. Second, it is observed that different words and sentences in a document are differentially informative.

A document has a hierarchical structure: the document is composed of sentences, and each sentence is composed of words.

the importance of words and sentences are highly context dependent, i.e. the same word or sentence may be differentially important in different context (§3.5). To include sensitivity to this fact, our model includes two levels of attention mechanisms (Bahdanau et al., 2014; Xu et al., 2015) — one at the word level and one at the sentence level — that let the model pay more or less attention to individual words and sentences when constructing the representation of the document.

Words and sentences are highly context dependent: the same word or sentence can have a different importance in different contexts. The paper therefore uses two attention mechanisms to express the importance of context-aware words and sentences. (Here, the context-aware word or sentence representations are the hidden states produced by the RNN.)

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision which can be of value in applications and analysis (Shen et al., 2014; Gao et al., 2014)

Attention not only yields better performance, it also makes it possible to visualize which words and sentences contribute most to the classification of a document.

The novelty of this paper is that it models the sentence level of a document's structure. For document classification, the first few sentences may be uninformative while the last sentence contains a turn that is decisive for the label; earlier work only considered the words in a document.

Model Architecture

GRU-based sequence encoder

reset gate: controls how much the past state contributes to the candidate state. \[r_t=\sigma(W_rx_t+U_rh_{t-1}+b_r)\]

candidate state: \[\tilde h_t=tanh(W_hx_t+r_t\circ (U_hh_{t-1})+b_h)\]

update gate: decides how much past information is kept and how much new information is added. \[z_t=\sigma(W_zx_t+U_zh_{t-1}+b_z)\]

new state: a linear interpolation between the previous state \(h_{t-1}\) and the current new state \(\tilde h_t\) computed with new sequence information. \[h_t=(1-z_t)\circ h_{t-1}+z_t\circ \tilde h_t\]
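
The four equations above can be read off directly as one update step. Below is a minimal NumPy sketch of a single GRU step; the parameter names (`W_r`, `U_r`, `b_r`, ...) mirror the equations and are illustrative, not taken from any library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update: returns h_t given input x_t and previous state h_prev."""
    W_r, U_r, b_r = params["W_r"], params["U_r"], params["b_r"]
    W_z, U_z, b_z = params["W_z"], params["U_z"], params["b_z"]
    W_h, U_h, b_h = params["W_h"], params["U_h"], params["b_h"]

    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate
    h_tilde = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev) + b_h)   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # new state h_t
```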

Hierarchical Attention

Word Encoder

\[x_{it}=W_ew_{it}, t\in [1, T]\] \[\overrightarrow h_{it}=\overrightarrow {GRU}(x_{it}),t\in[1,T]\] \[\overleftarrow h_{it}=\overleftarrow {GRU}(x_{it}),t\in [T,1]\]

\[h_{it} = [\overrightarrow h_{it},\overleftarrow h_{it}]\]

Here \(i\) indexes the \(i^{th}\) sentence in the document, and \(t\) indexes the \(t^{th}\) word in the sentence.
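
A hedged NumPy sketch of the word encoder for one sentence, reusing `gru_step` from the snippet above: look up the embedding \(x_{it}=W_ew_{it}\), run a forward and a backward GRU, and concatenate the two hidden states at each position. All names and shapes are illustrative.

```python
import numpy as np

def init_gru_params(input_dim, hidden_dim, rng):
    """Random GRU parameters; shapes follow the equations in the GRU section."""
    def mat(m, n):
        return rng.standard_normal((m, n)) * 0.1
    return {
        "W_r": mat(hidden_dim, input_dim), "U_r": mat(hidden_dim, hidden_dim), "b_r": np.zeros(hidden_dim),
        "W_z": mat(hidden_dim, input_dim), "U_z": mat(hidden_dim, hidden_dim), "b_z": np.zeros(hidden_dim),
        "W_h": mat(hidden_dim, input_dim), "U_h": mat(hidden_dim, hidden_dim), "b_h": np.zeros(hidden_dim),
    }

def bi_gru_encode(xs, fwd_params, bwd_params, hidden_dim):
    """Run a forward and a backward GRU over xs and return [h_fwd; h_bwd] per step."""
    T = len(xs)
    h_fwd, h_bwd = np.zeros(hidden_dim), np.zeros(hidden_dim)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                      # t = 1..T
        h_fwd = gru_step(xs[t], h_fwd, fwd_params)
        fwd[t] = h_fwd
    for t in reversed(range(T)):            # t = T..1
        h_bwd = gru_step(xs[t], h_bwd, bwd_params)
        bwd[t] = h_bwd
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

def encode_sentence(word_ids, W_e, fwd_params, bwd_params, hidden_dim):
    """x_it = W_e w_it (embedding lookup), then a bidirectional GRU over the words."""
    xs = [W_e[w] for w in word_ids]         # rows of W_e are word vectors (vocab x embed)
    return bi_gru_encode(xs, fwd_params, bwd_params, hidden_dim)

# Example: a 5-word sentence, vocabulary of 1000, 50-d embeddings, 64-d hidden states.
rng = np.random.default_rng(0)
W_e = rng.standard_normal((1000, 50)) * 0.1
fwd, bwd = init_gru_params(50, 64, rng), init_gru_params(50, 64, rng)
h_its = encode_sentence([3, 17, 42, 8, 99], W_e, fwd, bwd, hidden_dim=64)
```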

Word Attention

Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.

At its core, the attention mechanism assigns a weight to each context-aware word in the sentence. The key question is how this weight is determined.

\[u_{it}=tanh(W_wh_{it}+b_w)\] \[\alpha_{it}=\dfrac{exp(u_{it}^Tu_w)}{\sum_t^Texp(u_{it}^Tu_w)}\] \[s_i=\sum_t^T\alpha_{it}h_{it}\]

Here \(h_{it}\) is first fed through a fully connected layer to obtain the hidden representation \(u_{it}\); the similarity between \(u_{it}\) and \(u_w\) is then computed and normalized with a softmax to get each word's weight. The more similar a word is to \(u_w\), the larger its weight and the more it contributes to the sentence vector.

The key question, then, is how \(u_w\) is represented.

The context vector \(u_w\) can be seen as a high level representation of a fixed query “what is the informative word” over the words like that used in memory networks (Sukhbaatar et al., 2015, End-to-end memory networks.; Kumar et al., 2015, Ask me anything: Dynamic memory networks for natural language processing.). The word context vector \(u_w\) is randomly initialized and jointly learned during the training process.
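
Putting the three equations together, a hedged NumPy sketch of word attention over the context-aware word states \(h_{it}\) produced by `encode_sentence` above. `W_w`, `b_w` and the word context vector `u_w` are learned parameters; here they are just arrays with illustrative shapes.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))    # shift for numerical stability
    return e / e.sum()

def word_attention(h_list, W_w, b_w, u_w):
    """s_i = sum_t alpha_it * h_it, with alpha_it = softmax over u_it^T u_w."""
    H = np.stack(h_list)                   # (T, 2*hidden_dim)
    U = np.tanh(H @ W_w.T + b_w)           # u_it = tanh(W_w h_it + b_w), shape (T, d_a)
    scores = U @ u_w                       # u_it^T u_w, one score per word
    alpha = softmax(scores)                # attention weights over the words
    s_i = alpha @ H                        # weighted sum of word states -> sentence vector
    return s_i, alpha

# Usage with the earlier example: W_w is (d_a, 2*hidden_dim), b_w and u_w are (d_a,).
# s_i, alpha = word_attention(h_its, W_w, b_w, u_w)
```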

Sentence Encoder

\[\overrightarrow h_{i}=\overrightarrow {GRU}(s_{i}),i\in[1,L]\] \[\overleftarrow h_{i}=\overleftarrow {GRU}(s_{i}),i\in [L,1]\]

\[H_i=[\overrightarrow h_{i}, \overleftarrow h_{i}]\]

\(H_i\) summarizes the neighboring sentences around sentence \(i\) but still focuses on sentence \(i\).

Sentence Attention

\[u_i=tanh(W_sH_i+b_s)\] \[\alpha_i=\dfrac{exp(u_i^Tu_s)}{\sum_i^Lexp(u_i^Tu_s)}\] \[v = \sum_i^L\alpha_iH_i\]

Similarly, \(u_s\) is a sentence-level context vector; like \(u_w\), it is randomly initialized and jointly learned during training.
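
The sentence level has exactly the same form as the word level, so a hedged sketch can reuse `bi_gru_encode` and `word_attention` from the snippets above, applied to the sentence vectors \(s_i\); `sent_fwd`, `sent_bwd`, `W_s`, `b_s`, `u_s` are illustrative sentence-level parameters.

```python
def encode_document(sentence_vectors, sent_fwd, sent_bwd, hidden_dim, W_s, b_s, u_s):
    """Bidirectional GRU over the sentence vectors s_i, then sentence attention."""
    # The sentence-level GRUs take the sentence vectors s_i as input, so their
    # input dimension is 2 * (word-level hidden size).
    H = bi_gru_encode(sentence_vectors, sent_fwd, sent_bwd, hidden_dim)
    v, alpha = word_attention(H, W_s, b_s, u_s)   # identical attention form, sentence level
    return v, alpha
```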

Document Classification

The document vector v is a high level representation of the document and can be used as features for document classification: \[p=softmax(W_cv+b_c)\]
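
A small NumPy sketch of this classifier on top of the document vector \(v\); `W_c` and `b_c` stand for the classification layer's weights, and training would minimize the negative log-likelihood of the correct label.

```python
import numpy as np

def classify(v, W_c, b_c):
    """p = softmax(W_c v + b_c): a probability distribution over document classes."""
    logits = W_c @ v + b_c
    e = np.exp(logits - np.max(logits))    # stable softmax
    return e / e.sum()
```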

Code Implementation

Issues to keep in mind:

  • Visualization with TensorBoard (see the sketch after this list)
  • Variable scope handling (see the sketch after this list)
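
Both notes arise because the word-level and sentence-level encoders share the same structure, so their variables need distinct names for correct reuse and for a readable TensorBoard graph. A minimal TensorFlow 1.x-style sketch (under TensorFlow 2 the same calls live in `tf.compat.v1`); the scope names and shapes are illustrative, not from the paper.

```python
import tensorflow as tf

# Give each level of the model its own variable scope so variables get
# distinct, readable names (word_encoder/u_w vs. sentence_encoder/u_s).
with tf.variable_scope("word_encoder"):
    # word-level bidirectional GRU and attention parameters live here
    u_w = tf.get_variable("u_w", shape=[200])

with tf.variable_scope("sentence_encoder"):
    # sentence-level bidirectional GRU and attention parameters live here
    u_s = tf.get_variable("u_s", shape=[200])

# Writing out the default graph lets TensorBoard display the scoped structure.
writer = tf.summary.FileWriter("logs", graph=tf.get_default_graph())
writer.close()
```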

Context dependent attention weights

Visualization of attention