Hierarchical Attention Networks for Document Classification

The Hierarchical Attention Network (HAN) is designed to capture two basic insights about document structure. First, since documents have a hierarchical structure (words form sentences, sentences form a document), we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation. Second, different words and sentences in a document are differentially informative.

The importance of words and sentences is highly context dependent, i.e. the same word or sentence may be differentially important in different contexts (§3.5). To include sensitivity to this fact, the model includes two levels of attention mechanisms (Bahdanau et al., 2014; Xu et al., 2015), one at the word level and one at the sentence level, that let the model pay more or less attention to individual words and sentences when constructing the representation of the document.

Both words and sentences are highly context dependent: the same word or sentence can carry different importance in different contexts. The paper therefore uses two attention mechanisms to score the importance of words and sentences with their context taken into account (the context-aware word and sentence representations here are the hidden states produced by the RNN).

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis (Shen et al., 2014; Gao et al., 2014).

Attention not only improves accuracy but also makes it possible to visualize which words and sentences contribute most to a document's classification.

### Model Architecture

#### GRU-based sequence encoder

reset gate: controls how much the past state contributes to the candidate state. $r_t=\sigma(W_rx_t+U_rh_{t-1}+b_r)$

candidate state: $\tilde h_t=\tanh(W_hx_t+r_t\circ (U_hh_{t-1})+b_h)$

update gate: decides how much past information is kept and how much new information is added. $z_t=\sigma(W_zx_t+U_zh_{t-1}+b_z)$

new state: a linear interpolation between the previous state $h_{t-1}$ and the candidate state $\tilde h_t$ computed from the new sequence information. $h_t=(1-z_t)\circ h_{t-1}+z_t\circ \tilde h_t$
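The four equations above can be sketched as a single recurrent step. This is a minimal NumPy illustration, not the paper's implementation; the function name `gru_step`, the parameter layout, and the toy dimensions are my own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following the equations above.

    params maps each of "r" (reset gate), "z" (update gate), and
    "h" (candidate state) to its (W, U, b) triple.
    """
    W_r, U_r, b_r = params["r"]
    W_z, U_z, b_z = params["z"]
    W_h, U_h, b_h = params["h"]
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate
    h_tilde = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev) + b_h)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # interpolation

# toy dimensions: input size 4, hidden size 3
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {k: (rng.normal(size=(d_h, d_in)),
              rng.normal(size=(d_h, d_h)),
              np.zeros(d_h)) for k in ("r", "z", "h")}
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run over a length-5 sequence
    h = gru_step(x, h, params)
```

Because $h_t$ is a convex combination of the previous state and a $\tanh$ candidate, the hidden state stays bounded in $(-1, 1)$.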

#### Hierarchical Attention

##### Word Encoder

$x_{it}=W_ew_{it},\quad t\in [1, T]$

$\overrightarrow h_{it}=\overrightarrow {GRU}(x_{it}),\quad t\in[1,T]$

$\overleftarrow h_{it}=\overleftarrow {GRU}(x_{it}),\quad t\in [T,1]$

$h_{it} = [\overrightarrow h_{it},\overleftarrow h_{it}]$

Here $i$ indexes the $i^{th}$ sentence in the document, and $t$ the $t^{th}$ word in the sentence.
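The bidirectional encoding and concatenation can be sketched as follows. For brevity this uses a plain $\tanh$ recurrence as a stand-in for the GRU; the function names and dimensions are illustrative, not from the paper.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, W, U):
    # stand-in for the GRU step; any recurrent cell fits here
    return np.tanh(W @ x_t + U @ h_prev)

def bi_encode(xs, W_f, U_f, W_b, U_b, d_h):
    """Encode a sequence in both directions; concatenate per position."""
    T = len(xs)
    fwd = np.zeros((T, d_h))
    bwd = np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):                 # forward pass, t = 1..T
        h = simple_rnn_step(xs[t], h, W_f, U_f)
        fwd[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(T)):       # backward pass, t = T..1
        h = simple_rnn_step(xs[t], h, W_b, U_b)
        bwd[t] = h
    return np.concatenate([fwd, bwd], axis=1)  # h_it = [fwd_it, bwd_it]

rng = np.random.default_rng(1)
d_in, d_h, T = 4, 3, 6
xs = rng.normal(size=(T, d_in))
W_f, W_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_in))
U_f, U_b = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
H = bi_encode(xs, W_f, U_f, W_b, U_b, d_h)   # shape (T, 2 * d_h)
```

Each row of `H` is the annotation $h_{it}$, which sees context from both directions around word $t$.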

##### Word Attention

Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce an attention mechanism to extract the words that are important to the meaning of the sentence and aggregate the representations of those informative words to form a sentence vector.

At its core, the attention mechanism assigns a weight to each context-aware word in the sentence. The key question is how that weight is determined:

$u_{it}=\tanh(W_wh_{it}+b_w)$

$\alpha_{it}=\dfrac{\exp(u_{it}^\top u_w)}{\sum_{t=1}^{T}\exp(u_{it}^\top u_w)}$

$s_i=\sum_{t=1}^{T}\alpha_{it}h_{it}$

The context vector $u_w$ can be seen as a high-level representation of a fixed query, "what is the informative word", over the words, similar to that used in memory networks (Sukhbaatar et al., 2015; Kumar et al., 2015). The word context vector $u_w$ is randomly initialized and jointly learned during training.
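Word attention can be sketched directly from the three equations above. This is an illustrative NumPy version with randomly initialized parameters; in the paper, $W_w$, $b_w$, and $u_w$ are all learned during training.

```python
import numpy as np

def word_attention(H, W_w, b_w, u_w):
    """H: (T, d) word annotations h_it; returns sentence vector s_i and weights."""
    U = np.tanh(H @ W_w.T + b_w)          # u_it = tanh(W_w h_it + b_w)
    scores = U @ u_w                      # u_it^T u_w, one scalar per word
    scores -= scores.max()                # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over t
    s = alpha @ H                         # s_i = sum_t alpha_it h_it
    return s, alpha

rng = np.random.default_rng(2)
T, d = 5, 6                               # 5 words, annotation size 6
H = rng.normal(size=(T, d))
W_w = rng.normal(size=(d, d))
b_w = np.zeros(d)
u_w = rng.normal(size=d)
s_i, alpha = word_attention(H, W_w, b_w, u_w)
```

The weights `alpha` sum to 1, so the sentence vector `s_i` is a weighted average of the word annotations; inspecting `alpha` is what makes the attention visualizations in the paper possible.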

##### Sentence Encoder

$\overrightarrow h_{i}=\overrightarrow {GRU}(s_{i}),\quad i\in[1,L]$

$\overleftarrow h_{i}=\overleftarrow {GRU}(s_{i}),\quad i\in [L,1]$

$h_i=[\overrightarrow h_{i}, \overleftarrow h_{i}]$

$h_i$ summarizes the neighboring sentences around sentence $i$ but still focuses on sentence $i$.

##### Sentence Attention

$u_i=\tanh(W_sh_i+b_s)$

$\alpha_i=\dfrac{\exp(u_i^\top u_s)}{\sum_{i=1}^{L}\exp(u_i^\top u_s)}$

$v = \sum_{i=1}^{L}\alpha_ih_i$

#### Document Classification

The document vector $v$ is a high-level representation of the document and can be used as features for document classification: $p=\mathrm{softmax}(W_cv+b_c)$
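The classification head is a single softmax layer over the document vector. A minimal NumPy sketch (the function name and toy dimensions are mine; the paper trains this layer with the negative log likelihood of the correct label):

```python
import numpy as np

def classify(v, W_c, b_c):
    """p = softmax(W_c v + b_c) over the document vector v."""
    logits = W_c @ v + b_c
    logits -= logits.max()                 # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(3)
K, d = 4, 6                                # 4 classes, document vector size 6
v = rng.normal(size=d)
W_c = rng.normal(size=(K, d))
b_c = np.zeros(K)
p = classify(v, W_c, b_c)                  # class probabilities, sum to 1
```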

### Implementation

#### Issues to watch for

• If using TensorBoard for visualization
• Variable scope issues