# Paper Notes - Sentence Embedding

## supervised learning

### a structured self-attentive sentence embedding

#### Model Architecture:

word embedding: $S\in R^{n\times d}$, where d is the word embedding dimension and n the sentence length

$$S=(w_1,w_2,…,w_n)$$

bidirectional LSTM: $H\in R^{n\times 2u}$, where u is the hidden state size of each direction

$$H=(h_1,h_2,…,h_n)$$

single self-attention: $a\in R^n$ holds one weight per position in the sentence (with $W_{s1}\in R^{d_a\times 2u}$ and $w_{s2}\in R^{d_a}$). A weighted sum of the encoded sentence states with these weights yields the attention vector $m=aH\in R^{2u}$.

$$a=softmax(w_{s2}tanh(W_{s1}H^T))$$

r-hop self-attention: extract r such attention vectors by replacing $w_{s2}$ with $W_{s2}\in R^{r\times d_a}$, giving an annotation matrix $A\in R^{r\times n}$; the weighted sums with the encoded sentence representation H yield the embedding matrix $M\in R^{r\times 2u}$

$$A=softmax(W_{s2}tanh(W_{s1}H^T))$$

$$M=AH$$
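A minimal NumPy sketch of the two equations above. The shapes `n`, `u`, `d_a`, `r` and the random weights are illustrative assumptions, not values from the paper; a real model would learn $W_{s1}$, $W_{s2}$ and produce $H$ with a trained BiLSTM.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: sentence length, LSTM hidden size, attention size, hops
n, u, d_a, r = 5, 4, 6, 3
rng = np.random.default_rng(0)

H = rng.standard_normal((n, 2 * u))       # BiLSTM outputs, one 2u-dim state per token
W_s1 = rng.standard_normal((d_a, 2 * u))  # first attention projection
W_s2 = rng.standard_normal((r, d_a))      # r attention hops

# A = softmax(W_s2 tanh(W_s1 H^T)); softmax over the n tokens, so each row of A sums to 1
A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=-1)  # shape (r, n)
M = A @ H                                         # sentence embedding matrix, shape (r, 2u)
```

Each of the r rows of M is one attention-weighted view of the sentence; concatenating (or flattening) them gives the final fixed-size sentence representation.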

#### penalization term

The most natural way to evaluate diversity would be the Kullback–Leibler divergence between any two of the summation weight vectors. However, we found this not very stable in our case. We conjecture it is because we are maximizing a set of KL divergences (instead of minimizing a single one, as is the usual case): we are optimizing the annotation matrix A to have many sufficiently small or even zero values at different softmax output units, and this vast number of zeros makes training unstable. There is also a property we want that KL does not provide: we want each individual row to focus on a single aspect of semantics, i.e. the probability mass in each annotation softmax output should be concentrated, but a KL penalty cannot encourage that.

$$P=||AA^T-I||^2_{F}$$

$AA^T$ acts like a covariance matrix: its diagonal entries are the inner products of each attention vector with itself, and its off-diagonal entries are the inner products between different attention vectors. Adding this term to the original loss encourages the inner products between different vectors to be as small as possible (smaller inner product means greater diversity) and each vector's own norm to be as large as possible. Since each row of A is a probability distribution summing to 1, its self inner product reaches 1 only when the mass concentrates on one or two words, so the penalty also pushes each attention hop to be focused.
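A small sketch of the penalty, with two hand-built annotation matrices (my own toy examples, not from the paper) showing the two extremes: disjoint one-hot rows incur zero penalty, while identical uniform rows are penalized.

```python
import numpy as np

def penalization(A):
    # P = ||A A^T - I||_F^2, where A is the (r, n) annotation matrix
    r = A.shape[0]
    G = A @ A.T                      # (r, r): row self/cross inner products
    return float(np.sum((G - np.eye(r)) ** 2))

# Two hops, each fully focused on a different word: A A^T = I, penalty 0
A_focused = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0]])

# Two identical, maximally spread-out hops: heavily penalized
A_uniform = np.full((2, 4), 0.25)
```

In training this term is scaled by a coefficient and added to the downstream task loss, so the gradient simultaneously sharpens each row of A and decorrelates the rows.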

#### training

3 different datasets:

• the Age dataset

• the Yelp dataset

• the Stanford Natural Language Inference (SNLI) Corpus