### What does word2vec capture?

• Go through each word of the whole corpus

• Predict surrounding words of each word

• word2vec captures co-occurrence of words one at a time

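Why not capture co-occurrence counts directly?

word2vec treats each window as a training unit, updating the parameters once per window (or once per few windows). Yet many word sequences recur very frequently. Could we instead make a single pass over the corpus and get the result quickly?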

### Window-based Co-occurrence Matrix
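A window-based co-occurrence matrix can be built in a single pass over the corpus. A minimal sketch, assuming a symmetric window of size 1 and a toy three-sentence corpus (the corpus and the function name `cooccurrence_counts` are illustrative, not part of the notes):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count symmetric co-occurrences within `window` tokens on each side."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts

# Toy corpus (an assumption chosen for illustration).
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
X = cooccurrence_counts(corpus, window=1)

# Lay the counts out as a dense |V| x |V| matrix, rows and columns in vocab order.
vocab = sorted({w for sent in corpus for w in sent})
matrix = [[X[(a, b)] for b in vocab] for a in vocab]
```

Because the window is symmetric, the resulting matrix is symmetric as well. Its size grows with the square of the vocabulary, which is what motivates the low-dimensional representation.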

Solution: Low dimensional vectors
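One way to obtain such low-dimensional vectors is a truncated SVD of the co-occurrence matrix. The sketch below assumes NumPy is available; the toy matrix `X`, the word order, and the choice `k = 2` are illustrative assumptions:

```python
import numpy as np

# Symmetric window-1 co-occurrence counts for a toy vocabulary
# (illustrative values, not taken from a real corpus).
words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-k singular directions as the word representation.
k = 2
word_vectors = U[:, :k] * s[:k]   # one k-dimensional row per word
```

Each row of `word_vectors` is the reduced representation of the corresponding entry in `words`.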

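Problems with SVD:

• High computational cost: $O(mn^2)$ for an $n \times m$ matrix

• Hard to incorporate new words or new documents

• Its training regime differs from that of other DL models

Count-based vs. direct prediction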

### 2. Combining the best of both worlds: GloVe

GloVe: Global Vectors for Word Representation

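How GloVe works: it uses global statistics to predict the probability of word j appearing in the context of word i, with a least-squares objective. That is, it exploits both global word-count statistics and, as in word2vec, the probability that two words appear in the same window, which is modeled with the inner product of their word vectors.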

$$P(u_c|\hat v) = \dfrac{\exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}\exp(u_j^T\hat v)}\tag{*}$$
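The softmax in (*) can be sketched as follows. The toy vectors and the function name `softmax_probs` are assumptions made for illustration; subtracting the maximum score is a standard numerical-stability trick that leaves the probabilities unchanged:

```python
import math

def softmax_probs(U, v_hat):
    """Return P(u_c | v_hat) for every output vector u_c in U."""
    scores = [sum(a * b for a, b in zip(u, v_hat)) for u in U]
    m = max(scores)                        # shift scores for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)                          # normalizer: a sum over all |V| words
    return [e / Z for e in exps]

U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy output vectors (assumed)
v_hat = [0.5, -0.5]                        # toy context vector (assumed)
probs = softmax_probs(U, v_hat)
```

The normalizer `Z` touches every word in the vocabulary, so a single probability costs $O(|V|)$ inner products.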

#### Co-occurrence Matrix

$$P_{ij}=P(w_j|w_i)=\dfrac{X_{ij}}{X_i}=\dfrac{count(w_i,w_j)}{count(w_i)}$$
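As a small illustration of this definition, the following sketch row-normalizes a toy count matrix. It takes $X_i$ to be the row sum $\sum_k X_{ik}$, which is the convention of the GloVe paper; the matrix values and the function name are made up for the example:

```python
def cooccurrence_probabilities(X):
    """Row-normalize counts: P[i][j] = X[i][j] / X_i, where X_i = sum_k X[i][k]."""
    P = []
    for row in X:
        total = sum(row)
        # Rows with no counts are left as all zeros to avoid dividing by zero.
        P.append([x / total if total else 0.0 for x in row])
    return P

# Toy symmetric count matrix (assumed values).
X = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
P = cooccurrence_probabilities(X)
```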

#### Least Squares Objective

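The skip-gram model estimates the probability that word $j$ appears in the context of word $i$ with a softmax:

$$Q_{ij}=\dfrac{\exp(u_j^Tv_i)}{\sum_{k=1}^{|V|}\exp(u_k^Tv_i)}$$

The global cross-entropy loss over the corpus is

$$J=-\sum_{i\in corpus}\sum_{j\in context(i)}\log Q_{ij}$$

The same pair $(i, j)$ can occur many times, so identical terms can be grouped using the counts $X_{ij}$:

$$J=-\sum_{i=1}^V\sum_{j=1}^VX_{ij}\log Q_{ij}$$

Normalizing $Q_{ij}$ is expensive, so GloVe switches to a least-squares objective on the unnormalized quantities $\hat P_{ij}=X_{ij}$ and $\hat Q_{ij}=\exp(u_j^Tv_i)$:

$$\hat J = \sum_{i=1}^V\sum_{j=1}^VX_i(\hat P_{ij}-\hat Q_{ij})^2$$

Since $X_{ij}$ can be very large, the squared error is taken between the logarithms instead:

$$\begin{aligned} \hat J&=\sum_{i=1}^V\sum_{j=1}^VX_i(\log\hat P_{ij}-\log\hat Q_{ij})^2\\ &=\sum_{i=1}^V\sum_{j=1}^VX_i(u_j^Tv_i-\log X_{ij})^2 \end{aligned}$$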
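The final least-squares form, $\sum_i\sum_j X_i(u_j^Tv_i-\log X_{ij})^2$, can be evaluated directly from a count matrix and two sets of vectors. A minimal sketch, assuming $X_i$ is the row sum of the counts and that pairs with $X_{ij}=0$ are skipped because $\log 0$ is undefined (both are assumptions of this example, as are the function name and toy values):

```python
import math

def glove_least_squares(X, U, V):
    """Sum of X_i * (u_j . v_i - log X_ij)^2 over all pairs with X_ij > 0."""
    loss = 0.0
    for i, row in enumerate(X):
        X_i = sum(row)                     # weight for center word i
        for j, x_ij in enumerate(row):
            if x_ij > 0:                   # skip zero counts (assumption)
                dot = sum(a * b for a, b in zip(U[j], V[i]))
                loss += X_i * (dot - math.log(x_ij)) ** 2
    return loss

X = [[0, 2], [2, 0]]                       # toy counts (assumed)

# Vectors chosen so that u_j . v_i equals log X_ij exactly: zero loss.
U = [[math.log(2)], [1.0]]
V = [[math.log(2)], [1.0]]
perfect = glove_least_squares(X, U, V)

# All-zero vectors: every nonzero pair contributes X_i * (log X_ij)^2.
zeros = glove_least_squares(X, [[0.0], [0.0]], [[0.0], [0.0]])
```

No normalization over the vocabulary appears anywhere in the loop. The published GloVe objective additionally replaces the weight $X_i$ with a weighting function $f(X_{ij})$ and adds bias terms, which this sketch omits.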

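Advantages of GloVe:

• Fast training: it still requires a pass over the whole corpus, but computing each word's probability avoids the expensive softmax that word2vec needs

• Scalable to huge corpora

• Good performance even with small corpora and small vectors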

### 3. Intrinsic evaluation

#### 3.2 Intrinsic Evaluation Tuning Example: Analogy Evaluations

Hyperparameters that can be tuned using the analogy task:

• Dimension of word vectors

• Corpus size

• Corpus source/type

• Context window size

• Context symmetry
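An analogy evaluation asks, for a query "a : b :: c : ?", which word vector is closest to $x_b - x_a + x_c$. The sketch below uses cosine similarity, a common formulation, and excludes the three query words from the candidates. The hand-made 3-dimensional vectors and the function names are assumptions for illustration, not trained embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(x_b - x_a + x_c, x_d), excluding a, b, c."""
    target = [xb - xa + xc
              for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Toy vectors (assumed): the second coordinate separates man/woman,
# the third separates commoner/royal.
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.3, -0.5],
}
answer = solve_analogy(vectors, "man", "woman", "king")
```

Running the same routine over a set of analogy questions and counting correct answers gives an accuracy score that can be compared across the hyperparameter settings listed above.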

#### 3.4 Further Reading: Dealing With Ambiguity

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al., 2012)

Sentiment analysis, named-entity recognition (NER): given a context and a center word, classify the center word into one of many classes.

#### 4.1 Retraining Word Vectors

If we retrain word vectors using the extrinsic task, we need to ensure that the training set is large enough to cover most words from the vocabulary.

### Presentation

Linear algebraic structure of word senses with applications to polysemy

