• Word Vectors
  • Skip-gram
  • Continuous Bag of words(CBOW)
  • Negative Sampling
  • Hierarchical SoftmaxM
  • Word2Vec

1. How to represent words?

With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Eu-clidean, etc).

2. Word Vectors

encode word tokens into some vector(N-dimensional space, N << 13 million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfre using speech.

one-hot vector: V is the size of vocabulary.

each word is a completely independent entity.

3. Iteration Based Methods - Word2Vec

  • 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.

  • 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.

language model

Unigram model :

\[P(w_1,w_2,...,w_n) = \prod_{i=1}^nP(w_i)\]

bigram model:

\[P(w_1,w_2,...,w_n) = \prod_{i=1}^nP(w_i|w_{i-1})\]

4. Continuous bag of words model(CBOW)

predict center word from the context.

\[ \prod_{c=1}^{n}P(w^{(c)}|w^{(c-m)},...,w^{(c-1)},w^{(c+1)},...,w^{(c+m)})\]

negative log likelihood:

\[J(\theta)= -\sum_{c=1}^{n}logP(w^{(c)}|w^{(c-m)},...,w^{(c-1)},w^{(c+1)},...,w^{(c+m)})\]

the words of context to generate the center word is dependent:

\[J(\theta) = \dfrac{1}{n}\sum_{c=1}^T\sum_{-m\le j\le m}logp(w_c|w_{c+j})\]

how to present this probability???

To one sentence:

\[ \begin{align} minimize J &= -logP(w_c|w_{c-m},..,w_{c-1},w_{c+1},...,w_{c+m})\\ &= -log P(u_c|\hat v)\tag{1}\\ &= -log \dfrac{exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}exp(u_j^T\hat v)}\tag{2}\\ &= -u_c^T\hat v + log\sum_{j=1}^{|V|}exp(u_j^T\hat v) \end{align} \]

important: from word to vector the (1) to (2), using the softmax to present the probability

\[P(u_c|\hat v) = \dfrac{exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}exp(u_j^T\hat v)}\tag{* }\]

其实word2vec可以理解为两个word,他们的上下文越相似,那么他们俩的词向量表示也就越相似.比如 he 和 she 大多数情况下他们的语境,也就是上下文出现的单词v都是很接近的,那么同样与这些词内积得到的概率就会差不多~说到底,也是个频率统计的方法,只不过用了无监督学习这个方式来得到distribution vector了~从这个角度理解就很合理了。错误的理解是 u 和v 出现在同一个窗口,他们的内积的概率就越大,这无法解释任何东西。

4.1 We can use an simple neural networt to train this matrix weights

input: \(x^{(c)}\in R^{|V|\times 1}\), the input one-hot vector of context

labels: \(y^{(c)}\in R^{|V|\times 1}\), the one hot vector of the known center word.


  • \(w_i\): word i from vocabulary V
  • \(V \in R^{n\times |V|}\) input word matrix
  • \(v_i\):i-th column of \(V\), the input vector representation of word \(w_i\)
  • \(U\in R^{|V|\times n}\): output word matrix
  • \(u_i\): i-th row of \(U\), the output vector representation of word \(w_i\)
  • n is an arbitrary size which defines the size of our embedding space

there are some differences with the figure....\(W_1^{n\times |V|}\), \(W_2^{|V|\times n}\)

  • input : \(x_1.shape = (|V|, 1)\), \(x_2.shape = (|V|, 1)\),...,\(x_{2m}.shape = (|V|, 1)\)

  • \(W_1\) : input matrix V, \(W_1.shape = (n, |V|)\) each column is the representation of \(w_i\)

  • hidden layer: \(\hat v = \dfrac{V.dot(x_1)+...+V.dot(x_{2m})}{2m}\), \(\hat v.shape = (n,1)\)

  • \(W_2\) : output matrix U, \(W_2.shape = (|V|, n)\) each row is the representation of \(w_i\)

  • score: \(u.shape = (|V|, 1)\)

  • output: \(\hat y = softmax(u)\), \(\hat y.shape=(|V|,1)\)

  • cross entropy: \[H(\hat y, y) = -\sum_{j=1}^{|V|}y_jlog(\hat y_j)\]

Because y is the one hot vector, and i is the index whose value is 1. \[H(\hat y, y) = -y_ilog(\hat y_i)\]

look at the paper word2vec Parameter Learning Explained, it is very cautious, very wonderful!!! The symbols are different from the above.

4.2 one word context inference

4.3 one word context backpropagation

4.4 multi-words context

5. Skip-gram

\[ \begin{align} minimize J &=-logP(w_{c-m},...,w_{c-1},w_{c+1},..,w_{c+m}|w_c)\\ &=-log\prod_{j=0,j\neq m }^{2m}P(w_{c-m+j}|w_c)\\ &=-\sum_{j=0,j\neq m}^{2m}logP(w_{c-m+j}|w_c)\tag{3}\\ &=-\sum_{j=0,j\neq m}^{2m}log\dfrac{exp(u_{c-m+j}^Tv_c)}{\sum_{k=1}^{|V|}exp(u_{k}^Tv_c)}\tag{4}\\ &=-\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^Tv_c+2m\ log\sum_{k=1}^{|V|}exp(u_{k}^Tv_c) \end{align} \]

important: from word to vector the (1) to (2), using the softmax to present the probability \[P(u_{c-m+j}|w_c) = \dfrac{exp(u_{c-m+j}^Tv_c)}{\sum_{j=1}^{|V|}exp(u_j^Tv_c)}\tag{* }\] V is the input matrix, U is the output matrix

5.1 We can use the simple neural networks to train matrix weights

#### 5.2 inference and backpropagation

Skip-gram treats each context word equally: the models computes the probability for each word of appearing in the context independently of its distance to the center word.

6. Optimizing Computational Efficiency

6.1 Hirarchical Softmax

6.2 Negative Sampling

loss function: \[E = -log\sigma(v'^T_{w_O}h)-\sum_{w_j\in W_{neg}}log\sigma(-v'^T_{w_j}h)\]

  • in the CBOW, \(h=\dfrac{1}{C}\sum_{c=1}^Cv_{w_c^T}\)
  • in the skip-gram, \(h=v_{w_I}^T\)
  • how to choose the K negative samples?
    • As described in (Mikolov et al., 2013b), word2vec uses a unigram distribution raised to the 3/4th power for the best quality of results.

关于负采样的原理的理解: Intuitive explanation of Noise Contrastive Estimation (NCE) loss?

The basic idea is to convert a multinomial classification problem (as it is the problem of predicting the next word) to a binary classification problem. That is, instead of using softmax to estimate a true probability distribution of the output word, a binary logistic regression (binary classification) is used instead.
For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and a number of kk randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from corrupted ones, the classifier will ultimately learn the word vectors.
This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.