• Word Vectors
• Skip-gram
• Continuous Bag of Words (CBOW)
• Negative Sampling
• Hierarchical Softmax
• Word2Vec

### 1. How to represent words?

With word vectors, we can easily encode this similarity in the vectors themselves, using distance measures such as Jaccard, cosine, or Euclidean distance.
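As a minimal sketch (the 3-dimensional vectors here are made up, purely for illustration), cosine similarity and Euclidean distance between two word vectors can be computed as:

```python
import numpy as np

# Hypothetical 3-d word vectors, for illustration only.
king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.2])

def cosine(a, b):
    # Cosine similarity: 1 means same direction, 0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    # Euclidean (L2) distance between the two vectors.
    return float(np.linalg.norm(a - b))

print(cosine(king, queen))
print(euclidean(king, queen))
```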

### 2. Word Vectors

Encode word tokens into vectors in an N-dimensional space (N << 13 million) that is sufficient to encode all the semantics of our language. Each dimension would encode some aspect of meaning that we transfer using speech.

one-hot vector: a $|V|$-dimensional vector with a single 1 at the word's index and 0 elsewhere, where $|V|$ is the size of the vocabulary.

Under this encoding, each word is a completely independent entity: any two distinct one-hot vectors are orthogonal, so no notion of similarity is captured.
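This independence is easy to see in code; the sketch below assumes a toy three-word vocabulary:

```python
import numpy as np

vocab = ["hotel", "motel", "cat"]  # toy vocabulary, |V| = 3

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's index.
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Every pair of distinct one-hot vectors is orthogonal:
print(one_hot("hotel") @ one_hot("motel"))  # 0.0 — no similarity encoded
```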

### 3. Iteration Based Methods - Word2Vec

• 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.

• 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.

A language model assigns a probability $P(w_1,w_2,...,w_n)$ to a sequence of words.

Unigram model (assumes words are independent):

$P(w_1,w_2,...,w_n) = \prod_{i=1}^nP(w_i)$

Bigram model (each word depends only on the previous word):

$P(w_1,w_2,...,w_n) = \prod_{i=1}^nP(w_i|w_{i-1})$
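These maximum-likelihood estimates can be computed by counting; a minimal sketch on a toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()

# Unigram MLE: P(w) = count(w) / N
unigram = Counter(corpus)
N = len(corpus)
def p_unigram(w):
    return unigram[w] / N

# Bigram MLE: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram = Counter(zip(corpus, corpus[1:]))
def p_bigram(w, prev):
    return bigram[(prev, w)] / unigram[prev]

print(p_unigram("the"))        # count("the")=2 out of 6 tokens
print(p_bigram("cat", "the"))  # count("the cat")=1 / count("the")=2
```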

### 4. Continuous Bag of Words Model (CBOW)

Predict the center word from the surrounding context. We maximize the likelihood over the corpus:

$\prod_{c=1}^{n}P(w^{(c)}|w^{(c-m)},...,w^{(c-1)},w^{(c+1)},...,w^{(c+m)})$

negative log likelihood:

$J(\theta)= -\sum_{c=1}^{n}logP(w^{(c)}|w^{(c-m)},...,w^{(c-1)},w^{(c+1)},...,w^{(c+m)})$

If instead each context word independently predicts the center word (a naive Bayes style assumption), the average negative log likelihood over a corpus of $T$ positions is:

$J(\theta) = -\dfrac{1}{T}\sum_{c=1}^{T}\sum_{-m\le j\le m,\, j\neq 0}\log P(w_c|w_{c+j})$

How do we represent this probability?

For a single training example:

\begin{align} \text{minimize } J &= -\log P(w_c|w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m})\\ &= -\log P(u_c|\hat v)\tag{1}\\ &= -\log \dfrac{\exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}\exp(u_j^T\hat v)}\tag{2}\\ &= -u_c^T\hat v + \log\sum_{j=1}^{|V|}\exp(u_j^T\hat v) \end{align}

The key step is from (1) to (2): the softmax turns the dot-product scores between word vectors into a probability distribution:

$P(u_c|\hat v) = \dfrac{\exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}\exp(u_j^T\hat v)}\tag{*}$
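The softmax itself is straightforward to implement; a small sketch (the score values below are made up):

```python
import numpy as np

def softmax(u):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(u - u.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # u_j^T v_hat for each vocabulary word j
probs = softmax(scores)
print(probs)        # a valid probability distribution over the vocabulary
print(probs.sum())  # sums to 1
```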

#### 4.1 We can use a simple neural network to train these matrix weights

input: $x^{(c)}\in R^{|V|\times 1}$, the one-hot input vectors of the context words

labels: $y^{(c)}\in R^{|V|\times 1}$, the one-hot vector of the known center word

parameters:

• $w_i$: word i from vocabulary V
• $V \in R^{n\times |V|}$ input word matrix
• $v_i$:i-th column of $V$, the input vector representation of word $w_i$
• $U\in R^{|V|\times n}$: output word matrix
• $u_i$: i-th row of $U$, the output vector representation of word $w_i$
• n is an arbitrary size which defines the size of our embedding space

Note that the figure's notation differs slightly from the above: its $W_1\in R^{n\times |V|}$ plays the role of $V$, and its $W_2\in R^{|V|\times n}$ plays the role of $U$.

• input : $x_1.shape = (|V|, 1)$, $x_2.shape = (|V|, 1)$,...,$x_{2m}.shape = (|V|, 1)$

• $W_1$ : input matrix V, $W_1.shape = (n, |V|)$ each column is the representation of $w_i$

• hidden layer: $\hat v = \dfrac{V.dot(x_1)+...+V.dot(x_{2m})}{2m}$, $\hat v.shape = (n,1)$

• $W_2$ : output matrix U, $W_2.shape = (|V|, n)$ each row is the representation of $w_i$

• score: $u = W_2.dot(\hat v)$, $u.shape = (|V|, 1)$

• output: $\hat y = softmax(u)$, $\hat y.shape=(|V|,1)$

• cross entropy: $H(\hat y, y) = -\sum_{j=1}^{|V|}y_j\log(\hat y_j)$

Because $y$ is a one-hot vector with a 1 at index $i$ (the true center word), the sum collapses to a single term: $H(\hat y, y) = -y_i\log(\hat y_i) = -\log(\hat y_i)$.
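The whole forward pass above can be sketched with random matrices (the sizes $|V|=10$, $n=4$, $m=2$ and the word indices are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n, m = 10, 4, 2               # |V|, embedding dim, context window

Vmat = rng.normal(size=(n, V_size))   # input matrix V, columns are v_i
Umat = rng.normal(size=(V_size, n))   # output matrix U, rows are u_i

def one_hot(i):
    x = np.zeros(V_size); x[i] = 1.0
    return x

context_ids = [1, 2, 4, 5]            # the 2m context word indices
center_id = 3                         # the known center word

# hidden layer: average of the context words' input vectors
v_hat = sum(Vmat @ one_hot(i) for i in context_ids) / (2 * m)  # shape (n,)
u = Umat @ v_hat                                               # scores, shape (|V|,)
y_hat = np.exp(u - u.max()); y_hat /= y_hat.sum()              # softmax

# cross entropy against the one-hot label reduces to -log y_hat[center]
loss = -np.log(y_hat[center_id])
print(v_hat.shape, u.shape, y_hat.shape, loss)
```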

See the paper *word2vec Parameter Learning Explained* for a careful and thorough derivation; note that its notation differs from the above.

### 5. Skip-gram

\begin{align} \text{minimize } J &=-\log P(w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m}|w_c)\\ &=-\log\prod_{j=0,j\neq m}^{2m}P(w_{c-m+j}|w_c)\\ &=-\sum_{j=0,j\neq m}^{2m}\log P(w_{c-m+j}|w_c)\tag{3}\\ &=-\sum_{j=0,j\neq m}^{2m}\log\dfrac{\exp(u_{c-m+j}^Tv_c)}{\sum_{k=1}^{|V|}\exp(u_{k}^Tv_c)}\tag{4}\\ &=-\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^Tv_c+2m\log\sum_{k=1}^{|V|}\exp(u_{k}^Tv_c) \end{align}

The key step is again from (3) to (4): the softmax represents each conditional probability as $P(w_{c-m+j}|w_c) = \dfrac{\exp(u_{c-m+j}^Tv_c)}{\sum_{k=1}^{|V|}\exp(u_k^Tv_c)}\tag{*}$ where $V$ is the input matrix and $U$ is the output matrix.

#### 5.1 We can use a similar neural network to train the matrix weights

#### 5.2 Inference and backpropagation

Skip-gram treats each context word equally: the model computes the probability of each word appearing in the context independently of its distance to the center word.
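The skip-gram forward pass and loss for one training example can be sketched similarly (all sizes and indices here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n, m = 10, 4, 2

Vmat = rng.normal(size=(n, V_size))   # input matrix V, columns are v_i
Umat = rng.normal(size=(V_size, n))   # output matrix U, rows are u_i

center_id = 3
context_ids = [1, 2, 4, 5]            # the 2m context positions

v_c = Vmat[:, center_id]              # input vector of the center word
u = Umat @ v_c                        # one score per vocabulary word
y_hat = np.exp(u - u.max()); y_hat /= y_hat.sum()   # softmax distribution

# each context word is predicted independently from the SAME distribution,
# so the loss is a sum of 2m cross-entropy terms
loss = -sum(np.log(y_hat[j]) for j in context_ids)
print(loss)
```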

### 6. Optimizing Computational Efficiency

#### 6.2 Negative Sampling

loss function: $E = -\log\sigma(v'^T_{w_O}h)-\sum_{w_j\in W_{neg}}\log\sigma(-v'^T_{w_j}h)$

• in CBOW, $h=\dfrac{1}{C}\sum_{c=1}^C v_{w_c}$
• in skip-gram, $h=v_{w_I}$
• how to choose the K negative samples?
• As described in (Mikolov et al., 2013b), word2vec uses a unigram distribution raised to the 3/4th power for the best quality of results.
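A sketch of that sampling distribution on made-up counts — raising the unigram counts to the 3/4 power flattens the distribution, so rare words are sampled more often than their raw frequency would suggest:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([100.0, 10.0, 1.0, 1.0])  # toy unigram counts

# word2vec samples negatives from unigram^(3/4), renormalized.
p = counts ** 0.75
p /= p.sum()

K = 5                                        # number of negative samples
negatives = rng.choice(len(counts), size=K, p=p)
print(p)
print(negatives)
```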

The basic idea is to convert a multinomial classification problem (predicting a word out of the whole vocabulary) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution over the output word, a binary logistic regression (binary classification) is used.
For each training sample, the optimized classifier is fed a true pair (a center word and a word that appears in its context) and k randomly corrupted pairs (the center word paired with a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.
This is the important point: instead of predicting the next word (the "standard" training objective), the optimized classifier simply predicts whether a pair of words is good or bad.
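A sketch of this per-pair binary objective with made-up vectors ($\sigma$ is the logistic sigmoid; the loss matches $E$ above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 4
h = rng.normal(size=n)             # hidden vector (v_{w_I} in skip-gram)
u_pos = rng.normal(size=n)         # output vector of the true context word
u_negs = rng.normal(size=(5, n))   # output vectors of K=5 sampled negatives

# E = -log σ(u_pos·h) - Σ log σ(-u_neg·h): push the true pair's score up
# and the corrupted pairs' scores down — one binary decision per pair.
loss = -np.log(sigmoid(u_pos @ h)) - np.sum(np.log(sigmoid(-(u_negs @ h))))
print(loss)
```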

reference: