$-\dfrac{1}{N}\sum_{n=1}^N y_n \log(f(BAx_n))$

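• $N$ is the number of documents; during training this is presumably the batch size?
• $x_n$ is the normalized bag of features of the n-th document
• $y_n$ is the class label of the n-th document
• $A$ is the lookup table over n-grams. Perhaps loosely similar to the weights in attention?
• $B$ is the weight matrix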
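
As a rough illustration of the loss above in its negative log-likelihood form, the following pure-Python sketch evaluates it on made-up toy values. It assumes $f$ is the softmax and that each $y_n$ is one-hot, so only the log-probability of the true class contributes. The helper names (`softmax`, `matvec`, `fasttext_loss`) and all numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def fasttext_loss(A, B, xs, ys):
    """Average negative log-likelihood: -1/N * sum_n log f(B A x_n)[y_n].

    A:  hidden_dim x num_features lookup (embedding) matrix
    B:  num_classes x hidden_dim weight matrix
    xs: normalized bag-of-features vectors, one per document
    ys: integer class labels (the index where the one-hot y_n is 1)
    """
    total = 0.0
    for x, y in zip(xs, ys):
        hidden = matvec(A, x)               # weighted average of feature embeddings
        probs = softmax(matvec(B, hidden))  # assumed: f is the softmax
        total -= math.log(probs[y])
    return total / len(xs)

# Toy example: 3 features, 2 hidden units, 2 classes (all values made up).
A = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.1]]
B = [[0.2, -0.3],
     [-0.4, 0.1]]
xs = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
ys = [0, 1]
print(fasttext_loss(A, B, xs, ys))
```

With all-zero matrices every class gets probability 1/2, so the loss reduces to $\log 2$, which is a handy sanity check.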

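• The advantage of this model is its simplicity: both training and prediction are fast, several orders of magnitude faster than other deep learning models.
• The drawback is that the model is very simple, so its accuracy is lower.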

#### Abstract

Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. In other words, earlier models that give every word its own separate vector discard the word's morphology.

In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. That is, the paper builds on skipgram and represents each word as a bag of its character n-grams; the word vector is the sum of the vectors of these n-grams.

Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al., 2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach. The paper can be seen as an extension of word2vec, aimed mainly at languages with particularly complex morphology.

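word2vec made a major contribution to lexical modeling, but it depends on large amounts of text: if a word occurs only rarely, the vector learned for it is poor. To address this, the authors use subword information, which roughly means representing a word through the vectors of its parts. For example, unofficial is a low-frequency word without enough data to train a high-quality vector, but a decent vector can be learned from the two frequent pieces un and official.

Methodologically, the paper keeps word2vec's skip-gram model; the main difference lies in the features. word2vec uses the word as its basic unit, predicting the other words in the context from the center word. The subword model uses character n-grams as the unit, with n ranging from 3 to 6 in the paper. Each word can then be represented as a bag of character n-grams, and its embedding is the sum of the embeddings of all its n-grams. Training therefore changes from predicting the target word with the center word's embedding to predicting it with the center word's n-gram embeddings.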
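
To make the n-gram decomposition concrete, here is a minimal sketch of extracting character n-grams with n from 3 to 6. It assumes the word is first wrapped in the boundary markers `<` and `>`, a convention from the original paper that these notes do not spell out; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word` for n_min <= n <= n_max.

    Assumption: the word is wrapped in '<' and '>' so that prefixes and
    suffixes get n-grams distinct from the same letters inside a word.
    """
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

The original paper additionally adds the whole marked word (here `<where>`) as a special sequence, so that each word also has a vector of its own; that step is left out of this sketch.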

##### Morphological word representations

• [Andrei Alexandrescu and Katrin Kirchhoff. 2006. Factored neural language models. In Proc. NAACL] introduced factored neural language models.
• words are represented as sets of features.
• These features might include morphological information

• Schütze (1993) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations. This work is fairly close to the approach in this paper.

#### General Model

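Given a training corpus as a sequence of words $w_1, w_2, \ldots, w_T$, the skipgram objective is to maximize the log-likelihood $\sum_{t=1}^T\sum_{c\in C_t}\log p(w_c|w_t)$, where $C_t$ is the set of indices of the words surrounding $w_t$.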

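We are given a scoring function $s$ which maps (word, context) pairs to scores in $\mathbb{R}$. One possible choice for the probability of a context word is the softmax, where $W$ is the vocabulary size: $p(w_c|w_t)=\dfrac{e^{s(w_t,w_c)}}{\sum_{j=1}^We^{s(w_t,j)}}$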

The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary.
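
The binary classification view can be sketched in a few lines. This assumes the binary logistic loss used in the original paper, $\ell(x)=\log(1+e^{-x})$, so that one (word, context) pair with sampled negatives $N_{t,c}$ contributes $\ell(s(w_t,w_c))+\sum_{n\in N_{t,c}}\ell(-s(w_t,n))$. The function names are hypothetical, and the scores are passed in as plain numbers.

```python
import math

def logistic_loss(x):
    """l(x) = log(1 + exp(-x)), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def negative_sampling_loss(pos_score, neg_scores):
    """Loss for one (word, context) pair plus its sampled negatives.

    pos_score:  s(w_t, w_c) for the true context word (positive example)
    neg_scores: s(w_t, n) for each sampled negative word n
    """
    return logistic_loss(pos_score) + sum(logistic_loss(-s) for s in neg_scores)

# A high positive score and low negative scores give a small loss.
print(negative_sampling_loss(5.0, [-5.0, -4.0]))
# The reverse situation is penalized heavily.
print(negative_sampling_loss(-5.0, [5.0, 4.0]))
```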

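$N_{t,c}$ is a set of negative examples sampled from the vocabulary. How are the negatives chosen? Each word is given a weight equal to its frequency (its occurrence count) raised to the 3/4 power, and the probability of selecting a word is its weight divided by the sum of all word weights: $p(w_i)=\dfrac{f(w_i)^{3/4}}{\sum_{j=1}^W f(w_j)^{3/4}}$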
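
A small sketch of this sampling scheme, using only the standard library. The toy counts, the function names, and the option to exclude the true context word are all assumptions made for illustration.

```python
import random

def negative_sampling_probs(counts, power=0.75):
    """Map each word to f(w)^power / sum_j f(w_j)^power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(probs, k, rng, exclude=()):
    """Draw k negatives (with replacement), skipping words in `exclude`."""
    words = [w for w in probs if w not in exclude]
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

counts = {"the": 16, "cat": 1, "sat": 1}   # made-up occurrence counts
probs = negative_sampling_probs(counts)
print(probs)

rng = random.Random(0)                      # fixed seed for repeatability
print(sample_negatives(probs, 5, rng, exclude={"cat"}))
```

Raising counts to the 3/4 power flattens the distribution: here "the" is 16 times more frequent than "cat" but only 8 times more likely to be drawn.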

Then the score can be computed as the scalar product between word and context vectors: $s(w_t,w_c) = u_{w_t}^Tv_{w_c}$
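
Combining this scalar-product score with the softmax defined earlier gives the full-vocabulary probability $p(w_c|w_t)$. The sketch below uses toy two-dimensional vectors; the names `dot` and `context_probs` and all values are hypothetical.

```python
import math

def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def context_probs(u_t, V):
    """Softmax p(w_c | w_t) over every context vector in V.

    u_t: input vector of the center word w_t
    V:   list of context vectors, one per vocabulary word
    """
    scores = [dot(u_t, v) for v in V]      # s(w_t, j) for each word j
    m = max(scores)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

u_t = [1.0, 0.0]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # scores are 1, 0, -1
print(context_probs(u_t, V))
```

The normalizing sum runs over the whole vocabulary, which is the cost that the negative-sampling formulation avoids.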

#### Subword model

By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take into account this information. A single distinct vector per word ignores the word's internal structure, that is, the characters it is made of.

$z_g$ denotes the vector representation of n-gram $g$, and $G_w$ the set of n-grams appearing in word $w$. The scoring function then becomes: $s(w,c)=\sum_{g\in G_w}z_g^Tv_c$
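
The subword scoring function can be sketched as follows. Several things here are assumptions for illustration only: n-gram vectors live in a plain dictionary `z`, n-grams missing from it are skipped, and words are wrapped in `<` and `>` before extraction. Real implementations typically hash n-grams into a fixed number of buckets instead.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Set G_w of character n-grams of the boundary-marked word."""
    marked = "<" + word + ">"
    return {marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)}

def subword_score(word, v_c, z, n_min=3, n_max=6):
    """s(w, c) = sum over g in G_w of z_g . v_c.

    word: the center word w
    v_c:  context vector of word c
    z:    dict mapping n-gram string -> vector (hypothetical storage)
    """
    total = 0.0
    for g in sorted(char_ngrams(word, n_min, n_max)):
        if g in z:                       # assumption: unseen n-grams are skipped
            total += sum(a * b for a, b in zip(z[g], v_c))
    return total

z = {"<ca": [1.0, 0.0], "cat": [0.0, 2.0], "at>": [1.0, 1.0]}  # toy vectors
v_c = [1.0, 1.0]
print(subword_score("cat", v_c, z))
# 5.0
```

Because the dot product is linear, this equals the dot product of $v_c$ with the summed n-gram vectors, which is why the word embedding can be taken to be that sum.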

### Points to note

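• In the code implementation, the vector representation of a sentence is the average of its unigram vectors; to get better results, bigrams, trigrams, etc. can be added.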

• tf.train.exponential_decay

• tf.nn.nce_loss
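
The first point, averaging unigram vectors and optionally adding higher-order n-grams, might look like the sketch below. The implementation these notes refer to is not shown, so everything here is an assumption: the embedding table is a plain dictionary keyed by space-joined n-grams, unknown features are ignored, and the names `word_ngrams` and `sentence_vector` are hypothetical.

```python
def word_ngrams(tokens, max_n=2):
    """All word n-grams up to length max_n, joined with spaces."""
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

def sentence_vector(tokens, emb, dim, max_n=1):
    """Average the embeddings of all known n-gram features of a sentence."""
    feats = [f for f in word_ngrams(tokens, max_n) if f in emb]
    if not feats:
        return [0.0] * dim               # no known features: zero vector
    vec = [0.0] * dim
    for f in feats:
        for k, val in enumerate(emb[f]):
            vec[k] += val
    return [x / len(feats) for x in vec]

# Toy embedding table with one bigram entry (values made up).
emb = {"not": [1.0, 0.0], "good": [0.0, 1.0], "not good": [-2.0, -2.0]}
tokens = ["not", "good"]
print(sentence_vector(tokens, emb, dim=2, max_n=1))   # unigrams only
# [0.5, 0.5]
print(sentence_vector(tokens, emb, dim=2, max_n=2))   # unigrams + bigrams
```

With unigrams alone, "not good" and "good not" get the same vector; the bigram feature is what lets the representation capture some word order.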