# Paper Notes: Explicit Semantic Analysis

paper:

## Motivation

Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts.

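### Traditional approaches

1. Bag-of-words model: treats a text as an unordered bag of words, with each word acting as one feature dimension. This does not address two major problems in NLP: polysemy (one word with several meanings) and synonymy (several words with the same meaning).
2. Latent Semantic Analysis (LSA)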

LSA is a purely statistical technique, which leverages word co-occurrence information from a large unlabeled corpus of text. LSA does not use any explicit human-organized knowledge; rather, it “learns” its representation by applying Singular Value Decomposition (SVD) to the words-by-documents co-occurrence matrix. LSA is essentially a dimensionality reduction technique that identifies a number of most prominent dimensions in the data, which are assumed to correspond to “latent concepts”. Meanings of words and documents are then represented in the space defined by these concepts.

In short, LSA uses no human-organized knowledge. It applies SVD to the words-by-documents co-occurrence matrix, keeps the most prominent dimensions as "latent concepts", and represents the meaning of words and documents in the space those concepts define.
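The LSA idea can be illustrated with a minimal sketch. The toy count matrix, the word list, and the choice of `k = 2` latent dimensions are assumptions made up for this example; they are not taken from the paper.

```python
import numpy as np

# Toy words-by-documents count matrix (rows: words, columns: documents).
# "car" and "automobile" never occur in the same document,
# but both co-occur with "engine".
words = ["car", "automobile", "engine", "bank", "money"]
X = np.array([
    [2, 0, 0, 0],   # car
    [0, 2, 0, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # bank
    [0, 0, 0, 1],   # money
], dtype=float)

# SVD: X = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions kept (assumed for this toy example)
word_vecs = U[:, :k] * s[:k]      # words in the latent space
doc_vecs = Vt[:k, :].T * s[:k]    # documents in the latent space

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_raw = cosine(X[0], X[1])                      # bag-of-words view
sim_latent = cosine(word_vecs[0], word_vecs[1])   # LSA view
print(f"raw: {sim_raw:.2f}, latent: {sim_latent:.2f}")
```

In the raw bag-of-words rows the two synonyms share no document, so their similarity is zero. After truncation they fall on the same latent dimension. That dimension is found statistically and has no human-readable label, which is the contrast with ESA's explicit concepts.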

3. Lexical databases, such as WordNet.

However, lexical resources offer little information about the different word senses, thus making word sense disambiguation nearly impossible to achieve. Another drawback of such approaches is that creation of lexical resources requires lexicographic expertise as well as a lot of time and effort, and consequently such resources cover only a small fragment of the language lexicon. Specifically, such resources contain few proper names, neologisms, slang, and domain-specific technical terms. Furthermore, these resources have strong lexical orientation in that they predominantly contain information about individual words, but little world knowledge in general.

### Definition of a concept

Observe that an encyclopedia consists of a large collection of articles, each of which provides a comprehensive exposition focused on a single topic. Thus, we view an encyclopedia as a collection of concepts (corresponding to articles), each accompanied with a large body of text (the article contents).

example:

Ben Bernanke, Federal Reserve, Chairman of the Federal Reserve, Alan Greenspan (Bernanke’s predecessor), Monetarism (an economic theory of money supply and central banking), inflation and deflation.

ESA represents a text as a weighted combination of all Wikipedia concepts. For readability, only the most relevant concepts are listed here.

## ESA (Explicit Semantic Analysis)

ESA has two main components:

1. the set of basic concepts

2. the algorithm that maps text fragments into interpretation vectors
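A rough sketch of how these two components fit together is given below. It assumes that each word already has a vector of concept weights and that a text's interpretation vector is the count-weighted sum of its words' vectors. The concept names, the `WORD_WEIGHTS` values, and the helper names `interpret` and `top_concepts` are all hypothetical, and raw word counts stand in for whatever text-side weighting a real system would use.

```python
from collections import Counter

# Hypothetical concepts and word-to-concept weights, made up for illustration.
CONCEPTS = ["Federal Reserve", "Monetarism", "Jaguar"]
WORD_WEIGHTS = {
    "bernanke": [0.9, 0.3, 0.0],
    "inflation": [0.5, 0.7, 0.0],
    "jaguar": [0.0, 0.0, 0.8],
}

def interpret(text):
    """Map a text to its interpretation vector: one weight per concept."""
    vector = [0.0] * len(CONCEPTS)
    for word, count in Counter(text.lower().split()).items():
        weights = WORD_WEIGHTS.get(word)
        if weights is None:
            continue  # words unknown to the index contribute nothing
        for j, w in enumerate(weights):
            vector[j] += count * w
    return vector

def top_concepts(text, k=2):
    """Return the k highest-weighted concepts for a text."""
    vector = interpret(text)
    ranked = sorted(range(len(CONCEPTS)), key=lambda j: vector[j], reverse=True)
    return [CONCEPTS[j] for j in ranked[:k]]

print(top_concepts("Bernanke inflation"))
```

Every text gets a weight for every concept. Listing only the top-ranked ones is a display choice.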

## How to build the concept set

1. Using Wikipedia as a repository of basic concepts

2. Building a semantic interpreter

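Each entry of the matrix $T$ is the TF-IDF weight of word $t_i$ in document (concept) $d_j$:

$$T[i,j]=tf(t_i, d_j)\cdot \log\dfrac{n}{df_i}$$

TF is the term frequency: how often word $t_i$ occurs in document $d_j$, on a log scale.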

$$tf(t_i, d_j)=\begin{cases} 1 + \log \text{count}(t_i, d_j), & \text{if } \text{count}(t_i, d_j) > 0 \\ 0, & \text{otherwise} \end{cases}$$

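IDF is the inverse document frequency. The more documents a word appears in, the lower its IDF should be (for example, the preposition "to"). Conversely, a word that appears in only a few documents should have a high IDF.

$$IDF=\log\dfrac{n}{df_i}$$

$df_i=|\{d_k:t_i\in d_k\}|$ is the number of documents that contain the word $t_i$, and $n$ is the total number of documents.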

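$$T[i,j]\leftarrow \dfrac{T[i,j]}{\sqrt{\sum_{l=1}^r T[l,j]^2}}$$

$r$ is the total number of words. In other words, each entry is divided by the L2 norm of its document column, that is, the square root of the sum of the squared weights of all words in document $d_j$.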
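The weighting and normalization formulas above can be sketched in plain Python. The three toy "articles" and the function name `tfidf_matrix` are assumptions for illustration. The sketch stores $T$ as a nested dictionary `T[term][doc]`, not as a dense matrix.

```python
import math

# Toy corpus: each "document" stands in for one Wikipedia article (concept).
docs = {
    "Federal Reserve": "bank money bank rates".split(),
    "Inflation": "money prices rates".split(),
    "River": "bank water".split(),
}

def tfidf_matrix(docs):
    """Return {term: {doc: weight}} using log-scaled TF, IDF, and per-document L2 normalization."""
    n = len(docs)
    terms = sorted({t for words in docs.values() for t in words})
    # df_i: number of documents that contain term t_i
    df = {t: sum(t in words for words in docs.values()) for t in terms}
    T = {t: {} for t in terms}
    for name, words in docs.items():
        column = {}
        for t in terms:
            count = words.count(t)
            tf = 1 + math.log(count) if count > 0 else 0.0
            column[t] = tf * math.log(n / df[t])
        # Divide the whole document column by its L2 norm.
        norm = math.sqrt(sum(w * w for w in column.values()))
        for t in terms:
            T[t][name] = column[t] / norm if norm > 0 else 0.0
    return T

T = tfidf_matrix(docs)
for term, row in T.items():
    print(term, {doc: round(w, 3) for doc, w in row.items()})
```

After normalization, the squared weights within each document column sum to 1. A word that occurred in every document would get an IDF of $\log 1 = 0$ and so would drop out entirely.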