Text Classification Series 0: Learning NLTK and Feature Engineering

Computing with Language: Simple Statistics

```python
from nltk.book import *
```
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Find the word monstrous in text1 (Moby Dick), along with its surrounding context:

```python
text1.concordance("monstrous", width=40, lines=10)
```
Displaying 10 of 11 matches:
 was of a most monstrous size . ... Thi
 Touching that monstrous bulk of the wh
enish array of monstrous clubs and spea
 wondered what monstrous cannibal and s
e flood ; most monstrous and most mount
Moby Dick as a monstrous fable , or sti
PTER 55 Of the Monstrous Pictures of Wh
exion with the monstrous pictures of wh
ose still more monstrous stories of the
ed out of this monstrous cabinet there

Find words in text1 that occur in the same contexts as monstrous. For example, monstrous appears in contexts like "the __ pictures" and "the __ size"; `similar` returns other words from text1 that fill the same slots. I was curious how this is implemented, so here is the docstring of `Text.similar`:

```python
def similar(self, word, num=20):
    """
    Distributional similarity: find other words which appear in the
    same contexts as the specified word; list most similar words first.
    """
```

```python
text1.similar("monstrous")
```
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

Find contexts shared by two or more words, e.g. monstrous and very:

```python
text2.common_contexts(["monstrous", "very"])
```
a_pretty am_glad a_lucky is_pretty be_glad

A dispersion plot automatically locates particular words in a text and shows where each occurrence falls. text4 is the Inaugural Address Corpus:

```python
if __name__ == "__main__":
    text4.dispersion_plot(["citizens", "liberty", "freedom"])
```
<matplotlib.figure.Figure at 0x7f3794818588>

Without the `if __name__ == "__main__":` guard, the call raised:

```
AttributeError: 'NoneType' object has no attribute 'show'
```

See this answer for details: https://stackoverflow.com/questions/36810604/nonetype-object-has-no-attribute-show


```python
fdist1 = FreqDist(text1)
vocabulary1 = list(fdist1.keys())  # keys() returns a view of the keys; wrap it in list()
print(vocabulary1[:10])
```

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.']

The `list()` call is required; without it, Python 3 raises:

```
TypeError: 'dict_keys' object is not subscriptable
```

As a Stack Overflow answer puts it: `dict.keys()` returns an iterable but not indexable object, and the simplest (though not the most efficient) solution is to wrap it in `list()`.


```python
# Likewise, list() is needed here: in Python 3, fdist1.items() returns
# a dict_items view, which is iterable but not indexable
print(type(fdist1.items()))
print(list(fdist1.items())[:10])
```

<class 'dict_items'>
[('[', 3), ('Moby', 84), ('Dick', 84), ('by', 1137), ('Herman', 1), ('Melville', 1), ('1851', 3), (']', 1), ('ETYMOLOGY', 1), ('.', 6862)]

```python
# fdist1.items() yields (word, count) pairs such as ('[', 3), ('Moby', 84), ...
# To sort a dict by value, sort the items on the second element of each pair
# (a pattern worth remembering).
fdist_sorted = sorted(fdist1.items(), key=lambda item: item[1], reverse=True)
print(fdist_sorted[:10])
```
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)]
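
Incidentally, `FreqDist` already provides this sorted view through its built-in `most_common()` method, so the following is equivalent to `fdist_sorted[:10]` above (ordering of ties aside):

```python
print(fdist1.most_common(10))
```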

```python
# Sorting by key instead of by value:
fdist_sorted2 = sorted(fdist1.keys(), reverse=True)
print(fdist_sorted2[:10])
```
['zoology', 'zones', 'zoned', 'zone', 'zodiac', 'zig', 'zephyr', 'zeal', 'zay', 'zag']

```python
fdist1.plot(20, cumulative=True)
```
[Figure: cumulative frequency plot of the 20 most frequent tokens in text1]

As the plot shows, the most frequent words are mostly uninformative stopwords.
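
A common remedy is to filter stopwords and punctuation out before counting. Here is a minimal sketch using NLTK's English stopword list (the `content_fdist` name is mine; this assumes the stopwords corpus has been downloaded via `nltk.download('stopwords')`):

```python
from nltk import FreqDist
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
# Keep only alphabetic, non-stopword tokens.
content_fdist = FreqDist(w.lower() for w in text1
                         if w.isalpha() and w.lower() not in stop)
print(content_fdist.most_common(10))
```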

```python
# Low-frequency words: fdist1.hapaxes() returns the words that occur exactly once
print(len(fdist1.hapaxes()))
for w in fdist1.hapaxes():
    if fdist1[w] != 1:  # use !=, not 'is not': 'is' compares identity, not value
        print("hh")
```
9002

There are many low-frequency words as well, and most of them are not very useful either.

Collocations

```python
list(bigrams(['more', 'is', 'sad', 'than', 'done']))
```
[('more', 'is'), ('is', 'sad'), ('sad', 'than'), ('than', 'done')]

```python
text4.collocations(window_size=4)
```
United States; fellow citizens; four years; years ago; men women;
Federal Government; General Government; self government; Vice
President; American people; every citizen; within limits; Old World;
Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; one
another; Declaration Independence; protect defend

text4 is the Inaugural Address Corpus, and these collocations capture its character well, which suggests that n-grams make good features.
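
As a quick illustration of turning n-grams into classifier features (the `bigram_features` helper below is hypothetical, not an NLTK API):

```python
from nltk import bigrams

def bigram_features(tokens):
    # Mark the presence of each bigram with a boolean feature.
    return {'bigram(%s %s)' % (w1, w2): True for w1, w2 in bigrams(tokens)}

print(bigram_features(['fellow', 'citizens', 'of', 'the', 'United', 'States']))
```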

The source of `collocations()`:

```python
def collocations(self, num=20, window_size=2):
    """
    Print collocations derived from the text, ignoring stopwords.

    :seealso: find_collocations
    :param num: The maximum number of collocations to print.
    :type num: int
    :param window_size: The number of tokens spanned by a collocation (default=2)
    :type window_size: int
    """
    if not ('_collocations' in self.__dict__ and self._num == num and self._window_size == window_size):
        self._num = num
        self._window_size = window_size

        #print("Building collocations list")
        from nltk.corpus import stopwords
        ignored_words = stopwords.words('english')
        finder = BigramCollocationFinder.from_words(self.tokens, window_size)
        finder.apply_freq_filter(2)
        finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
        bigram_measures = BigramAssocMeasures()
        self._collocations = finder.nbest(bigram_measures.likelihood_ratio, num)
    colloc_strings = [w1 + ' ' + w2 for w1, w2 in self._collocations]
    print(tokenwrap(colloc_strings, separator="; "))
```
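
So `collocations()` is a thin wrapper around the `nltk.collocations` module. To get the collocations back as data instead of printed text, the finder can be used directly; a sketch along the lines of the source above:

```python
from nltk.book import text4
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords

ignored_words = set(stopwords.words('english'))
finder = BigramCollocationFinder.from_words(text4.tokens, window_size=4)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than twice
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
bigram_measures = BigramAssocMeasures()
# Top 10 bigrams ranked by log-likelihood ratio
print(finder.nbest(bigram_measures.likelihood_ratio, 10))
```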

Understanding natural language automatically

  • Word sense disambiguation (for more on this, see the earlier notes: chapter 12, syntactic parsing)
  • Anaphora resolution
  • Question answering
  • Machine translation
  • Spoken dialogue systems

Accessing Text Corpora and Lexical Resources

The Brown Corpus

```python
from nltk.corpus import brown
```

It contains texts in the following categories:

```python
print(brown.categories())
```
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

```python
import nltk

news_text = brown.words(categories="news")
fdist_news = nltk.FreqDist([w.lower() for w in news_text])
print(len(fdist_news))
```
13112

Annotated Text Corpora

These corpora come with annotations: part-of-speech tags, named entities, syntactic structure, semantic roles, and so on.

Categorizing and Tagging Words

```python
text = nltk.word_tokenize("and now for something completely differences!")
print(text)
print(nltk.pos_tag(text))
```
['and', 'now', 'for', 'something', 'completely', 'differences', '!']
[('and', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('differences', 'VBZ'), ('!', '.')]

Part-of-speech tagging

The tagger NLTK uses is described in "A Good Part-of-Speech Tagger in about 200 Lines of Python". POS tagging can disambiguate homographs, and many text-to-speech systems rely on it, since the same spelling may be pronounced differently depending on its part of speech.

```python
# The first sentence contains a typo ('tpo'); compare its tags with the corrected second sentence.
text1 = nltk.word_tokenize("They refuse to permit us tpo obtain the refuse permit")
print(nltk.pos_tag(text1))
text2 = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(text2))
```
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('tpo', 'VB'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Reading tagged corpora

```python
print(nltk.corpus.brown.tagged_words())
```
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

```python
print(nltk.corpus.treebank.tagged_words())
print(nltk.corpus.treebank.tagged_sents()[0])
```
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

The most common parts of speech in the news category of the Brown Corpus:

```python
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.keys()
```
dict_keys(['DET', 'NOUN', 'ADJ', 'VERB', 'ADP', '.', 'ADV', 'CONJ', 'PRT', 'PRON', 'NUM', 'X'])
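
Note that `keys()` only lists the distinct tags; to rank them by frequency, use the distribution's built-in `most_common()` method:

```python
print(tag_fd.most_common(5))
```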

Text Classification

Naive Bayes classification

Pick a feature: use the last letter of a name as the feature. The dictionary that is returned is called a feature set.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('Shrek')
```
{'last_letter': 'k'}

Load the names corpus and label the data

```python
from nltk.corpus import names
import random

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
```

```python
print(nltk.corpus.names.words('male.txt')[:10])
print(names[:10])
```
['Aamir', 'Aaron', 'Abbey', 'Abbie', 'Abbot', 'Abbott', 'Abby', 'Abdel', 'Abdul', 'Abdulkarim']
[('Aamir', 'male'), ('Aaron', 'male'), ('Abbey', 'male'), ('Abbie', 'male'), ('Abbot', 'male'), ('Abbott', 'male'), ('Abby', 'male'), ('Abdel', 'male'), ('Abdul', 'male'), ('Abdulkarim', 'male')]

Run the feature extractor over the names data, and split the result into a training set and a test set.

```python
# Binary classification: pair each name's feature set with its label
features = [(gender_features(n), g) for (n, g) in names]
train_set, test_set = features[500:], features[:500]
print(train_set[:10])
```
[({'last_letter': 'n'}, 'male'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'b'}, 'male'), ({'last_letter': 'b'}, 'male'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'y'}, 'male'), ({'last_letter': 'y'}, 'male'), ({'last_letter': 't'}, 'male'), ({'last_letter': 'e'}, 'male')]

```python
classifier = nltk.NaiveBayesClassifier.train(train_set)
```

```python
# Classify a name that was never seen during training
classifier.classify(gender_features('Pan'))
```
'male'

```python
# Accuracy on the test set. Note that `names` was never shuffled, so the
# first 500 examples (the test set) are all male names, which skews this number.
print(nltk.classify.accuracy(classifier, test_set))
```
0.602

```python
classifier.show_most_informative_features(5)
```
Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0

Building a separate list holding the feature sets of every instance can consume a large amount of memory, so it is better to construct the feature sets lazily, which is what `nltk.classify.apply_features` does.

Define a feature extractor with multiple features

```python
# A feature extractor with several features per name
from nltk.classify import apply_features

def gender_features2(word):
    features = {}
    features['firstletter'] = word[0].lower()
    features['lastletter'] = word[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = word.lower().count(letter)
    return features
```

```python
print(gender_features2('xiepan'))
print(len(gender_features2('xiepan')))  # 28 features: 2 + 26
```
{'firstletter': 'x', 'lastletter': 'n', 'count(a)': 1, 'count(b)': 0, 'count(c)': 0, 'count(d)': 0, 'count(e)': 1, 'count(f)': 0, 'count(g)': 0, 'count(h)': 0, 'count(i)': 1, 'count(j)': 0, 'count(k)': 0, 'count(l)': 0, 'count(m)': 0, 'count(n)': 1, 'count(o)': 0, 'count(p)': 1, 'count(q)': 0, 'count(r)': 0, 'count(s)': 0, 'count(t)': 0, 'count(u)': 0, 'count(v)': 0, 'count(w)': 0, 'count(x)': 1, 'count(y)': 0, 'count(z)': 0}
28

```python
# Extract features for every example
features = [(gender_features(n), g) for (n, g) in names]
print(len(features))
```
7944

```python
# Training, development, and test sets.
# NB: train_set is built from `features`, i.e. with gender_features (one feature),
# while dev_set and test_set use gender_features2 (28 features with different names).
train_set = features[1500:]
dev_set = apply_features(gender_features2, names[500:1500])
test_set = apply_features(gender_features2, names[:500])
```

```python
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, dev_set))
print(nltk.classify.accuracy(classifier, test_set))
# The near-zero dev/test accuracy is not overfitting: the classifier was trained
# on gender_features but evaluated on gender_features2 feature sets, so the
# feature names never match (and the unshuffled split compounds the problem).
print(nltk.classify.accuracy(classifier, train_set))
```
0.007
0.008
0.883302296710118
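
For an apples-to-apples evaluation, all three splits should be built with the same extractor. A sketch of the corrected setup (output not shown):

```python
# Use gender_features2 consistently for training and evaluation
train_set2 = apply_features(gender_features2, names[1500:])
classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
print(nltk.classify.accuracy(classifier2, dev_set))
print(nltk.classify.accuracy(classifier2, test_set))
```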

Document Classification

```python
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
```

```python
movie_reviews.categories()
```
['neg', 'pos']

```python
neg_docu = movie_reviews.fileids('neg')
print(len(neg_docu))                    # number of documents in the 'neg' category
print(len(documents))                   # total number of documents
len(movie_reviews.words(neg_docu[0]))   # number of words in the first 'neg' file
```
1000
2000
879

```python
random.shuffle(documents)
```

A feature extractor for document classification

A feature extractor simply re-represents a document's raw content in terms of manually chosen features; the classifier then learns the mapping between those features and the class labels.
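
For example, the NLTK book takes the 2,000 most frequent words in the whole corpus as the feature vocabulary and marks, for each document, whether it contains each of them. A minimal sketch of that extractor, building on the `documents` list above:

```python
# Feature vocabulary: the 2,000 most frequent words across all reviews
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership is much faster than a list scan
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
```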

What makes a feature a good one? That is precisely the question feature engineering tries to answer.

Text Classification: An Overview

Text classification, as the name suggests, assigns texts to categories based on their content, and is usually a supervised learning task. Depending on text length, the unit being classified may be a sentence, a paragraph, or a whole article; different lengths lead to slight differences in the features that can be extracted, but the core of text classification is always the same: extract key features that capture what is distinctive about a text, and learn the mapping from those features to the categories. Feature engineering is therefore critical: good features can improve classification substantially (assuming, of course, that the labeled data is adequate in quality and quantity; the data sets the lower bound on performance, and feature engineering sets the upper bound).

One might ask: given that recent deep learning techniques can spare us from hand-crafting features, why is feature engineering still necessary? Deep learning is not a silver bullet, and its gains in NLP have been comparatively limited (language is high-level, abstract information, while deep learning is better suited to low-level, concrete signals such as images and speech, where hand-crafting features is laborious). This is not to deny what deep learning has achieved in NLP; rather, the common industrial practice is to use a deep learning model as one sub-module of the system (effectively one more feature) alongside features from traditional statistical NLP techniques and features designed specifically for the task at hand, feeding them all into one or more models (an ensemble) that together form the text-processing system.

Feature Engineering

So which features are commonly used in industry for text classification? The figure below gives an overview:

I group these features into four levels. From bottom to top the features go from abstract to concrete, and their granularity from fine to coarse; the aim is to design features from different angles and dimensions so as to capture the relationship between features and categories. The sections that follow describe the features commonly used at each of the four levels.