Text Classification for Sentiment Analysis: Stopwords and Collocations

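I'm 大力信封, a blogger at 靠谱客. This post covers text classification for sentiment analysis, specifically stopwords and collocations. I'm sharing it here as a reference.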

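Improving feature extraction can often have a significant positive impact on classifier accuracy (and on precision and recall). In this article, I'll evaluate two modifications of the word_feats feature extraction method:

filter out stopwords
include bigram collocations

To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns a feature dictionary. As before, we'll use these features to train a Naive Bayes classifier.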

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    # featx maps the words of one file to a feature dictionary
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # first 3/4 of each class for training, the rest for testing
    # (integer division so the cutoffs can be used as slice indices)
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)   # gold label -> test item indices
    testsets = collections.defaultdict(set)  # predicted label -> test item indices

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()
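
To make the refsets/testsets bookkeeping concrete, here is a minimal pure-Python sketch of set-based precision and recall. The two helper functions and the six-document toy data are my own illustration (assumptions, not NLTK's implementation or real results), but they follow the same idea: compare the set of item indices that truly belong to a label with the set the classifier assigned to it.

```python
def precision(reference, test):
    """Share of items the classifier put in a class that truly belong to it."""
    if not test:
        return None  # undefined when the classifier never predicted this class
    return len(reference & test) / len(test)

def recall(reference, test):
    """Share of items truly in a class that the classifier found."""
    if not reference:
        return None  # undefined when the class has no true members
    return len(reference & test) / len(reference)

# Hypothetical toy data: indices of six test documents.
refsets = {'pos': {0, 1, 2}, 'neg': {3, 4, 5}}   # gold labels
testsets = {'pos': {0, 1, 2, 3, 4}, 'neg': {5}}  # classifier predictions

pos_precision = precision(refsets['pos'], testsets['pos'])  # 3 of 5 predicted pos are right
pos_recall = recall(refsets['pos'], testsets['pos'])        # all 3 true pos were found
neg_precision = precision(refsets['neg'], testsets['neg'])  # the 1 predicted neg is right
neg_recall = recall(refsets['neg'], testsets['neg'])        # only 1 of 3 true neg was found

print(pos_precision, pos_recall, neg_precision, neg_recall)
```

A classifier that over-predicts pos, as in this toy data, shows high pos recall and neg precision but low pos precision and neg recall.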

Baseline Bag of Words Feature Extraction

Here is the baseline feature extractor for bag of words feature selection.

def word_feats(words):
    return dict([(word, True) for word in words])
 
evaluate_classifier(word_feats)

The results are the same as in the previous articles, but I've included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

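Stopword Filtering

Stopwords are words that are generally considered useless. Most search engines ignore them because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let's see what happens when we filter out these words.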

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
 
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])
 
evaluate_classifier(stopword_filtered_word_feats)

The results for a stopword-filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

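Accuracy went down by 0.2 percentage points, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.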
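
One plausible reason for the drop (my assumption, not something these numbers prove) is that stopword lists tend to contain negations such as "not", which carry sentiment. The sketch below uses a tiny hypothetical stopword set rather than NLTK's real list, and a made-up five-word review, to show which features disappear after filtering.

```python
# Hypothetical mini stopword set for illustration only; NLTK's list is much longer.
toy_stopset = {'the', 'is', 'not', 'a', 'was'}

def word_feats(words):
    """Baseline bag of words: every distinct word becomes a True feature."""
    return {word: True for word in words}

def stopword_filtered_word_feats(words, stopset=toy_stopset):
    """Same as word_feats, but words found in stopset are skipped."""
    return {word: True for word in words if word not in stopset}

review = ['the', 'movie', 'was', 'not', 'great']  # made-up example review

full = word_feats(review)
filtered = stopword_filtered_word_feats(review)
dropped = sorted(set(full) - set(filtered))

print(sorted(filtered))  # ['great', 'movie']
print(dropped)           # ['not', 'the', 'was']
```

After filtering, this negative review is left with only "movie" and "great", so the negation is no longer visible to the classifier.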

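Bigram Collocations

As mentioned at the end of the article on precision and recall, including bigrams may improve classification accuracy. The hypothesis is that people say things like "not great", a negative expression that the bag of words model could interpret as positive because it sees "great" as a separate word.

To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains two internal FreqDists, one for individual word frequencies and another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of two words, basically whether the bigram occurs about as frequently as each individual word.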
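
To illustrate the scoring idea, here is a hand-rolled sketch of chi-square scoring for adjacent word pairs, built from a 2x2 contingency table per bigram. The function name chi_sq_bigrams and the toy sentence are hypothetical, and the counting conventions (total = number of bigrams, marginals taken by position) are my own simplification, so the exact scores may differ from what NLTK's BigramAssocMeasures.chi_sq produces.

```python
from collections import Counter

def chi_sq_bigrams(words):
    """Score each adjacent word pair with Pearson's chi-square on a 2x2 table."""
    pairs = list(zip(words, words[1:]))
    n = len(pairs)                                   # total number of bigrams
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)    # how often a word starts a bigram
    second_counts = Counter(w2 for _, w2 in pairs)   # how often a word ends a bigram
    scores = {}
    for (w1, w2), n_ii in pair_counts.items():
        n_io = first_counts[w1] - n_ii    # w1 followed by some other word
        n_oi = second_counts[w2] - n_ii   # w2 preceded by some other word
        n_oo = n - n_ii - n_io - n_oi     # bigrams involving neither position
        denom = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
        if denom == 0:
            scores[(w1, w2)] = 0.0        # degenerate table, no evidence either way
        else:
            scores[(w1, w2)] = n * (n_ii * n_oo - n_io * n_oi) ** 2 / denom
    return scores

# Made-up toy text in which "not great" always occurs together.
words = 'the plot is not great but the cast is not great either'.split()
scores = chi_sq_bigrams(words)

print(scores[('not', 'great')])  # 11.0
print(scores[('the', 'plot')])   # lower, since "the" also precedes "cast"
```

A pair whose two words only ever appear together gets the maximum score, while a pair containing a word that also shows up in other contexts scores lower. The extractor below relies on NLTK's own scoring functions rather than this sketch.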

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
 
def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
 
evaluate_classifier(bigram_word_feats)

After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

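Yes, you read that right: Matt Damon is apparently one of the best predictors of positive sentiment in movie reviews. Oddities aside, the results are worth it:

accuracy is up by almost 9 percentage points
pos precision has increased by over 10 points, with only a 4-point drop in recall
neg recall has increased by over 21 points, with just under a 4-point drop in precision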
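
The changes listed above can be checked with a few lines of arithmetic. This sketch simply subtracts the baseline numbers from the bigram numbers reported earlier in this article and expresses the differences in percentage points. The dictionary names are mine.

```python
# Metrics copied from the baseline and bigram result listings in this article.
baseline = {
    'accuracy': 0.728,
    'pos precision': 0.651595744681,
    'pos recall': 0.98,
    'neg precision': 0.959677419355,
    'neg recall': 0.476,
}
bigram = {
    'accuracy': 0.816,
    'pos precision': 0.753205128205,
    'pos recall': 0.94,
    'neg precision': 0.920212765957,
    'neg recall': 0.692,
}

# Difference in percentage points, rounded to one decimal place.
deltas = {name: round((bigram[name] - baseline[name]) * 100, 1) for name in baseline}

for name, delta in deltas.items():
    print(f'{name}: {delta:+.1f} points')
```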

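So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it is the significant bigrams that improve effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above the baseline. This supports the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to include only significant words.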
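
For comparison, an "all bigrams" extractor might look roughly like the sketch below. It is not the code used for the experiment described above: the name all_bigram_word_feats is hypothetical, and zip(words, words[1:]) stands in for nltk.util.bigrams so the example runs without NLTK.

```python
import itertools

def all_bigram_word_feats(words):
    """Bag of words plus every adjacent word pair, with no significance filtering."""
    words = list(words)               # allow any iterable, since we need slicing
    bigrams = zip(words, words[1:])   # stand-in for nltk.util.bigrams(words)
    return {ngram: True for ngram in itertools.chain(words, bigrams)}

feats = all_bigram_word_feats(['not', 'great', 'at', 'all'])  # made-up input
print(len(feats))                # 7: four words plus three bigrams
print(('not', 'great') in feats) # True
```

Unlike the nbest approach, this keeps every bigram, so the feature set grows with document length and includes many pairs that carry no particular signal.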

Original: http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/