使用nltk分析文本情感

316 阅读 0 评论 209 点赞

我是靠谱客的博主土豪汽车，这篇文章主要介绍使用nltk分析文本情感，现在分享给大家，希望可以做个参考。

情感分析是NLP最受欢迎的应用之一。情感分析是指确定一段给定的文本是积极还是消极的过程。下面的代码是借用其他博主的，但是我对代码的输入数据格式以及类型做了一个简单解析供大家参考。另外我发在nltk在处理中文时的切分统计不是很好，中文和英文文本的情感分析思路上是一致的，不同之处在于中文在分析前需要进行分词，然后才能用nltk处理(nltk 的处理粒度一般是词)，因此在切分中文的时候可以采用jieba分词切分。中文分词之后，文本就是一个由每个词组成的长数组：[word1, word2, word3…… wordn]。之后就可以使用nltk 里面的各种方法来处理这个文本了。比如用FreqDist 统计文本词频，用bigrams 把文本变成双词组的形式：[(word1, word2), (word2, word3), (word3, word4)……(wordn-1, wordn)]。再之后就可以用这些来计算文本词语的信息熵、互信息等。再之后可以用这些来选择机器学习的特征，构建分类器，对文本进行分类。分词这块可以采用jieba分词来解决，相关对中文情感分析的稍后实验之后附上。

直接贴代码：

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


# 分析句子的情感：情感分析是NLP最受欢迎的应用之一。情感分析是指确定一段给定的文本是积极还是消极的过程。
# 有一些场景中，我们还会将“中性“作为第三个选项。情感分析常用于发现人们对于一个特定主题的看法。


# 定义一个用于提取特征的函数
# 输入一段文本返回形如：{'It': True, 'movie': True, 'amazing': True, 'is': True, 'an': True}
# 返回类型是一个dict
def extract_features(word_list):
    return dict([(word, True) for word in word_list])


# 我们需要训练数据，这里将用NLTK提供的电影评论数据
if __name__ == '__main__':
    # 加载积极与消极评论
    positive_fileids = movie_reviews.fileids('pos')     # list类型 1000条数据 每一条是一个txt文件
    negative_fileids = movie_reviews.fileids('neg')
    # print(type(positive_fileids), len(negative_fileids))

    # 将这些评论数据分成积极评论和消极评论
    # movie_reviews.words(fileids=[f])表示每一个txt文本里面的内容，结果是单词的列表：['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
    # features_positive 结果为一个list
    # 结果形如：[({'shakesp: True, 'limit': True, 'mouth': True, ..., 'such': True, 'prophetic': True}, 'Positive'), ..., ({...}, 'Positive'), ...]
    features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 'Negative') for f in negative_fileids]

    # 分成训练数据集（80%）和测试数据集（20%）
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))  # 800
    threshold_negative = int(threshold_factor * len(features_negative))  # 800
    # 提取特征 800个积极文本800个消极文本构成训练集  200+200构成测试文本
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("n训练数据点的数量:", len(features_train))
    print("测试数据点的数量:", len(features_test))

    # 训练朴素贝叶斯分类器
    classifier = NaiveBayesClassifier.train(features_train)
    print("n分类器的准确性:", nltk.classify.util.accuracy(classifier, features_test))

    print("n十大信息最丰富的单词:")
    for item in classifier.most_informative_features()[:10]:
        print(item[0])

    # 输入一些简单的评论
    input_reviews = [
        "It is an amazing movie",
        "This is a dull movie. I would never recommend it to anyone.",
        "The cinematography is pretty great in this movie",
        "The direction was terrible and the story was all over the place"
    ]
    # 运行分类器，获得预测结果
    print("n预测:")
    for review in input_reviews:
        print("n评论:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        # 打印输出
        print("预测情绪:", pred_sentiment)
        print("可能性:", round(probdist.prob(pred_sentiment), 2))


'''
结果：
训练数据点的数量: 1600
测试数据点的数量: 400

分类器的准确性: 0.735

十大信息最丰富的单词:
outstanding
insulting
vulnerable
ludicrous
uninvolving
astounding
avoids
fascination
symbol
animators

预测:

评论: It is an amazing movie
预测情绪: Positive
可能性: 0.61

评论: This is a dull movie. I would never recommend it to anyone.
预测情绪: Negative
可能性: 0.77

评论: The cinematography is pretty great in this movie
预测情绪: Positive
可能性: 0.67

评论: The direction was terrible and the story was all over the place
预测情绪: Negative
可能性: 0.63

'''

参考转载：https://blog.csdn.net/qq_41251963/article/details/81702821