Emotion Classification Made Simple


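I am 无情信封, a blogger on 靠谱客. This article, which I collected during recent development work, gives a simple introduction to emotion classification. I found it useful and am sharing it here as a reference.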

Overview

1. Preface

Dataset used by the code: link: Baidu Cloud Drive, extraction code: 61h4

2. Code

import numpy as np
import pandas as pd

# Local path to ISEAR.csv on the author's machine; point this at your own copy of the dataset
path = r"E:贪心学院课件资料课件资料3.9[直播] 情感分析技术实战课程PPTLesson15-EmotionDetectionLesson15-EmotionDetectionEmotion-DetectorISEAR.csv"

data = pd.read_csv(path, header = None)
# data.head()


# data.shape    # (7516, 3)

label = data[0].values.tolist()
# label[:10]   # ['joy',
#  'fear',
#  'anger',
#  'sadness',
#  'disgust',
#  'shame',
#  'guilt',
#  'joy',
#  'fear',
#  'anger']

# data[1].values  # 1-D array
# array(['On days when I feel close to my partner and other friends.   \nWhen I feel at peace with myself and also experience a close  \ncontact with people whom I regard greatly.',
#        'Every time I imagine that someone I love or I could contact a  \nserious illness, even death.',
#        'When I had been obviously unjustly treated and had no possibility  \nof elucidating this.',
#        ...,
#        'I was at home and I heard a loud sound of spitting outside the  \ndoor.  I thought that one of my family members would step on the spit  \nand bring the germs in the house.',
#        'I did not do the homework that the teacher had asked us to do.  I  \nwas scolded immediately.',
#        'I had shouted at my younger brother and he was always afraid when  \nI called out loudly.'],
#       dtype=object)

text = data[1].values.tolist()  # 1-D list
# text[:2]   # first two entries of the text list:
# ['On days when I feel close to my partner and other friends.   \nWhen I feel at peace with myself and also experience a close  \ncontact with people whom I regard greatly.',
#  'Every time I imagine that someone I love or I could contact a  \nserious illness, even death.']

# Split into training and test sets
from sklearn.model_selection import train_test_split

xtrain,xtest, ytrain, ytest = train_test_split( text, label, test_size = 0.2, random_state = 42)
print(len(xtrain))  # 6012
print(len(xtest))   # 1504

## Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
Xtrain = tfidf.fit_transform(xtrain)
Xtest = tfidf.transform(xtest)   # do not use fit_transform() here
#print(type(Xtest))   # sparse matrix
#print(Xtest)     # xtest has only 1504 samples; (0, 7588) means that in sample 0 the term with vocabulary index 7588 has a tf-idf value of 0.311520
# Can other features be extracted as well?
# 1. Part-of-speech features
# 2. n-grams
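The n-gram idea mentioned in the comments can be tried with the same vectorizer. The following is a minimal sketch, assuming three toy sentences in place of the ISEAR texts; `ngram_range=(1, 2)` asks TfidfVectorizer for single words plus adjacent word pairs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences standing in for the ISEAR texts (an assumption, not the real data)
docs = [
    "I feel close to my friends",
    "I feel afraid of serious illness",
    "I was scolded by the teacher",
]

unigram = TfidfVectorizer()                    # default: single words only
uni_bi = TfidfVectorizer(ngram_range=(1, 2))   # single words plus adjacent word pairs

X_uni = unigram.fit_transform(docs)
X_uni_bi = uni_bi.fit_transform(docs)

# The bigram vocabulary is larger because every word pair becomes its own column
print(X_uni.shape, X_uni_bi.shape)
print("feel close" in uni_bi.vocabulary_)
```

Bigrams keep a little word-order information, at the cost of a wider and sparser feature matrix. Whether they help on this dataset would have to be checked with the same grid search.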


## Model selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # grid search with cross-validation

params = {"C": [0.01, 0.1, 0.5, 0.005, 1, 0.2, 2, 3, 4, 5, 6]}
lr = LogisticRegression()
grid = GridSearchCV(lr, param_grid=params, cv = 10, n_jobs=2)
grid = grid.fit(Xtrain, ytrain)
grid.best_params_

grid.score(Xtest, ytest)

from sklearn.metrics import confusion_matrix  # especially useful for multi-class problems
confusion_matrix(ytest, grid.predict(Xtest))
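A possible variation, not what the code above does: the vectorizer and the classifier can be wrapped in a sklearn Pipeline, so that during cross-validation the tf-idf vocabulary is rebuilt from the training part of each fold only. The sketch below assumes a tiny made-up dataset and a shortened C grid; the step names `tfidf` and `lr` are arbitrary choices.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Made-up sentences and labels (an assumption, not the ISEAR data)
texts = [
    "I was happy to see my friends", "A happy day with my family",
    "We had a joyful and happy party", "Happy news made me smile",
    "I was afraid of the dark", "The loud noise made me afraid",
    "I felt afraid during the storm", "Afraid of failing the exam",
]
labels = ["joy"] * 4 + ["fear"] * 4

# Raw text goes in; vectorization happens inside each cross-validation fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, param_grid={"lr__C": [0.1, 1, 10]}, cv=2)
search.fit(texts, labels)

print(search.best_params_)
pred = search.predict(["such a happy surprise"])
print(pred)
```

With this layout the fitted `search` object accepts raw strings directly, so no separate `transform` call is needed at prediction time.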


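3. Notes on the code

3.1 Feature extraction function

For TfidfVectorizer, see the post "CountVectorizer和TfidfVectorizer学习笔记" (study notes on CountVectorizer and TfidfVectorizer).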

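3.2 fit_transform and transform

Xtrain = tfidf.fit_transform(xtrain)
This line fits the tfidf model on xtrain, which builds the vocabulary (and the idf weights) from the xtrain samples. The fitted model is then applied to xtrain itself to produce the tf-idf values for xtrain.

Xtest = tfidf.transform(xtest)
This line applies the model fitted above to the test set, so the vocabulary used is still the one built from xtrain. If fit_transform were used here instead, the vocabulary would be rebuilt from xtest. That is incorrect: the columns of Xtest would no longer correspond to the columns of Xtrain that the classifier was trained on.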
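The difference is easy to see on a toy example. This is a minimal sketch, assuming two made-up training sentences and one made-up test sentence rather than the ISEAR data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences (an assumption, not the ISEAR data)
train_docs = ["I feel happy today", "I feel afraid of the dark"]
test_docs = ["today I feel afraid of spiders"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)  # builds the vocabulary from train_docs
X_test = tfidf.transform(test_docs)        # reuses that vocabulary

# Same number of columns, so a model trained on X_train can score X_test
print(X_train.shape, X_test.shape)

# "spiders" never appeared in training, so it is simply ignored
print("spiders" in tfidf.vocabulary_)

# Refitting on the test set builds a different vocabulary with different columns
wrong = TfidfVectorizer().fit_transform(test_docs)
print(wrong.shape[1], X_train.shape[1])
```

Words that appear only in the test set are dropped by `transform`, which is the intended behavior: the classifier has no weight for them anyway.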

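3.3 The C parameter of logistic regression

C controls the regularization penalty of logistic regression.
In sklearn, C multiplies the original loss (the data term) rather than the regularization term, as is more common elsewhere. For the L2 penalty with labels y_i in {-1, 1}, the objective can be written as:

$$\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^T w + c)\right) + 1\right)$$

This gives C a distinctive property:
the larger its value, the weaker the penalty. In other words, C is the inverse of the regularization strength.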
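One way to observe this effect is to compare the size of the learned coefficients for several values of C. The sketch below assumes a synthetic dataset from `make_classification` instead of the tf-idf features, and uses the L2 norm of `coef_` as a rough measure of how strongly the weights are shrunk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption, not the ISEAR features)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # A stronger penalty (smaller C) pulls the weight vector toward zero
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, norms[C])
```

The printed norms should grow as C grows, which matches the rule of thumb that a small C means heavy regularization and a large C means light regularization.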

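3.4 Confusion matrix

A confusion matrix describes the relationship between actual and predicted classes. Its diagonal shows which classes are predicted well and which are not; for the classes that are predicted poorly, we can extract more features.
Accuracy is a single number aggregated over all classes, so it cannot tell us which class is predicted best.
For example (rows are actual classes, columns are predicted classes):

            Cat   Dog   Rabbit
Cat           5     3      0
Dog           2     3      1
Rabbit        0     2     11

There are actually 8 cats, but only 5 are correctly predicted as Cat; 3 are predicted as Dog and 0 as Rabbit.
There are actually 6 dogs, but only 3 are correctly predicted as Dog; 2 are predicted as Cat and 1 as Rabbit.
There are actually 13 rabbits, but only 11 are correctly predicted as Rabbit; 0 are predicted as Cat and 2 as Dog.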
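The per-class picture can be computed directly from these counts. This is a small sketch that enters the Cat/Dog/Rabbit numbers above by hand (rows are actual classes, columns are predicted classes) and derives recall, precision, and overall accuracy with NumPy.

```python
import numpy as np

class_names = ["Cat", "Dog", "Rabbit"]

# Rows: actual class, columns: predicted class (counts from the example above)
cm = np.array([
    [5, 3, 0],
    [2, 3, 1],
    [0, 2, 11],
])

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)      # share of each actual class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()    # one number for all classes together

for name, r, p in zip(class_names, recall, precision):
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Here Rabbit has the highest recall (11 of 13) and Dog the lowest (3 of 6), a difference that the overall accuracy of 19 out of 27 hides.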