Overview
1. Data source
The data is already labeled; a detailed description is given in SMS Spam Collection v. 1, and it can be downloaded from GitHub (the file path used here appears in the code in section 4). Each row has two comma-separated columns: the first is the label, the second is the text.
sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])
sms.head(11)
Out[5]:
   label                                               text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam FreeMsg Hey there darling it's been 3 week's n...
6 ham Even my brother is not like to speak with me. ...
7 ham As per your request 'Melle Melle (Oru Minnamin...
8 spam WINNER!! As a valued network customer you have...
9 spam Had your mobile 11 months or more? U R entitle...
10 ham I'm gonna be home soon and i don't want to tal...
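For reference, the data lines in the raw file look roughly like this (an illustration based on the rows above; fields that contain commas are quoted):

ham,"Go until jurong point, crazy.. Available only ..."
spam,"Free entry in 2 a wkly comp to win FA Cup fina..."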
2. Data preparation
There are 5,574 rows in total; 500 are drawn at random as the test set and the rest are used for training, and a helper function was written for the split. Running it revealed a small problem: it marks slightly fewer than 500 rows, because the random indices it generates can repeat. Here n is the number of rows to draw and size is the number of rows in the whole dataset; a duplicate-free alternative is sketched after the function.
def randomSequence(n, size):
    result = [0 for i in range(size)]       # 0 = training row, 1 = test row
    for i in range(n):
        x = random.randrange(0, size-1, 1)  # indices may repeat, so fewer than n rows get marked
        result[x] = 1
    return result
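A duplicate-free version, as a sketch: random.sample draws n distinct indices, so exactly n rows are marked (note also that randrange(0, size-1) can never return the last index, since its upper bound is exclusive):

import random

def randomSequence(n, size):
    result = [0] * size
    for x in random.sample(range(size), n):  # n distinct indices, covering the whole range
        result[x] = 1
    return result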
3. Feature extraction
For text classification, the text must be converted into features before the algorithm is called. scikit-learn's CountVectorizer and TfidfTransformer classes together handle the feature extraction. The test set reuses the vocabulary built from the training set.
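A minimal sketch of the two-step pipeline on a made-up two-document corpus (the strings are illustrative, not from the dataset); the key point is that the test-side vectorizer reuses the training vocabulary so the feature columns line up:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["free prize call now now", "ok see you at lunch"]  # made-up examples
test_docs = ["free lunch call"]

cv = CountVectorizer()
counts_train = cv.fit_transform(train_docs)    # learns the vocabulary from the training text
tfidf = TfidfTransformer().fit(counts_train)   # learns IDF weights from the training counts
X_train = tfidf.transform(counts_train)

cv_test = CountVectorizer(vocabulary=cv.vocabulary_)  # reuse the training vocabulary
X_test = tfidf.transform(cv_test.fit_transform(test_docs))
print(X_train.shape, X_test.shape)  # column counts match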
4. Complete code
# -*- coding: utf-8 -*-
import random
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
# generate a random 0/1 sequence marking which rows go into the test set
def randomSequence(n, size):
    result = [0 for i in range(size)]
    for i in range(n):
        x = random.randrange(0, size-1, 1)  # repeats are possible, so slightly fewer than n rows are marked
        result[x] = 1
    return result

if __name__ == '__main__':
    # read the data
    filename = 'data/sms_spam.csv'
    sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])
    # split into training and test sets
    size = len(sms)
    sequence = randomSequence(500, size)
    sms_train_mask = [sequence[i]==0 for i in range(size)]
    sms_train = sms[sms_train_mask]
    sms_test_mask = [sequence[i]==1 for i in range(size)]
    sms_test = sms[sms_test_mask]
    # convert the text into TF-IDF vectors
    train_labels = sms_train['label'].values
    train_features = sms_train['text'].values
    count_v1 = CountVectorizer(stop_words='english', max_df=0.5, decode_error='ignore')
    counts_train = count_v1.fit_transform(train_features)
    #print(count_v1.get_feature_names())
    #repr(counts_train.shape)
    tfidftransformer = TfidfTransformer()
    tfidf_train = tfidftransformer.fit_transform(counts_train)
    test_labels = sms_test['label'].values
    test_features = sms_test['text'].values
    # reuse the training vocabulary so the feature columns match
    count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_, stop_words='english', max_df=0.5, decode_error='ignore')
    counts_test = count_v2.fit_transform(test_features)
    # apply the IDF weights learned from the training data (refitting on the test counts would skew them)
    tfidf_test = tfidftransformer.transform(counts_test)
    # train
    clf = MultinomialNB(alpha=0.01)
    clf.fit(tfidf_train, train_labels)
    # predict
    predict_result = clf.predict(tfidf_test)
    #print(predict_result)
    # accuracy
    correct = [test_labels[i]==predict_result[i] for i in range(len(predict_result))]
    r = len(predict_result)
    t = correct.count(True)
    f = correct.count(False)
    print(r, t, f, t/float(r))
The above uses the naive Bayes algorithm; other classifiers can be swapped in, as the sketch below shows.
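For example, a linear SVM only changes the classifier lines (a sketch with default parameters, not tuned; the results reported below are for the naive Bayes version):

from sklearn.svm import LinearSVC

clf = LinearSVC()                      # replaces MultinomialNB(alpha=0.01)
clf.fit(tfidf_train, train_labels)
predict_result = clf.predict(tfidf_test)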
Run results
runfile('E:/MyProject/_python/ScikitLearn/NaiveBayes.py', wdir='E:/MyProject/_python/ScikitLearn')
(476, 468, 8, 0.9831932773109243)
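The manual counting at the end of the script can also be replaced with scikit-learn's built-in metrics (a sketch reusing the variables above):

from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(test_labels, predict_result))
print(classification_report(test_labels, predict_result))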