Text Processing: Bayesian Spam Classification

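I am 落寞鸡, a blogger on 靠谱客. I collected this article during recent development work; it covers Bayesian spam classification as a text-processing task. I found it useful and am sharing it here for reference.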

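Overview

This article explains how to read text files with Python and turn each text into a word vector. The working example is a set of 50 emails (25 normal emails and 25 spam emails) that are classified with the naive Bayes algorithm.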

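The work breaks down into the following parts:
① read all the emails;
② build the vocabulary;
③ generate the word vector for each email (set-of-words model);
④ classify with the naive Bayes algorithm from sklearn;
⑤ generate a performance evaluation report.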

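1. Function overview

The helper functions we need are introduced first.

1.1 Building the vocabulary

Idea: build a vocabulary from the given texts. Every word that occurs is collected into a set, so each word appears in the vocabulary exactly once.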

def createVocabList(dataSet):
    vocabSet = set()  # start with an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

postingList = [['my', 'dog', 'dog', 'has']]
print(createVocabList(postingList))
>> ['has', 'my', 'dog']  (element order may differ, since sets are unordered)
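Because a set has no guaranteed iteration order, the column that a given word ends up in can change from one run to the next. If reproducible columns matter, one option is to sort the vocabulary. The following is only a sketch; the name `create_sorted_vocab` is hypothetical and not part of the original code.

```python
def create_sorted_vocab(documents):
    """Return the unique words across all documents in sorted order."""
    vocab = set()
    for document in documents:
        vocab.update(document)  # in-place union with this document's words
    return sorted(vocab)        # sorting makes the column order deterministic


print(create_sorted_vocab([['my', 'dog', 'dog', 'has']]))
# prints ['dog', 'has', 'my']
```

Sorting costs a little extra time when the vocabulary is built, but every later run then maps the same word to the same vector position.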

1.2 Converting all uppercase letters to lowercase and dropping words of two characters or fewer

import re

def textParse(bigString):    # input is one big string, output is a word list
    listOfTokens = re.split(r'\W+', bigString)  # split on runs of non-word characters
    # keep only tokens longer than 2 characters; the threshold 2 can be adjusted
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

s = 'i Love YYUU'
print(textParse(s))
>> ['love', 'yyuu']
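Splitting on non-word characters can leave empty strings at the ends of the list when the text starts or ends with punctuation; the length filter happens to discard them. An alternative that never produces empty tokens is to match the words directly. This is a minimal sketch, and `tokenize`, `TOKEN_PATTERN`, and `min_length` are hypothetical names introduced for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r'\w+')  # runs of letters, digits, and underscores


def tokenize(text, min_length=3):
    """Lowercase the text and keep tokens with at least min_length characters."""
    return [tok.lower() for tok in TOKEN_PATTERN.findall(text)
            if len(tok) >= min_length]


print(tokenize('i Love YYUU'))                  # prints ['love', 'yyuu']
print(tokenize('Hi, a-b 2 go', min_length=1))   # prints ['hi', 'a', 'b', '2', 'go']
```

With the default `min_length=3` this keeps the same tokens as `len(tok) > 2` in `textParse`.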

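1.3 Turning each text into a word vector

There are two ways to build a word vector. The first compares the words in the text against the vocabulary: a position is set to 1 if the corresponding vocabulary word occurs in the text and to 0 otherwise. This approach records only whether a word is present, not how often, and is called the set-of-words model. The second also counts the number of occurrences and is called the bag-of-words model.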

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # mark the word as present
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

vocabulary = ['wo', 'do', 'like', 'what', 'go']
text = ['do', 'go', 'what', 'do']
print(setOfWords2Vec(vocabulary, text))
>> [0, 1, 0, 1, 1]

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count every occurrence
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

vocabulary = ['wo', 'do', 'like', 'what', 'go']
text = ['do', 'go', 'what', 'do']
print(bagOfWords2Vec(vocabulary, text))
>> [0, 2, 0, 1, 1]
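Both functions call `vocabList.index(word)` for every word, which scans the list each time. For a larger vocabulary, a dictionary from word to column may be faster. The sketch below builds both vectors in one pass; `words_to_vectors` is a hypothetical helper, and unlike the functions above it skips out-of-vocabulary words silently.

```python
def words_to_vectors(vocab_list, words):
    """Return (set-of-words vector, bag-of-words vector) for one document."""
    index_of = {word: i for i, word in enumerate(vocab_list)}  # word -> column
    counts = [0] * len(vocab_list)
    for word in words:
        column = index_of.get(word)
        if column is not None:  # out-of-vocabulary words are ignored
            counts[column] += 1
    presence = [1 if c > 0 else 0 for c in counts]
    return presence, counts


vocabulary = ['wo', 'do', 'like', 'what', 'go']
text = ['do', 'go', 'what', 'do']
set_vec, bag_vec = words_to_vectors(vocabulary, text)
print(set_vec)  # prints [0, 1, 0, 1, 1]
print(bag_vec)  # prints [0, 2, 0, 1, 1]
```

The set-of-words vector is derived from the counts, which makes the relationship between the two models explicit: it is the bag-of-words vector clipped to 0 or 1.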

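2. Putting the functions together

The three functions above are combined into one module. Save it as textProcess.py, because the training script in section 3 imports it under that name. The code below is tailored to this example, but it can be adapted to other datasets with small changes.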

import re

def createVocabList(dataSet):  # build the vocabulary
    vocabSet = set()  # start with an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):  # build a word vector
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

def textParse(bigString):    # input is one big string, output is a word list
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def preProcessing():
    docList = []; classList = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)  # store the parsed text
        classList.append(1)  # store its label: 1 = spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        classList.append(0)  # 0 = normal email
    vocabList = createVocabList(docList)  # build the vocabulary
    data = []
    target = classList
    for docIndex in range(50):  # this example has 50 texts in total
        data.append(setOfWords2Vec(vocabList, docList[docIndex]))  # build the word vector
    return data, target  # return the word vectors and their labels
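The loop above hard-codes 25 files per class and 50 texts in total. For a dataset of a different size, the files could instead be discovered from the directory. The following is a sketch under assumptions: `load_labeled_texts` is a hypothetical helper, the `root/spam` and `root/ham` layout mirrors the paths used above, and decoding as UTF-8 while ignoring undecodable bytes is a guess about the data that may need adjusting.

```python
import os


def load_labeled_texts(root, labels=(('spam', 1), ('ham', 0))):
    """Read every .txt file under root/<folder> and return (texts, targets)."""
    texts, targets = [], []
    for folder, label in labels:
        folder_path = os.path.join(root, folder)
        for name in sorted(os.listdir(folder_path)):
            if not name.endswith('.txt'):
                continue  # skip anything that is not a text file
            path = os.path.join(folder_path, name)
            # Assumption: UTF-8 text; undecodable bytes are dropped.
            with open(path, encoding='utf-8', errors='ignore') as handle:
                texts.append(handle.read())
            targets.append(label)
    return texts, targets


# Example call, assuming the directory layout used in this article:
# texts, targets = load_labeled_texts('email')
```

Each returned text can then be passed through `textParse` and `setOfWords2Vec` exactly as in `preProcessing`.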

3. Training and prediction

import textProcess as tp
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation
from sklearn.metrics import classification_report

data, target = tp.preProcessing()

# hold out 25% of the emails for testing
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25)

mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pre = mnb.predict(X_test)
print(y_pre)  # predicted labels
print(y_test)  # actual labels
print('The accuracy of Naive Bayes Classifier is %s' % mnb.score(X_test, y_test))
print(classification_report(y_test, y_pre))
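To make the classification step less of a black box, here is a minimal pure-Python sketch of multinomial naive Bayes with Laplace smoothing. It is an illustration of the general technique, not sklearn's implementation; `train_nb`, `predict_nb`, and the toy data are all made up for this example.

```python
import math


def train_nb(vectors, labels, alpha=1.0):
    """Fit class log-priors and per-word log-likelihoods with Laplace smoothing."""
    n_features = len(vectors[0])
    model = {}
    for cls in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        word_totals = [sum(col) for col in zip(*rows)]  # per-word counts in this class
        denom = sum(word_totals) + alpha * n_features
        model[cls] = (
            math.log(len(rows) / len(labels)),                           # log P(class)
            [math.log((count + alpha) / denom) for count in word_totals],  # log P(word | class)
        )
    return model


def predict_nb(model, vector):
    """Return the class with the highest posterior log-score."""
    def score(cls):
        log_prior, log_likelihood = model[cls]
        return log_prior + sum(x * ll for x, ll in zip(vector, log_likelihood))
    return max(model, key=score)


# Toy data (made up): columns stand for ['free', 'money', 'meeting', 'report']
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = [1, 1, 0, 0]
model = train_nb(X, y)
print(predict_nb(model, [1, 1, 0, 0]))  # prints 1
print(predict_nb(model, [0, 0, 1, 1]))  # prints 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term `alpha` keeps a word that never appeared in a class from forcing that class's probability to zero.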
