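    This series records my thoughts and experiments while studying 《机器学习系统设计》 (Building Machine Learning Systems with Python, by Willi Richert). Using Python, the book walks through the whole process of solving a problem with machine learning, from data processing to feature engineering to model selection. The source code and datasets used in the book have been uploaded to my resources: http://download.csdn.net/detail/solomon1558/8971649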



1. Counting words


    01.   txt     This is a toy post about machine learning. Actually, it contains not much interesting stuff.

    02.   txt     Imaging databases provide storage capabilities.

    03.   txt     Most imaging databases safe images permanently.

    04.   txt     Imaging databases store data.

    05.   txt     Imaging databases store data. Imaging databases store data. Imaging databases store data.

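    In this toy dataset we want to find the post most similar to the short post "imaging databases". To turn raw text into feature data that a clustering algorithm can use, we first apply the bag-of-words approach: it counts how often each word occurs and produces one feature vector per post, and the similarity between posts is then measured on those vectors.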


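import os
from sklearn.feature_extraction.text import CountVectorizer
DIR = r"../data/toy"
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
vectorizer = CountVectorizer(min_df=1)  # no stop-word filtering yet; the 25 features listed below include words such as 'is' and 'it'
X_train = vectorizer.fit_transform(posts)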



num_samples, num_features = X_train.shape
print("#sample: %d, #feature: %d" % (num_samples, num_features))


            #sample: 5, #feature: 25

            [u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']


#a new post
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])


#------- calculate raw distances between new and old posts and record the shortest one -------
import sys
import scipy as sp
import scipy.linalg  # makes sp.linalg available on older SciPy versions

def dist_raw(v1, v2):
    # Euclidean distance between two sparse count vectors
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())

best_doc = None
best_dist = sys.maxsize  # sys.maxint does not exist in Python 3
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue  # skip a post that is identical to the query
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    print("=== Post %i with dist = %.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f" % (best_i, best_dist))

            === Post 0 with dist = 4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.

            === Post 1 with dist = 1.73: Imaging databases provide storage capabilities.

            === Post 2 with dist = 2.00: Most imaging databases safe images permanently.

            === Post 3 with dist = 1.41: Imaging databases store data.

            === Post 4 with dist = 5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.

            Best post is 3 with dist=1.41


#------- case study: why do Post 3 and Post 4 (files 04 and 05) get different distances? Their count vectors: -------

            [[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]

            [[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]
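
To see where these two rows come from, here is a minimal sketch that rebuilds the count vectors without reading any files. It assumes the five toy posts are typed inline in a list called `toy_posts` (a name introduced here only for illustration) and that scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the five toy posts typed inline instead of being read from DIR.
toy_posts = [
    "This is a toy post about machine learning. Actually, it contains not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. Imaging databases store data.",
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(toy_posts)

# Dense count vectors of Post 3 and Post 4 (0-based row indices).
row3 = X.getrow(3).toarray()
row4 = X.getrow(4).toarray()
print(row3)
print(row4)
```

Post 4 repeats the sentence of Post 3 three times, so every count is tripled. The raw Euclidean distance treats that as a large difference even though the content is the same.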

2. Text processing

2.1 Normalizing word count vectors


def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

            === Post 0 with dist = 1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.

            === Post 1 with dist = 0.86: Imaging databases provide storage capabilities.

            === Post 2 with dist = 0.92: Most imaging databases safe images permanently.

            === Post 3 with dist = 0.77: Imaging databases store data.

            === Post 4 with dist = 0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.

            Best post is 3 with dist=0.77
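
The effect of normalization can be checked by hand on the two count vectors from the case study. The sketch below is a dense NumPy rewrite, not the sparse code used above. It keeps only the four non-zero columns (data, databases, imaging, store), and the helper names `dist_raw_dense` and `dist_norm_dense` are made up for this illustration.

```python
import numpy as np

# Columns: data, databases, imaging, store (all other columns are zero).
post3 = np.array([1.0, 1.0, 1.0, 1.0])
post4 = np.array([3.0, 3.0, 3.0, 3.0])
new_post = np.array([0.0, 1.0, 1.0, 0.0])  # "imaging databases"

def dist_raw_dense(v1, v2):
    # Plain Euclidean distance between the count vectors.
    return np.linalg.norm(v1 - v2)

def dist_norm_dense(v1, v2):
    # Scale both vectors to unit length first, then take the Euclidean distance.
    return np.linalg.norm(v1 / np.linalg.norm(v1) - v2 / np.linalg.norm(v2))

print("raw:  %.2f %.2f" % (dist_raw_dense(post3, new_post), dist_raw_dense(post4, new_post)))
print("norm: %.2f %.2f" % (dist_norm_dense(post3, new_post), dist_norm_dense(post4, new_post)))
# raw:  1.41 5.10
# norm: 0.77 0.77
```

After scaling to unit length the repeated post collapses onto the same vector as the single-sentence post, which is why Post 3 and Post 4 now tie at 0.77.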


2.2 Removing stop words


vectorizer = CountVectorizer(min_df=1, stop_words='english')
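
To get a feel for what `stop_words='english'` removes, the following sketch compares the vocabulary with and without the filter. It again assumes the toy posts are typed inline as `toy_posts` (an illustrative name). The exact words removed depend on the stop-word list shipped with the installed scikit-learn version, so no output is shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the five toy posts typed inline instead of being read from DIR.
toy_posts = [
    "This is a toy post about machine learning. Actually, it contains not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. Imaging databases store data.",
]

plain = CountVectorizer(min_df=1).fit(toy_posts)
filtered = CountVectorizer(min_df=1, stop_words="english").fit(toy_posts)

# Words present without the filter but dropped once stop words are excluded.
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print("features without filter:", len(plain.vocabulary_))
print("features with filter:   ", len(filtered.vocabulary_))
print("removed:", removed)
```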




import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')


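For each post, the stemmed vectorizer performs the following steps:

(1)   Convert the raw post to lower case in the preprocessing step (done in the parent class);

(2)   Extract all individual words in the tokenization step;

(3)   Convert each word into its stemmed form.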
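
As a quick illustration of step (3), the sketch below runs a few words through the same NLTK Snowball stemmer on its own. The word list is chosen only for this example, and it assumes NLTK is installed.

```python
import nltk.stem

stemmer = nltk.stem.SnowballStemmer("english")

# Hypothetical sample words; related forms should end up with the same stem.
words = ["imaging", "images", "databases", "database", "buys", "buying"]
stems = {w: stemmer.stem(w) for w in words}

for word, stem in stems.items():
    print("%-10s -> %s" % (word, stem))
```

Because "imaging" and "images" are reduced to one stem, a post that mentions only "images" can now share a feature with the query "imaging databases".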

3. Computing TF-IDF




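import math
def tfidf(t, d, D):
    # tf: fraction of the tokens in document d that equal term t
    tf = float(d.count(t)) / sum(d.count(w) for w in set(d))
    # idf: log of (number of documents / number of documents containing t);
    # math.log is used because sp.log is deprecated in recent SciPy releases
    idf = math.log(float(len(D)) / (len([doc for doc in D if t in doc])))
    return tf * idf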
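
A tiny worked example makes the behaviour of this formula easier to see. The sketch below restates the function with `math.log` and a slightly shorter term-frequency expression (`len(doc)` equals the sum of all counts), and applies it to three made-up documents `a`, `abb` and `abc`, each represented as a list of tokens.

```python
import math

def tfidf(term, doc, corpus):
    # Term frequency: share of the document's tokens that equal `term`.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: log(total docs / docs containing `term`).
    idf = math.log(len(corpus) / sum(1 for d in corpus if term in d))
    return tf * idf

# Hypothetical toy corpus: each document is a list of tokens.
a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
D = [a, abb, abc]

print(round(tfidf("a", a, D), 4))    # 0.0
print(round(tfidf("b", abb, D), 4))  # 0.2703
print(round(tfidf("b", abc, D), 4))  # 0.1352
print(round(tfidf("c", abc, D), 4))  # 0.3662
```

The term "a" occurs in every document, so its idf is log(1) = 0 and it carries no weight. The term "c" occurs in only one document and therefore gets the highest score.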



import os
import sys
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer

DIR = r"../data/toy"
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
new_post = "imaging databases"

import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
#vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

from sklearn.feature_extraction.text import TfidfVectorizer

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)

num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))

new_post_vec = vectorizer.transform([new_post])
print(new_post_vec, type(new_post_vec))

def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())
def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

dist = dist_norm
best_dist = sys.maxsize
best_i = None

for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue  # skip a post that is identical to the query
    post_vec = X_train.getrow(i)
    d = dist(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f" % (best_i, best_dist))

4. Summary


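The text preprocessing pipeline built so far consists of the following steps:

 (1)   Tokenize the text;

 (2)   Throw away words that occur too often to help in finding relevant posts;

 (3)   Throw away words that occur so rarely that they are unlikely to appear in future posts;

 (4)   Count the remaining words;

 (5)   Compute TF-IDF values from the counts, taking the whole corpus into account.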
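
The steps above map fairly directly onto the parameters of scikit-learn's `TfidfVectorizer`. The sketch below is one possible configuration, not the setup used earlier in this post: the thresholds `min_df=2` and `max_df=0.9` are values picked purely for illustration, stemming is left out, and the toy posts are typed inline as `toy_posts`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: the five toy posts typed inline instead of being read from DIR.
toy_posts = [
    "This is a toy post about machine learning. Actually, it contains not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases safe images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. Imaging databases store data.",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # step (2): drop very common English words
    max_df=0.9,            # step (2): also drop words found in more than 90% of posts
    min_df=2,              # step (3): drop words found in fewer than 2 posts
)
# Steps (1), (4) and (5): tokenize, count, and weight by TF-IDF in one call.
X = vectorizer.fit_transform(toy_posts)

print(sorted(vectorizer.vocabulary_))
# ['data', 'databases', 'imaging', 'store']
print(X.shape)
# (5, 4)
```

On a corpus this small `min_df=2` is very aggressive and leaves Post 0 with an all-zero vector. On a real dataset the thresholds would need tuning.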


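The bag-of-words approach is simple and efficient, but it has some drawbacks:

 (1)   It does not capture relations between words. With the vectorization used above, the texts "Car hits wall" and "Wall hits car" get the same feature vector.

 (2)   It cannot capture negation. For example, "I will eat ice cream" and "I will not eat ice cream" mean opposite things, yet their feature vectors are very similar. This is easy to address: in addition to single words (unigrams), also count pairs of consecutive words (bigrams) or triples (trigrams, three words in a row).

 (3)   It fails on misspelled words.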
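
The first drawback, and the n-gram remedy mentioned under the second, can be reproduced with `CountVectorizer` and its `ngram_range` parameter. This is a minimal sketch on the two example sentences. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Car hits wall", "Wall hits car"]

# Unigrams only: word order is lost, so both texts get identical vectors.
unigram_vecs = CountVectorizer(ngram_range=(1, 1)).fit_transform(texts).toarray()

# Unigrams plus bigrams: "car hits" / "hits wall" differ from "wall hits" / "hits car".
bigram_vecs = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts).toarray()

same_with_unigrams = bool((unigram_vecs[0] == unigram_vecs[1]).all())
same_with_bigrams = bool((bigram_vecs[0] == bigram_vecs[1]).all())

print(same_with_unigrams)  # True
print(same_with_bigrams)   # False
```

The price is a larger vocabulary, since every distinct word pair becomes its own feature.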


