Word Frequency Counting with Python


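I'm 风中秋天, a blogger on 靠谱客. This post covers word frequency counting with Python; I'm sharing it here in the hope that it serves as a useful reference.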

Suppose you have a local txt file and want to count how often each word appears in it. You could write:

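import time

# Raw string, so the backslashes in the Windows path are not read as escapes
path = r'C:\Users\zhangxiaomei\Desktop\Walden.txt'
with open(path, 'r') as text:
    # Split the whole file on whitespace to get a list of words
    words = text.read().split()
print(words)
for word in words:
    time.sleep(3)  # wait 3 seconds before printing each line
    print('{}-{} times'.format(word, words.count(word)))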

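The basic idea is to separate the words of the text with split() and then count each one with count(). This approach has the following drawbacks, though:
1. Punctuation is counted too: a punctuation mark attached to a word is treated as part of that word.
2. Python string comparison is case-sensitive, so the same word is counted separately when its capitalization differs.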
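Both drawbacks are easy to reproduce on a short string. The snippet below is only a small illustration; the sample sentence and the name naive_counts are made up for this example and do not come from Walden.txt.

```python
# Hypothetical sample text, used only to illustrate the two drawbacks
sample = "The cat saw the dog. The dog ran."

# Same approach as above: split on whitespace, then count each token
words = sample.split()
naive_counts = {word: words.count(word) for word in set(words)}

# 'The' and 'the' are counted as different words (case sensitivity)
print(naive_counts['The'], naive_counts['the'])
# 'dog.' and 'dog' are counted as different words (attached punctuation)
print(naive_counts['dog.'], naive_counts['dog'])
```

Here "the" appears three times and "dog" twice in the sentence, yet neither total shows up under a single key.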
The code can be improved as follows:

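import string
import time

# Raw string, so the backslashes in the Windows path are not read as escapes
path = r'C:\Users\ZHANGSHUAILING\Desktop\Walden.txt'
with open(path, 'r') as text:
    # Split on whitespace, then strip leading/trailing punctuation and lowercase each word
    words = [raw_word.strip(string.punctuation).lower() for raw_word in text.read().split()]
# The set holds each distinct word exactly once
words_index = set(words)
# Map each distinct word to the number of times it occurs
counts_dict = {index: words.count(index) for index in words_index}
# Print the words from most frequent to least frequent
for word in sorted(counts_dict, key=lambda x: counts_dict[x], reverse=True):
    time.sleep(2)  # wait 2 seconds before printing each line
    print('{}--{} times'.format(word, counts_dict[word]))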

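The improvements are:
1. Import string. After splitting the text into words, use strip() to remove the punctuation characters (string.punctuation) from the start and end of each word, then convert the word to lowercase. Note that strip() only removes characters at the two ends, so punctuation inside a word, such as the apostrophe in "don't", is kept.
2. Convert the list to a set with set(), which removes duplicates automatically and yields an index of the distinct words.
3. Build a dictionary with each word as the key and its number of occurrences as the value.
4. Sort with sorted(), passing reverse=True to reverse the order, so the word frequencies are listed from highest to lowest.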
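As a possible alternative, the same four steps can be sketched with collections.Counter from the standard library, which counts in a single pass instead of calling count() once per distinct word. This is only a sketch under some assumptions: the helper name count_words and the sample sentence are invented for illustration, and the filter that drops empty strings (tokens made up entirely of punctuation, such as "--") is an addition that the code above does not have.

```python
import string
from collections import Counter


def count_words(text):
    """Return a Counter of lowercased words with edge punctuation stripped."""
    # Split on whitespace, strip punctuation from both ends, lowercase
    cleaned = (raw.strip(string.punctuation).lower() for raw in text.split())
    # Tokens that were pure punctuation become '' after strip(); skip them
    return Counter(word for word in cleaned if word)


# Hypothetical sample text standing in for the contents of the txt file
sample = "The cat saw the dog. The dog ran -- don't stop!"
counts = count_words(sample)

# most_common() returns (word, count) pairs from most to least frequent
for word, n in counts.most_common(3):
    print('{}--{} times'.format(word, n))
```

To use it on a real file, pass the result of text.read() to count_words() in place of the sample string.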