---
title: Counting word frequencies in a text with Python
date: 2018-6-30 15:12:43
categories: Python
tags:
- python
---
This post walks through several successive versions of code that counts the words in a text file, upgrading the code step by step.
It's a review for myself; perhaps later I'll be able to write something even closer to the mark.
1 The code I wrote right after finishing *Python Crash Course*

While working through Chapter 10 of *Python Crash Course*, which covers analyzing text, I wrote code to count how often a single word appears:
```python
# 10-10 Common words
def row_count(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        content = content.replace(',', ' ')
        content = content.replace('.', ' ')
        content = content.replace('-', ' ')
        content = content.strip().lower()
        words = content.split()
        # Count how many times the word "row" appears in the text
        number = words.count('row')
        print('row : %d' % number)

filename = 'Heart of Darkness.txt'
row_count(filename)
```
The output:

```
row : 9
```
This code only counts the occurrences of a single word.
It also has some problems: tokens like `[a` and `(b` keep their punctuation attached.
2 From one word to counting and ranking every word

After finishing the exercise, I wondered whether I could count every word and sort the results. After searching around online, I came up with the following:
```python
from operator import itemgetter

def words_list(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        content = content.replace(',', ' ')
        content = content.replace('.', ' ')
        content = content.replace('!', ' ')
        content = content.replace('-', ' ')
        content = content.replace('_', ' ')
        content = content.replace('(', ' ')
        content = content.replace(')', ' ')
        content = content.strip()
        words = [word.lower() for word in content.split()]
        return words

def count_results(filename):
    words_count = {}
    words = words_list(filename)
    words_count = words_count.fromkeys(words)
    for word in words_count.keys():
        number = words.count(word)
        words_count[word] = number
    words_count = sorted(words_count.items(), key=itemgetter(1), reverse=True)
    return words_count

if __name__ == '__main__':
    filename = 'Heart of Darkness.txt'
    words_count = count_results(filename)
    for word, word_count in words_count[:10]:
        print('{0:<10} : {1}'.format(word, word_count))
```
The output:

```
the        : 2440
of         : 1492
a          : 1205
and        : 1045
i          : 1039
to         : 967
was        : 671
in         : 668
he         : 563
had        : 503
```
There are still problems: some punctuation is not stripped completely, leaving tokens like `[a]` and `_a_`.
3 Splitting words with a regular expression

After learning regular expressions, I tried using one to strip the punctuation:
```python
import re
from operator import itemgetter

def words_list(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        WORD_RE = re.compile(r'\W+')
        words = WORD_RE.split(content.lower())
        return words

def count_results(filename):
    words_count = {}
    words = words_list(filename)
    words_count = words_count.fromkeys(words)
    for word in words_count.keys():
        number = words.count(word)
        words_count[word] = number
    words_count = sorted(words_count.items(), key=itemgetter(1), reverse=True)
    return words_count

if __name__ == '__main__':
    filename = 'Heart of Darkness.txt'
    words_count = count_results(filename)
    for word, word_count in words_count[:10]:
        print('{0:<10} : {1}'.format(word, word_count))
```
The output:

```
the        : 2468
of         : 1496
a          : 1209
i          : 1153
and        : 1062
to         : 974
in         : 673
was        : 672
he         : 596
it         : 515
```
There's still an issue: `\w` matches `[A-Za-z0-9_]`, so words containing `_` are counted too. For example, `_that_` gets counted, and counted separately from `that`.
The obvious fix is to replace the underscores (`_`) with spaces first, though you could also argue that `_that_` and `that` are two different words.
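A quick check (on a made-up snippet, not the novel itself) shows both the problem and the fix:

```python
import re

text = "_that_ and that"

# \w+ matches underscores, so _that_ survives as a "word" distinct from that
print(re.findall(r'\w+', text.lower()))

# Replacing underscores with spaces first merges the two forms
cleaned = text.replace('_', ' ')
print(re.findall(r'\w+', cleaned.lower()))
```

The first call returns `['_that_', 'and', 'that']`, the second `['that', 'and', 'that']`.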
4 A solution that doesn't need count at all

Today, while reading the chapter on dictionaries and sets in another book, *Fluent Python*, I thought of a different approach:
```python
import re
from operator import itemgetter

def count_results(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        WORD_RE = re.compile(r'\w+')
        words_count = {}
        for match in WORD_RE.finditer(content.lower()):
            word = match.group()
            # occurrences = words_count.get(word, [])
            # occurrences.append(word)
            # words_count[word] = occurrences
            words_count.setdefault(word, []).append(word)
        words_count = {word: len(value) for word, value in words_count.items()}
        words_count = sorted(words_count.items(), key=itemgetter(1), reverse=True)
        return words_count

if __name__ == '__main__':
    filename = 'Heart of Darkness.txt'
    words_count = count_results(filename)
    for word, word_count in words_count[:10]:
        print('{0:<10} : {1}'.format(word, word_count))
```
```
the        : 2468
of         : 1496
a          : 1209
i          : 1153
and        : 1062
to         : 974
in         : 673
was        : 672
he         : 596
it         : 515
```
This approach is quite a bit faster than the one using `count`.
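As an aside, *Fluent Python* also covers `collections.defaultdict`; here is a minimal sketch of my own (not from the book) of the same single-pass idea, counting integers directly instead of appending to lists:

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r'\w+')

def count_words_dd(text):
    """Single-pass word count using defaultdict(int) instead of setdefault."""
    counts = defaultdict(int)
    for match in WORD_RE.finditer(text.lower()):
        counts[match.group()] += 1
    return dict(counts)

print(count_words_dd("The heart of the darkness"))
# {'the': 2, 'heart': 1, 'of': 1, 'darkness': 1}
```

This avoids building a throwaway list per word, so it should use less memory than the `setdefault` version.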
5 Comparing the speed of the count and len approaches
```python
import re
from operator import itemgetter

def words_list(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        WORD_RE = re.compile(r'\W+')
        words = WORD_RE.split(content.lower())
        return words

def count_results(filename):
    words_count = {}
    words = words_list(filename)
    words_count = words_count.fromkeys(words)
    for word in words_count.keys():
        number = words.count(word)
        words_count[word] = number
    words_count = sorted(words_count.items(), key=itemgetter(1), reverse=True)
    return words_count

def count_words(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        WORD_RE = re.compile(r'\w+')
        words_count = {}
        for match in WORD_RE.finditer(content.lower()):
            word = match.group()
            words_count.setdefault(word, []).append(word)
        words_count = {word: len(value) for word, value in words_count.items()}
        words_count = sorted(words_count.items(), key=itemgetter(1), reverse=True)
        return words_count

if __name__ == '__main__':
    import timeit

    def test_count_results():
        filename = 'Heart of Darkness.txt'
        return count_results(filename)

    def test_count_words():
        filename = 'Heart of Darkness.txt'
        return count_words(filename)

    time_1 = timeit.Timer('test_count_results()',
                          setup="from __main__ import test_count_results")
    time_2 = timeit.Timer('test_count_words()',
                          setup="from __main__ import test_count_words")
    print(time_1.timeit(number=10))
    print(time_2.timeit(number=10))
```
The output:

```
54.662403376761674
0.5711055088342789
```
That turns out to be quite a large gap. Here is the reason as I understand it:

- `count` has to traverse the list, and traversing a list once per distinct word is expensive: each traversal is O(n);
- `len()`, by contrast, lets CPython read the stored size directly from memory, which is of course much faster.
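A small sketch (using a toy word list of my own, not the actual novel) of the difference between the two counting strategies:

```python
words = ['the', 'of', 'the', 'a', 'the', 'of']

# count-based approach: one full scan of the list per distinct word, O(n * k)
counts_slow = {w: words.count(w) for w in set(words)}

# single-pass approach: each word is touched exactly once, O(n)
counts_fast = {}
for w in words:
    counts_fast[w] = counts_fast.get(w, 0) + 1

print(counts_slow == counts_fast)  # True: same result, very different cost
```

On six words the difference is invisible, but on a novel with tens of thousands of words the repeated scans dominate the runtime.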
The following is excerpted from Chapter 1 of *Fluent Python*:
> **How special methods are used**
>
> The first thing to know is that special methods are meant to be called by the Python interpreter, not by you. That is, you don't write `my_object.__len__()`; you write `len(my_object)`, and if `my_object` is an instance of a user-defined class, Python itself calls the `__len__` method you implemented.
>
> For built-in types such as `list`, `str`, `bytearray`, and so on, however, CPython takes a shortcut: `__len__` actually returns the `ob_size` field of the `PyVarObject` C struct, which represents any variable-sized built-in object in memory. Reading that field directly is much faster than calling a method.
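A minimal illustration (with a made-up class of my own) of how `len()` dispatches to `__len__`:

```python
class WordBag:
    """A toy container; calling len() on it dispatches to our __len__."""
    def __init__(self, words):
        self._words = list(words)

    def __len__(self):
        return len(self._words)

bag = WordBag(['the', 'of', 'a'])
print(len(bag))         # 3: the interpreter calls bag.__len__() for us
print(len(['x', 'y']))  # 2: for a built-in list, CPython reads ob_size directly
```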
I don't know enough yet to be completely sure of this explanation.
There may well be further improvements later; to be continued, perhaps.
6 Using Counter (the best solution I've found so far)

Well, while reading *Fluent Python* today, I found what I now consider the best approach so far.
`collections.Counter`

This mapping type keeps an integer count for each key, incremented every time that key is updated. It can therefore be used to count hashable objects, or as a multiset, that is, a set whose elements may appear more than once.

`Counter` implements the `+` and `-` operators for combining tallies, along with useful methods such as `most_common([n])`, which returns the `n` most common keys and their counts in descending order; see the documentation (https://docs.python.org/3/library/collections.html#collections.Counter) for details.
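A quick demonstration of those features (examples of my own, adapted from the standard library docs):

```python
from collections import Counter

a = Counter('abracadabra')   # counts each letter: a=5, b=2, r=2, c=1, d=1
b = Counter('alakazam')      # a=4, l=1, k=1, z=1, m=1

print(a.most_common(2))  # [('a', 5), ('b', 2)]
print((a + b)['a'])      # 9: the + operator merges the tallies
print((a - b)['a'])      # 1: the - operator subtracts them
```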
Here is the word-counting code using the `Counter` class:
```python
import re
import collections

def words_list(filename):
    try:
        with open(filename) as f_obj:
            content = f_obj.read()
    except FileNotFoundError:
        msg = "The file " + filename + " does not exist."
        print(msg)
    else:
        WORD_RE = re.compile(r'\W+')
        words = WORD_RE.split(content.lower())
        return words

def words_counter(filename):
    words = words_list(filename)
    if words is not None:
        return collections.Counter(words)

if __name__ == '__main__':
    filename = 'Heart of Darkness.txt'
    words_count = words_counter(filename)
    # Note: displaying the top 10 now uses words_count.most_common(10)
    for word, word_count in words_count.most_common(10):
        print('{0:<20} : {1}'.format(word, word_count))
```
The output:

```
the                  : 2468
of                   : 1496
a                    : 1209
i                    : 1153
and                  : 1062
to                   : 974
in                   : 673
was                  : 672
he                   : 596
it                   : 515
```
It even skips the explicit `sorted()` call, so in theory it should be faster still. Let's time it:
```python
if __name__ == '__main__':
    import timeit

    def test_count_results():
        filename = 'Heart of Darkness.txt'
        return count_results(filename)

    def test_count_words():
        filename = 'Heart of Darkness.txt'
        return count_words(filename)

    def test_words_counter():
        filename = 'Heart of Darkness.txt'
        return words_counter(filename)

    time_1 = timeit.Timer('test_count_results()',
                          setup="from __main__ import test_count_results")
    time_2 = timeit.Timer('test_count_words()',
                          setup="from __main__ import test_count_words")
    time_3 = timeit.Timer('test_words_counter()',
                          setup="from __main__ import test_words_counter")
    print(time_1.timeit(number=10))
    print(time_2.timeit(number=10))
    print(time_3.timeit(number=10))
```
The output:

```
36.760553128615015
0.37998618675253937
0.2645591842058579
```
It is indeed faster.
Can this still be continued? We'll see.