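Overview
1. In PyCharm, put the documents whose word frequencies are to be counted in the same directory: hamlet.txt is a plain English text and threekingdoms.txt is a Chinese text.
2. CalHamletV1.py: word frequency for the plain English text
CalThreeKingdomsV1: first version of the Chinese word frequency
CalThreeKingdomsV2: improved version of the Chinese word frequency
3. Code
CalHamletV1.py: word frequency for the plain English text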
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\]^_‘{|}~':
        txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格 (replace special characters with spaces)
    return txt
hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1  # count each word
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
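SyntaxError: Non-ASCII character '\xe2' in file F:/Python_test/dictionary_test/CalHamletV1.py on line 6, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Cause: the source file contains Chinese (non-ASCII) characters but has no encoding declaration. Insert the following as the first line of the file:
#coding=utf-8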
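The counting steps in CalHamletV1.py (lowercase the text, replace punctuation with spaces, split, count, sort) can also be sketched with collections.Counter from the standard library. This is only a minimal sketch and assumes Python 3, unlike the Python 2 script above; the function name count_words and the sample sentence are hypothetical. Note that string.punctuation also contains the apostrophe and the backtick, so it is not exactly the same character set as in getText().

```python
import string
from collections import Counter


def count_words(text, top_n=10):
    """Return the top_n most frequent words in text as (word, count) pairs."""
    text = text.lower()
    # Replace every ASCII punctuation character with a space.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    # Counter does the counting; most_common sorts by count, descending.
    return Counter(text.split()).most_common(top_n)


# Hypothetical sample text standing in for the contents of hamlet.txt.
sample = "To be, or not to be: that is the question."
for word, count in count_words(sample, top_n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

For the real file, the text would be read from hamlet.txt and passed to count_words in place of sample.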
CalThreeKingdomsV1: first version of the Chinese word frequency
#coding=utf-8
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r").read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)  # sort by count, descending
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count)).encode('utf-8')
Problem:
File "F:/Python_test/dictionary_test/CalThreeKingdomsV1.py", line 16, in <module>
print ("{0:<10}{1:>5}".format(word, count)).encode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Solution:
Add the following at the beginning of the program:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
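The reload(sys) / sys.setdefaultencoding('utf-8') workaround is specific to Python 2; sys.setdefaultencoding is not available in Python 3. A hedged alternative is to decode the file explicitly when reading it. The sketch below assumes the text file is saved as UTF-8 (the source does not say so) and uses io.open, which accepts an encoding argument; read_text and the temporary demo file are hypothetical names used only for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole file and return its contents as decoded text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Demonstration on a temporary file, so the sketch runs without threekingdoms.txt.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"曹操 孔明")

demo_text = read_text(demo_path)
print(len(demo_text))
```

With the text decoded at read time, the result can be passed to jieba.lcut directly, without changing the interpreter's default encoding.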
F:newtestvenvvenvScriptspython.exe F:/Python_test/dictionary_test/CalThreeKingdomsV1.py
Building prefix dict from the default dictionary ...
Loading model from cache c:userszhangappdatalocaltempjieba.cache
Loading model cost 1.482 seconds.
Prefix dict has been built succesfully.
曹操 953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358
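Process finished with exit code 0
The script runs successfully, but the ranking is not by character name: aliases of the same person are counted separately and common words are mixed in. It is therefore modified further.
CalThreeKingdomsV2: improved version of the Chinese word frequency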
#coding=utf-8
#CalThreeKingdomsV2.py
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}  # frequent words that are not names
txt = open("threekingdoms.txt", "r").read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":  # merge aliases into one name
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    del counts[word]  # remove the excluded words from the result
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
Problem:
File "F:/Python_test/dictionary_test/CalThreeKingdomsV2.py", line 26, in <module>
del counts[word]
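KeyError: '\xe4\xba\x8c\xe4\xba\xba'
Process finished with exit code 1
Solution:
This is an encoding problem. The key in the KeyError is the UTF-8 byte sequence for "二人", one of the literals in excludes. The likely cause is that jieba.lcut returns unicode strings, so the keys of counts are unicode, while the literals in excludes are byte strings and are not found in the dictionary. Writing the literals as unicode (for example u"二人"), or deleting with counts.pop(word, None) so that a missing key is not an error, should avoid it.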
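The alias merging and the exclusion step can also be sketched with a lookup table and a deletion that tolerates missing keys. This is a minimal sketch assuming Python 3, where every string literal is unicode and the byte-string mismatch does not arise; it is not the CalThreeKingdomsV2.py script itself. The names ALIASES, EXCLUDES and count_names are hypothetical, and the short token list stands in for the output of jieba.lcut(txt).

```python
# -*- coding: utf-8 -*-
from collections import Counter

# Alias -> canonical name, mirroring the elif chain in CalThreeKingdomsV2.py.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_names(words):
    """Count tokens, merging aliases and skipping single-character tokens."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        counts[ALIASES.get(word, word)] += 1
    return counts


# Hypothetical token list standing in for jieba.lcut(txt).
tokens = ["曹操", "丞相", "却说", "玄德曰", "玄德", "孔明", "曰", "二人"]
counts = count_names(tokens)
# pop with a default does not raise KeyError when a word is absent,
# unlike `del counts[word]`.
for word in EXCLUDES:
    counts.pop(word, None)
for name, count in counts.most_common(3):
    print("{0:<10}{1:>5}".format(name, count))
```

Keeping the aliases in one table means a new alias is a one-line change, and counts.pop(word, None) keeps the script running even when an excluded word never occurs in the text.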
Notes:
https://blog.csdn.net/weixin_39221360/article/details/79525341
https://blog.csdn.net/qq_34739497/article/details/78001488