Word Frequency Counting and Related Problems


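This post, by the 靠谱客 blogger 鲤鱼香水, covers word frequency counting and the problems encountered along the way; it is shared here as a reference.

1. In PyCharm, put the documents to be counted in the same directory as the scripts. hamlet.txt is an English-only document and threekingdoms.txt is a Chinese document.

2. CalHamletV1.py: word frequency for the English-only document

CalThreeKingdomsV1: first-pass word frequency for the Chinese document

CalThreeKingdomsV2: improved word frequency for the Chinese document

3. Code

CalHamletV1.py: word frequency for the English-only document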

                
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\]^_‘{|}~':
        txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words  = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1   # increment this word's count
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)   # most frequent first
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

SyntaxError: Non-ASCII character '\xe2' in file F:/Python_test/dictionary_test/CalHamletV1.py on line 6, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

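Cause: the source file contains non-ASCII characters such as Chinese text. Here the byte 0xE2 is the first byte of the UTF-8 encoding of the curly quote ‘ in the punctuation string (probably meant to be a backtick). Python 2 assumes ASCII source unless told otherwise, so an encoding declaration must be inserted on the first or second line of the file: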

#coding=utf-8
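
For comparison, the same count can be written more compactly on Python 3, where source files are read as UTF-8 by default and the encoding declaration is not needed. This is a minimal sketch rather than the original script: the helper name `count_words` and the inline sample text are hypothetical, and it uses `string.punctuation` and `collections.Counter` from the standard library in place of the hand-written punctuation string and dictionary.

```python
import string
from collections import Counter


def count_words(text, top_n=10):
    """Return the top_n most frequent lowercase words in text."""
    # Map every ASCII punctuation character to a space, similar to the
    # str.replace loop in CalHamletV1.py.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(top_n)


if __name__ == "__main__":
    # Hypothetical sample text; in the real script this would be the
    # contents of hamlet.txt.
    sample = "To be, or not to be: that is the question."
    for word, count in count_words(sample, top_n=3):
        print("{0:<10}{1:>5}".format(word, count))
```

Note that `string.punctuation` also contains the apostrophe, which the original punctuation string does not, so contractions such as "don't" would be split into two tokens in this sketch.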

CalThreeKingdomsV1: first-pass word frequency for the Chinese document

#coding=utf-8
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r").read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count)).encode('utf-8')

Problem:

 File "F:/Python_test/dictionary_test/CalThreeKingdomsV1.py", line 16, in <module>
    print ("{0:<10}{1:>5}".format(word, count)).encode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Solution:

Add the following at the top of the program:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

F:newtestvenvvenvScriptspython.exe F:/Python_test/dictionary_test/CalThreeKingdomsV1.py
Building prefix dict from the default dictionary ...
Loading model from cache c:userszhangappdatalocaltempjieba.cache
Loading model cost 1.482 seconds.
Prefix dict has been built succesfully.
曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358

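Process finished with exit code 0

The script now runs, but the result is not a ranking of character names: ordinary words appear in the list and aliases of the same person are counted separately. The script is therefore revised further.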
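
The `reload(sys)` / `sys.setdefaultencoding` workaround exists only on Python 2. As a hedged alternative, the sketch below shows the read-and-format steps on Python 3, where `str` is Unicode and the file is decoded explicitly with `open(..., encoding="utf-8")`, so no default-encoding change is involved. The helper names `read_text` and `format_row` are hypothetical, and the sketch writes a small temporary file so that it does not depend on threekingdoms.txt or jieba.

```python
import os
import tempfile


def read_text(path):
    """Read a UTF-8 encoded file and return its contents as str."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def format_row(word, count):
    """Format one result row with the same layout as the original script."""
    return "{0:<10}{1:>5}".format(word, count)


if __name__ == "__main__":
    # Hypothetical stand-in for threekingdoms.txt.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("曹操 孔明 曹操")
        text = read_text(path)
        print(format_row("曹操", text.count("曹操")))
    finally:
        os.remove(path)
```

Passing the encoding explicitly also keeps the result independent of the platform's default file encoding.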

CalThreeKingdomsV2: improved word frequency for the Chinese document

#coding=utf-8
#CalThreeKingdomsV2.py
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", "r").read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

Problem:

File "F:/Python_test/dictionary_test/CalThreeKingdomsV2.py", line 26, in <module>
    del counts[word]
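KeyError: '\xe4\xba\x8c\xe4\xba\xba'

Process finished with exit code 1

Solution:

This is an encoding problem. The key in the KeyError, '\xe4\xba\x8c\xe4\xba\xba', is the UTF-8 byte string for 二人 taken from excludes. jieba.lcut returns unicode strings, so the keys stored in counts are unicode, and on Python 2 a non-ASCII byte string most likely does not match the corresponding unicode key in a dictionary lookup. Writing the literals as unicode (for example u"二人") makes the types match.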
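
One way to make the exclusion step more robust, sketched here under assumptions: keep every word as the same string type as the tokens, merge aliases through a lookup table, and drop excluded words with `dict.pop(word, None)` so that a word missing from the counts does not raise a KeyError. The function name `count_names`, the `ALIASES` and `EXCLUDES` constants, and the short token list are hypothetical; in the real script the tokens would come from `jieba.lcut(txt)`. The code targets Python 3, where all string literals are Unicode.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif chain in CalThreeKingdomsV2.py.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_names(tokens, top_n=10):
    """Count multi-character tokens, merging aliases and skipping excluded words."""
    counts = Counter()
    for word in tokens:
        if len(word) == 1:
            continue  # single characters are ignored, as in the original
        counts[ALIASES.get(word, word)] += 1
    for word in EXCLUDES:
        counts.pop(word, None)  # no KeyError if the word never occurred
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical tokens standing in for jieba.lcut(txt).
    tokens = ["曹操", "丞相", "孔明", "将军", "曰", "孟德", "诸葛亮", "玄德"]
    for word, count in count_names(tokens):
        print("{0:<10}{1:>5}".format(word, count))
```

The lookup table replaces the chain of `elif` comparisons, so adding another alias is a one-line change.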

Notes:

https://blog.csdn.net/weixin_39221360/article/details/79525341

https://blog.csdn.net/qq_34739497/article/details/78001488

Published: 2024-08-08 04:15:01
