Posted by 鲤鱼香水 on 靠谱客: notes on word-frequency counting in Python and the problems encountered along the way, shared here for reference.

Overview

1. In PyCharm, put the documents to be analysed in the same directory as the scripts. hamlet.txt is an English-only text and threekingdoms.txt is a Chinese text.

 

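2. CalHamletV1.py: word frequencies for the English text

   CalThreeKingdomsV1: first version of the Chinese word count

   CalThreeKingdomsV2: improved version of the Chinese word count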

 

3. Code

CalHamletV1.py: word frequencies for the English text

 
                
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\]^_‘{|}~':   # note: the quote before { is a curly quote (U+2018), not ASCII; a backtick was probably intended
        txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1   # count occurrences of each word
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # sort by count, highest first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

SyntaxError: Non-ASCII character '\xe2' in file F:/Python_test/dictionary_test/CalHamletV1.py on line 6, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

 

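Cause: the source file contains non-ASCII characters, and Python 2 assumes ASCII source unless an encoding is declared. Here 0xE2 is the first byte of the UTF-8 encoding of the curly quote ‘ in the punctuation string; Chinese text in comments or strings triggers the same error. Declare the encoding on the first or second line of the file:

#coding=utf-8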
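
For comparison, here is a minimal Python 3 sketch of the same count. Python 3 reads source files as UTF-8 by default, so no coding line is needed. The function name top_words and the use of string.punctuation and collections.Counter are choices made for this illustration, not part of the original script. Note that string.punctuation also contains the apostrophe and the backtick, which the list in CalHamletV1.py does not.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    text = text.lower()
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "To be, or not to be: that is the question."
    for word, count in top_words(sample, 3):
        print("{0:<10}{1:>5}".format(word, count))
```

To run it on the real file, something like top_words(open("hamlet.txt", encoding="utf-8").read()) should work, assuming the file is saved as UTF-8.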

CalThreeKingdomsV1: first version of the Chinese word count

#coding=utf-8
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r").read()
words = jieba.lcut(txt)   # segment the Chinese text into a list of words
counts = {}
for word in words:
    if len(word) == 1:   # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # sort by count, highest first
for i in range(15):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count)).encode('utf-8')

Problem:

 File "F:/Python_test/dictionary_test/CalThreeKingdomsV1.py", line 16, in <module>
    print ("{0:<10}{1:>5}".format(word, count)).encode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
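
The message itself can be reproduced in isolation. Under Python 2, calling format on a byte-string template with a unicode argument converts the result with the default ASCII codec; the Python 3 sketch below makes that conversion explicit. The helper name ascii_encode_error is hypothetical and exists only for this illustration.

```python
def ascii_encode_error(word):
    """Return the error message from encoding word as ASCII, or None on success."""
    try:
        word.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # A two-character Chinese word fails on characters 0 and 1.
    print(ascii_encode_error("曹操"))
    # Plain ASCII text encodes without error.
    print(ascii_encode_error("hamlet"))
```

The reported position range covers the non-ASCII characters of the word being formatted, which is why a two-character name yields "position 0-1".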

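Fix:

Add the following at the top of the program (a Python 2-only workaround that changes the default encoding for implicit str/unicode conversions from ASCII to UTF-8):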

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Output after the fix:

F:newtestvenvvenvScriptspython.exe F:/Python_test/dictionary_test/CalThreeKingdomsV1.py
Building prefix dict from the default dictionary ...
Loading model from cache c:userszhangappdatalocaltempjieba.cache
Loading model cost 1.482 seconds.
Prefix dict has been built succesfully.
曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358

Process finished with exit code 0

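The script now runs, but the ranking is not a ranking of character names: words that are not names (将军, 却说, 不可) appear in the list, and the same character is counted under several entries (孔明 and 孔明曰, 玄德 and 玄德曰). The next version addresses this.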

CalThreeKingdomsV2: improved version of the Chinese word count

#coding=utf-8
#CalThreeKingdomsV2.py
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}   # frequent words that are not names
txt = open("threekingdoms.txt", "r").read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:   # skip single-character tokens
        continue
    elif word == "诸葛亮" or word == "孔明曰":   # merge alternative names into one entry
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:   # drop the excluded words from the result
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # sort by count, highest first
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))

Problem:

File "F:/Python_test/dictionary_test/CalThreeKingdomsV2.py", line 26, in <module>
    del counts[word]
KeyError: '\xe4\xba\x8c\xe4\xba\xba'

Process finished with exit code 1
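
The key in the KeyError is a UTF-8 byte sequence. A short Python 3 sketch, with hypothetical variable names and assuming the key is the byte string \xe4\xba\x8c\xe4\xba\xba, shows which word it encodes and why a byte-string key does not match a text key in a dictionary:

```python
# The bytes reported in the KeyError, written with backslash escapes.
raw_key = b"\xe4\xba\x8c\xe4\xba\xba"

# Decoding them as UTF-8 recovers the word.
decoded = raw_key.decode("utf-8")

# A dictionary keyed by text strings, as the word counter is.
counts = {"二人": 469}

if __name__ == "__main__":
    print(decoded)             # the word behind the bytes
    print(raw_key in counts)   # looking up the bytes key fails
    print(decoded in counts)   # looking up the text key succeeds
```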

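Fix:

This is an encoding problem. The key in the KeyError is the UTF-8 byte string for 二人, one of the literals in excludes, while the keys of counts are presumably the unicode strings returned by jieba.lcut. Under Python 2 a byte string and a unicode string with non-ASCII content do not match as dictionary keys, so del fails. Writing the literals in excludes as unicode strings (for example u"二人") should make the keys match.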
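
A minimal Python 3 sketch of the V2 logic, where every string is text and the mismatch cannot arise. It takes an already segmented word list (for example the result of jieba.lcut(txt)) so that it runs without jieba; the names ALIASES, EXCLUDES and count_names are assumptions made for this illustration. dict.pop with a default replaces del, so an excluded word that never occurs in the text does not raise KeyError.

```python
from collections import Counter

# Alternative names mapped to one canonical name, as in CalThreeKingdomsV2.py.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names.
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_names(words, n=10):
    """Count multi-character words, merging aliases and dropping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # skip single-character tokens
        counts[ALIASES.get(word, word)] += 1
    for word in EXCLUDES:
        counts.pop(word, None)  # no KeyError if the word is absent
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["玄德", "曰", "孔明", "孔明曰", "却说", "曹操", "丞相", "玄德曰"]
    for word, count in count_names(sample):
        print("{0:<10}{1:>5}".format(word, count))
```

Keeping the aliases in a dictionary rather than an elif chain means a new alternative name is a one-line change.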

 

 

Notes:

https://blog.csdn.net/weixin_39221360/article/details/79525341

https://blog.csdn.net/qq_34739497/article/details/78001488

 

 

 

 

 
