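You have a directory holding a month of your diary entries, all of them txt files. To avoid word-segmentation problems, assume the content is all English. Find the word you consider most important in each entry.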


I'm 娇气洋葱, a blogger on 靠谱客. This post works through the exercise above and is shared here as a reference.

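Analysis:

The directory is one level deep and contains only txt files, so the conditions are simple. The first step is to walk the directory and collect the names of all the txt files. Here we use the glob module; the os module's listdir() function would also work (only under these simple conditions, otherwise the paths have to be filtered).
With each file name in hand, read the content, count the words, and take the most frequent one. Here we use the Counter class from the collections module.
Words in the text are case-sensitive, so convert everything to lowercase before counting.
Remove articles, quantifiers, and similar function words so that they do not skew the result.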
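
The os.listdir() alternative mentioned above needs the filtering done by hand. A minimal sketch, assuming a hypothetical helper name list_txt_listdir that is not part of the original solution:

```python
import os


def list_txt_listdir(directory="."):
    """Return the .txt files directly inside `directory`, sorted by name."""
    paths = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # listdir() returns files and subdirectories alike, so check both
        # the extension and that the entry is a regular file.
        if name.endswith(".txt") and os.path.isfile(path):
            paths.append(path)
    return sorted(paths)
```

Unlike glob.glob("*.txt"), this returns paths that include the directory prefix, so they can be opened from any working directory.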

Implementation

import glob
from collections import Counter
import re

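# Get the list of all file names ending in .txt in the current directory.
# The conditions here are simple, so os.listdir() would work as well.
def list_txt():
    return glob.glob("*.txt")


def wc(filename):
    # Function words that should never count as the most important word
    exclude_words = ['the', 'in', 'of', 'and', 'to', 'has', 'that', 'this','s', 'is', 'are', 'a', 'with', 'as', 'an']
    datalist = []
    with open(filename, 'r') as f:
        for line in f:
            # Lowercase the line and strip double quotes, commas, and periods
            content = re.sub(r'"|,|\.', "", line.lower())
            # split() with no argument splits on any whitespace and drops empty strings
            datalist.extend(content.split())

    wordlst = Counter(datalist)
    for word in exclude_words:
        # Zero the count so an excluded word can never rank first
        wordlst[word] = 0
    return wordlst.most_common(1)


def most_comm():
    all_txt = list_txt()  # os.listdir("./")
    for txt in all_txt:
        print(wc(txt))

if __name__ == "__main__":
    # most_comm()
    # A single map() call saves a lot of work; list() forces evaluation on Python 3
    print(list(map(wc, list_txt())))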
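
As a variation on the code above, the words can be extracted with re.findall and the excluded words filtered out before counting, so that punctuation and empty strings never reach the Counter. This is only a sketch: STOP_WORDS and top_word are hypothetical names, and the word list is copied from the exclude list above.

```python
import re
from collections import Counter

# Hypothetical stop-word set, copied from the exclude list used above.
STOP_WORDS = {'the', 'in', 'of', 'and', 'to', 'has', 'that', 'this', 's',
              'is', 'are', 'a', 'with', 'as', 'an'}


def top_word(text):
    """Return (word, count) for the most frequent non-stop word, or None."""
    # [a-z']+ keeps only letters and apostrophes, so punctuation is dropped.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    top = counts.most_common(1)
    return top[0] if top else None


print(top_word("The cat sat. The cat slept, and the dog barked."))
# prints ('cat', 2)
```

Filtering first also means an entry made up only of stop words yields None rather than a stop word with a count of 0.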

Further reading

1. glob

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

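In short, glob uses pattern matching to return the list of file names under a given path, and the order of that list is not guaranteed.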

For example, consider a directory containing only the following files: 1.gif, 2.txt, and card.gif. glob() will produce the following results.

>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
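
Because the results come back in arbitrary order, it can help to sort them before processing a month of entries. A small sketch, assuming a throwaway directory and made-up file names:

```python
import glob
import os
import tempfile

# Build a temporary directory so the example does not depend on real files.
diary_dir = tempfile.mkdtemp()
for name in ("2.txt", "10.txt", "1.txt", "notes.md"):
    open(os.path.join(diary_dir, name), "w").close()

# Join the directory and the pattern, then sort for a repeatable order.
paths = sorted(glob.glob(os.path.join(diary_dir, "*.txt")))
print([os.path.basename(p) for p in paths])
# prints ['1.txt', '10.txt', '2.txt']
```

The sort is lexicographic, which is why 10.txt lands before 2.txt; zero-padded names such as 02.txt avoid that.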

If the directory contains files starting with . they won’t be matched by default. For example, consider a directory containing card.gif and .card.gif:
That is, files whose names start with "." are skipped unless the pattern itself starts with a dot.

>>> import glob
>>> glob.glob('*.gif')
['card.gif']
>>> glob.glob('.c*')
['.card.gif']
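
To collect both visible and dot-prefixed files, one option is to run two patterns and merge the results. A sketch under the same two-file assumption as the example above, using a temporary directory:

```python
import glob
import os
import tempfile

# Recreate the two-file directory from the example in a temporary location.
d = tempfile.mkdtemp()
for name in ("card.gif", ".card.gif"):
    open(os.path.join(d, name), "w").close()

# '*.gif' skips dot-files; '.*.gif' matches only names that start with a dot.
visible = glob.glob(os.path.join(d, "*.gif"))
hidden = glob.glob(os.path.join(d, ".*.gif"))

all_gifs = sorted(os.path.basename(p) for p in visible + hidden)
print(all_gifs)
# prints ['.card.gif', 'card.gif']
```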

2. collections

High-performance container datatypes

Counter objects

Counter makes tallying quick and convenient, for example:

>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

>>> # elements() repeats each key as many times as its count; counts below 1 are skipped
>>> c = Counter(a=4, b=2, c=0, d=-2)
>>> list(c.elements())
['a', 'a', 'a', 'a', 'b', 'b']
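
The explicit loop in the first example is not required. As a brief sketch, a Counter can also be built straight from an iterable and extended later with update():

```python
from collections import Counter

# Passing an iterable counts every element in one step.
cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])

# update() adds counts from another iterable, for example the next line of a file.
cnt.update(['red', 'yellow'])

# Missing keys report 0 instead of raising KeyError.
print(cnt['red'], cnt['yellow'], cnt['purple'])
# prints 3 1 0
```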

most_common([n])

Return a list of the n most common elements and their counts from the most common to the least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered arbitrarily:

>>> Counter('abracadabra').most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
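
Since the order of ties is not something to rely on, a repeatable ranking can be produced by sorting the items explicitly. A minimal sketch, where top_n_stable is a hypothetical helper and not part of the collections module:

```python
from collections import Counter


def top_n_stable(counter, n):
    """Return the n most common (item, count) pairs, ties broken alphabetically."""
    # Sort by descending count first, then by the item itself.
    ranked = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]


print(top_n_stable(Counter('abracadabra'), 3))
# prints [('a', 5), ('b', 2), ('r', 2)]
```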

For more usage, see the official documentation: https://docs.python.org/2/library/collections.html

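3. re

Python's re module provides re.sub for replacing matches in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern : the regular expression pattern string.
repl : the replacement string; it can also be a function.
string : the original string to search and replace in.
count : the maximum number of replacements to make; the default 0 replaces all matches.
flags : optional regular expression flags such as re.IGNORECASE; the default 0 sets none.

The following example multiplies every matched number in the string by 2: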
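
The count and flags parameters are easiest to see side by side. A short sketch with a made-up input string:

```python
import re

text = "Mon, Tue, Wed, Thu"

# count=0 (the default) replaces every match.
all_replaced = re.sub(r",", ";", text)

# count=2 stops after the first two matches.
first_two = re.sub(r",", ";", text, count=2)

# flags is passed by keyword; re.IGNORECASE lets "mon" match "Mon".
ignore_case = re.sub(r"mon", "Sun", text, flags=re.IGNORECASE)

print(all_replaced)   # prints Mon; Tue; Wed; Thu
print(first_two)      # prints Mon; Tue; Wed, Thu
print(ignore_case)    # prints Sun, Tue, Wed, Thu
```

Passing count and flags by keyword avoids mixing them up, since both are integers.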

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-

    import re

    # Multiply each matched number by 2
    def double(matched):
        value = int(matched.group('value'))
        return str(value * 2)

    s = 'A23G4HFD567'
    # \d+ matches each run of digits; prints A46G8HFD1134
    print(re.sub(r'(?P<value>\d+)', double, s))
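
When no computation is needed, repl can stay a plain string and refer back to the named group with \g<value>. A small sketch reusing the same input string:

```python
import re

s = 'A23G4HFD567'

# \g<value> inserts whatever the named group "value" matched.
bracketed = re.sub(r'(?P<value>\d+)', r'[\g<value>]', s)
print(bracketed)
# prints A[23]G[4]HFD[567]
```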