
Two Ways to Count Word Frequencies in Python

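Word frequency counting means tallying how many times each word appears in a text, and it is one of the most basic tasks in text processing. Python offers several ways to do it, using data structures such as dictionaries, lists, or the Counter class.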

I. Using a Dictionary

Counting with a dictionary is one of the most basic approaches. The steps are as follows:

1. Convert the text to lowercase and split it into a list of words.

text = "This is a sample text with several words. Here are some more words. And here are some more."  
words = text.lower().split()

2. Create an empty dictionary to store the number of occurrences of each word.

word_counts = {}

3. Iterate over the word list. If a word is already in the dictionary, increment its count by 1; otherwise, add it to the dictionary with a count of 1.

for word in words:  
    if word in word_counts:  
        word_counts[word] += 1  
    else:  
        word_counts[word] = 1

4. Print the frequency of each word.

for word, count in word_counts.items():  
    print(word, count)

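The output is:

this 1
is 1
a 1
sample 1
text 1
with 1
several 1
words. 2
here 2
are 2
some 2
more 1
and 1
more. 1

Note that "more" and "more." are counted separately, since the trailing period is part of the token.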
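
Because split() only splits on whitespace, punctuation stays attached to the tokens. A minimal sketch of one way to handle that, assuming it is acceptable to strip leading and trailing ASCII punctuation with string.punctuation (the helper name count_words is hypothetical and not part of the original code). It also uses dict.get as a shorter alternative to the if/else branch in step 3:

```python
import string


def count_words(text):
    """Count words after lowercasing and stripping surrounding punctuation."""
    word_counts = {}
    for token in text.lower().split():
        # Remove punctuation at either end of the token, e.g. "words." -> "words"
        word = token.strip(string.punctuation)
        if not word:  # skip tokens that consisted only of punctuation
            continue
        # dict.get returns 0 for a word not seen yet, so no if/else is needed
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts


text = "This is a sample text with several words. Here are some more words. And here are some more."
counts = count_words(text)
print(counts["words"], counts["more"])  # 2 2
```

With this variant, "more" and "more." fall under the same key. Punctuation inside a word, such as the apostrophe in "don't", is left untouched.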

II. Using the Counter Class

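Besides a plain dictionary, Python's collections module provides the Counter class, which makes it easy to count the occurrences of the elements of any iterable. The code for counting word frequencies with Counter is as follows: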

from collections import Counter  
  
text = "This is a sample text with several words. Here are some more words. And here are some more."  
words = text.lower().split()  
word_counts = Counter(words)  
for word, count in word_counts.items():  
    print(word, count)

The output is the same as that of the dictionary-based version.
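
Counter also provides most_common(n), which returns the n most frequent elements as (element, count) pairs, ordered from most to least frequent. A short sketch using the same sample text (the variable name top_three is only illustrative):

```python
from collections import Counter

text = "This is a sample text with several words. Here are some more words. And here are some more."
word_counts = Counter(text.lower().split())

# The three most frequent tokens as (token, count) pairs
top_three = word_counts.most_common(3)
for word, count in top_three:
    print(word, count)
```

Several tokens tie at a count of 2 here, so which of them make the top three depends on how ties are broken. The Counter documentation states that elements with equal counts are ordered by first occurrence.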

The type of word_counts.items() is dict_items. If later processing needs a plain dictionary, convert it first with dict().
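
To illustrate the conversion, here is a small sketch on a shortened, punctuation-free sample string chosen only for this example. It turns the items() view into a plain dictionary with dict() and then, as one possible follow-up step, sorts the pairs by count. The names plain and by_count are illustrative:

```python
from collections import Counter

words = "here are some more words and here are some more".split()
word_counts = Counter(words)

print(type(word_counts.items()).__name__)  # dict_items

# dict() builds a regular dictionary from the (word, count) pairs
plain = dict(word_counts.items())

# Sort the pairs by count, highest first; sorted() is stable, so ties keep insertion order
by_count = sorted(plain.items(), key=lambda pair: pair[1], reverse=True)
print(by_count[0])  # ('here', 2)
```

Since Counter is itself a subclass of dict, dict(word_counts) gives the same result as dict(word_counts.items()).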
