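Word frequency counting means counting how many times each word appears in a text, and it is one of the most basic tasks in text processing. In Python it can be implemented in several ways, using data structures such as dictionaries, lists, and the Counter class.
I. Using a dictionary
Using a dictionary is one of the most basic approaches. The steps are as follows:
1. Convert the text to lowercase and split it into a list of words.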
text = "This is a sample text with several words. Here are some more words. And here are some more."
words = text.lower().split()
2. Create an empty dictionary to store the number of occurrences of each word.
word_counts = {}
3. Iterate over the word list. If a word is already in the dictionary, increase its count by 1; otherwise add it to the dictionary with a count of 1.
for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
4. Print the frequency of each word.
for word, count in word_counts.items():
    print(word, count)
The output is:
this 1
is 1
a 1
sample 1
text 1
with 1
several 1
words. 2
here 2
are 2
some 2
more 1
and 1
more. 1
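Because split() only breaks on whitespace, a token such as "words." keeps its trailing period and is counted as a different word from "words". The following is a minimal sketch of one way to normalize tokens, assuming it is enough to strip leading and trailing ASCII punctuation with string.punctuation. The helper name count_words is hypothetical, and dict.get is used here in place of the if/else branch from step 3.

```python
import string

def count_words(text):
    """Count words, ignoring case and surrounding ASCII punctuation."""
    counts = {}
    for token in text.lower().split():
        # Remove punctuation only at the edges, so "words." becomes "words".
        word = token.strip(string.punctuation)
        if not word:
            continue  # the token consisted of punctuation only
        # get() returns 0 for a word that has not been seen yet.
        counts[word] = counts.get(word, 0) + 1
    return counts

text = "This is a sample text with several words. Here are some more words. And here are some more."
word_counts = count_words(text)
print(word_counts["words"], word_counts["more"])  # 2 2
```

Stripping only the edges leaves internal characters such as apostrophes or hyphens untouched; whether that is the right choice depends on the text being processed.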
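II. Using the Counter class
Besides a plain dictionary, Python's collections module provides the Counter class, which counts the occurrences of the elements in an iterable. The word frequency count with Counter looks like this: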
from collections import Counter
text = "This is a sample text with several words. Here are some more words. And here are some more."
words = text.lower().split()
word_counts = Counter(words)
for word, count in word_counts.items():
    print(word, count)
The output is the same as that of the dictionary implementation above.
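The equivalence of the two approaches can be checked directly. This is a small sketch rather than part of the original walkthrough; the variable names manual_counts, counter_counts, and same are chosen here for illustration.

```python
from collections import Counter

text = "This is a sample text with several words. Here are some more words. And here are some more."
words = text.lower().split()

# Manual counting with a plain dictionary.
manual_counts = {}
for word in words:
    manual_counts[word] = manual_counts.get(word, 0) + 1

# Counting with Counter.
counter_counts = Counter(words)

# Compare the contents as plain dictionaries.
same = dict(counter_counts) == manual_counts
print(same)  # True
```

Both containers also keep the words in the order in which they were first seen, which is why the printed lines appear in the same order.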
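word_counts.items() returns an object of type dict_items, which can be converted to a dictionary with dict() before further processing. Since Counter is itself a subclass of dict, dict(word_counts) gives the same result.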
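As a short sketch of such follow-up processing, the example below converts the counts to a plain dictionary, sorts them, and uses the most_common method of Counter to get the most frequent words. The names plain_counts, by_count, and top_three are illustrative only.

```python
from collections import Counter

text = "This is a sample text with several words. Here are some more words. And here are some more."
word_counts = Counter(text.lower().split())

# Convert the dict_items view into a plain dictionary.
plain_counts = dict(word_counts.items())

# Sort by descending count, then alphabetically to break ties.
by_count = sorted(plain_counts.items(), key=lambda item: (-item[1], item[0]))
print(by_count[0])  # ('are', 2)

# Counter can also report the n most frequent elements directly.
top_three = word_counts.most_common(3)
print(top_three)
```

most_common returns a list of (word, count) tuples, so it is often enough on its own when only the top entries are needed.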