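jieba word segmentation
Description
Students whose ID ends in 1, 2, or 3: segment the text of Journey to the West (西游记) and list the 20 most frequent words.
Students whose ID ends in 4, 5, or 6: segment the text of Dream of the Red Chamber (红楼梦) and list the 20 most frequent words.
Students whose ID ends in 7, 8, 9, or 0: segment the text of Strange Tales from a Chinese Studio (聊斋) and list the 20 most frequent words.
Different names for the same character must be merged into one entry. For example, 孙猴子 and 孙悟空 count as the same person.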
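One way to handle the merging rule is a lookup table from each alternative name to a single canonical name. The sketch below is only an illustration: the `ALIASES` table and the helper names `canonical` and `count_words` are made up for this example, and the table holds just the two pairs mentioned in this post. It works on an already segmented word list, so it runs without jieba.

```python
# Hypothetical alias table: each alternative name maps to one canonical name.
# Only the pairs mentioned in this post are listed; extend it per novel.
ALIASES = {
    "孙猴子": "孙悟空",
    "老太太": "贾母",
}

def canonical(word):
    """Return the canonical name for word, or word itself if it has no alias."""
    return ALIASES.get(word, word)

def count_words(words):
    """Count multi-character words, merging aliases into one entry."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue  # skip single characters and punctuation
        name = canonical(word)
        counts[name] = counts.get(name, 0) + 1
    return counts

# A tiny hand-made word list standing in for the output of jieba.lcut().
sample = ["孙悟空", "孙猴子", "的", "贾母", "老太太", "贾母"]
print(count_words(sample))
```

Compared with a chain of `elif` branches, a table keeps the counting loop unchanged when more aliases are added.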
Python code
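import jieba

# Common words that are not character names; dropped from the ranking.
excludes = {"什么","一个","我们","那里","如今","你们","说道","知道","起来","姑娘"}

# 'ANSI' is only available on Windows (it means the system code page).
# On other systems, pass the file's real encoding, e.g. 'gb18030' or 'utf-8'.
with open("D:/python--wj/dm/红楼梦.txt", "r", encoding='ANSI') as f:
    txt = f.read()

words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single characters and punctuation
    elif word == "贾母" or word == "老太太":
        rword = "贾母"  # both names refer to the same character
    else:
        rword = word
    # Count under the merged name (the original counted `word`, so the merge had no effect).
    counts[rword] = counts.get(rword, 0) + 1

for word in excludes:
    counts.pop(word, None)  # no KeyError if the word never appeared

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(min(20, len(items))):  # the assignment asks for the top 20
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))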
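As an alternative to sorting the dictionary by hand, the standard library's `collections.Counter` can do the counting and the ranking. This is a rough sketch, not part of the original assignment code: the helper name `top_words` is made up here, and the input is a small hand-made word list in place of the output of `jieba.lcut()`.

```python
from collections import Counter

def top_words(words, excludes, n=20):
    """Return the n most frequent multi-character words not in excludes."""
    counter = Counter(w for w in words if len(w) > 1 and w not in excludes)
    return counter.most_common(n)

# A tiny hand-made word list standing in for the output of jieba.lcut().
sample = ["贾母", "什么", "贾母", "宝玉", "的", "宝玉", "贾母"]
for word, count in top_words(sample, {"什么"}, n=2):
    print("{0:<10}{1:>5}".format(word, count))
```

Filtering the excluded words before counting also avoids the separate deletion loop.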