Word Count Programming Practice - 526互联

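To write a MapReduce word count program, you need Hadoop or another MapReduce framework. Below is a simple Python example that uses Hadoop Streaming to run the word count job. Make sure Hadoop is installed and Hadoop Streaming is configured before you start.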

Assuming you have already created two text files, wordfile1.txt and wordfile2.txt, you can write the following Map and Reduce scripts:

mapper.py:

python

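#!/usr/bin/env python3
# python3 is required: the script uses f-strings

import sys
import re

# Regular expression that matches runs of word characters (letters, digits, underscore)
word_pattern = re.compile(r'\w+')

for line in sys.stdin:
    # Strip leading and trailing whitespace and convert to lowercase
    line = line.strip().lower()
    words = word_pattern.findall(line)
    for word in words:
        # Emit the key-value pair (word, 1), separated by a tab
        print(f"{word}\t1")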
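
Before running the mapper on real data, it can help to see what the `\w+` pattern actually treats as a word. The sketch below reuses the same pattern on a made-up sample sentence; the `tokenize` helper and the sample text are illustrative only and are not part of mapper.py.

```python
import re

# Same pattern as mapper.py: runs of letters, digits, and underscores
word_pattern = re.compile(r'\w+')

def tokenize(line):
    """Return the lowercase tokens mapper.py would emit for one input line."""
    return word_pattern.findall(line.strip().lower())

# Hypothetical sample line, chosen to show the edge cases
sample = "Hello, Hadoop! Don't panic: hello_world 2024."
for word in tokenize(sample):
    print(f"{word}\t1")
```

Note that the apostrophe splits "Don't" into "don" and "t", while "hello_world" stays one token and "2024" is counted as a word. If that is not what you want, the pattern is the place to adjust it.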

reducer.py:

python

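#!/usr/bin/env python3
# python3 is required: the script uses f-strings

import sys

current_word = None
current_count = 0

# Hadoop Streaming sorts mapper output by key, so all lines for a word arrive consecutively
for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:
        continue  # skip blank or malformed lines

    if current_word == word:
        current_count += count
    else:
        # A new word begins: emit the total for the previous one
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the total for the last word
if current_word is not None:
    print(f"{current_word}\t{current_count}")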
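
The reducer only works because its input arrives sorted by word. To check the logic without a cluster, you can imitate the whole map, sort, and reduce flow in a single Python process. This is a minimal sketch, not how Hadoop runs the job: `map_phase`, `reduce_phase`, `word_count`, and the sample lines are hypothetical names and data made up for illustration, and the built-in `sorted` stands in for Hadoop's shuffle-and-sort step.

```python
import re
from itertools import groupby

word_pattern = re.compile(r'\w+')

def map_phase(lines):
    """Yield (word, 1) pairs, mirroring what mapper.py prints."""
    for line in lines:
        for word in word_pattern.findall(line.strip().lower()):
            yield word, 1

def reduce_phase(sorted_pairs):
    """Sum the counts of consecutive pairs that share a word, like reducer.py."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

def word_count(lines):
    # sorted() stands in for Hadoop's shuffle-and-sort between map and reduce
    pairs = sorted(map_phase(lines))
    return list(reduce_phase(pairs))

# Hypothetical input, standing in for the contents of the two word files
sample_lines = ["Hello Hadoop", "hello MapReduce"]
for word, total in word_count(sample_lines):
    print(f"{word}\t{total}")
```

If you remove the `sorted` call, the two occurrences of "hello" are no longer adjacent and get reported separately, which is exactly the failure reducer.py would show on unsorted input. The same check can be done on the command line by piping the input files through mapper.py, sort, and reducer.py.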

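Next, run the MapReduce job with Hadoop Streaming. The input files must already be uploaded to HDFS, and because the command invokes the two scripts directly, they should be executable (chmod +x mapper.py reducer.py):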

bash

hadoop jar /path/to/hadoop-streaming.jar \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input wordfile1.txt,wordfile2.txt \
-output wordcount_output

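This runs your MapReduce job on the Hadoop cluster and counts how often each word occurs. The results are written to the wordcount_output directory, which must not exist before the job starts.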
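
Once the job has finished, each line of the output has the form word, tab, count. As a rough sketch of how you might inspect the result, the code below parses such lines and lists the most frequent words. It assumes you have already copied the output to your local machine; `parse_counts`, `top_words`, and the sample lines are hypothetical and only illustrate the format.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a dict, summing repeated words."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        word, count = line.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

def top_words(counts, n=3):
    """Return the n most frequent words; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]

# Hypothetical lines in the format the reducer writes
sample_output = ["hadoop\t1\n", "hello\t2\n", "mapreduce\t1\n"]
print(top_words(parse_counts(sample_output), n=2))
```

To use it on real results, pass an open file object for a local copy of the output instead of `sample_output`.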