We know that MapReduce is the core of the Hadoop elephant: in Hadoop, data processing is built around the MapReduce programming model. A MapReduce job usually splits the input data set into several independent blocks, which the map tasks process in a completely parallel manner. The framework sorts the map output and then feeds the results to the reduce tasks. Typically both the input and the output of a job are stored in the file system, so our programming work centers on the mapper stage and the reducer stage.
Let's develop a MapReduce program from scratch and run it on a Hadoop cluster.
The mapper code, map.py:
import sys

# Emit a "word<TAB>1" pair for every word on standard input.
for line in sys.stdin:
    word_list = line.strip().split(' ')
    for word in word_list:
        word = word.strip()
        if word:  # skip empty tokens produced by consecutive spaces
            print('\t'.join([word, '1']))
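Before wiring the two stages together, you can sanity-check the mapper on its own (a quick local test; it simply prints one word/1 pair per input word):

echo 'hello ni hao' | python map.py
# Expected output:
# hello	1
# ni	1
# hao	1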
The reducer code, reduce.py:
import sys

# Hadoop Streaming delivers the mapper output sorted by key, so all
# lines for the same word arrive consecutively; we only need to detect
# the point where the key changes.
cur_word = None
cur_sum = 0

for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) < 2:
        continue
    word = ss[0].strip()
    count = ss[1].strip()
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        # Key changed: emit the finished word and reset the counter.
        print('\t'.join([cur_word, str(cur_sum)]))
        cur_word = word
        cur_sum = 0
    cur_sum += int(count)

# Emit the final word, if any input was seen at all.
if cur_word is not None:
    print('\t'.join([cur_word, str(cur_sum)]))
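The reducer can be tested the same way by piping in a small hand-sorted sample (keys must arrive grouped, just as Hadoop delivers them):

printf 'hao\t1\nhao\t1\nni\t1\n' | python reduce.py
# Expected output:
# hao	2
# ni	1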
Resource file src.txt (for testing; remember to upload it to HDFS when running on the cluster):
hello ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao Dad would get out his mandolin and play for the family Dad loved to play the mandolin for his family he knew we enjoyed singing I had to mature into a man and have children of my own before I realized how much he had sacrificed I had to,mature into a man and,have children of my own before.I realized how much he had sacrificed
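The run.sh script below reads its input from /home/input/src.txt on HDFS, so upload the test file first. A sketch, reusing the hadoop binary path from the script and assuming /home/input does not exist yet; adjust both to your install:

/home/hadoop/hadoop/bin/hadoop fs -mkdir /home/input
/home/hadoop/hadoop/bin/hadoop fs -put ./src.txt /home/input/src.txt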
First debug locally to see if the result is correct. Enter the following command:
cat src.txt | python map.py | sort -k 1 | python reduce.py
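One caveat: sort stands in for Hadoop's shuffle here, but Hadoop orders keys bytewise, while a locale-aware sort may interleave upper- and lowercase words differently (in the output below, Dad follows children). The counts are identical either way; to mimic the cluster's ordering exactly you can force byte order, an optional refinement that is not part of the original pipeline:

cat src.txt | python map.py | LC_ALL=C sort -k 1,1 | python reduce.py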
The output on the command line:
a	2
and	2
and,have	1
ao	1
before	1
before.I	1
children	2
Dad	2
enjoyed	1
family	2
for	2
get	1
had	4
hao	33
haoao	1
haoni	3
have	1
he	3
hello	1
his	2
how	2
I	3
into	2
knew	1
loved	1
man	2
mandolin	2
mature	1
much	2
my	2
ni	34
of	2
out	1
own	2
play	2
realized	2
sacrificed	2
singing	1
the	2
to	2
to,mature	1
we	1
would	1
Local debugging shows the code is OK, so let's throw it onto the cluster and run it there. For convenience, I wrote a small script, run.sh, to save some manual labor.
HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"

INPUT_FILE_PATH="/home/input/src.txt"
OUTPUT_PATH="/home/output"

$HADOOP_CMD fs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python reduce.py" \
    -file ./map.py \
    -file ./reduce.py
Let's walk through the script:
HADOOP_CMD: path to the hadoop binary.
STREAM_JAR_PATH: path to the Hadoop Streaming jar.
INPUT_FILE_PATH: input path on the cluster (HDFS).
OUTPUT_PATH: output path for the results on the cluster. Note: this directory must not already exist when the job runs, which is why the script deletes it first. Be careful: on the very first run the directory does not exist yet, so the rmr command will report an error; you can create an output directory manually beforehand.
The jar invocation follows a fixed format: it specifies the input and output paths and the mapper and reducer commands, and the -file options ship our map.py and reduce.py to the cluster, because the other nodes do not yet have these executables.
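When the job finishes, you can pull the result back from HDFS to check it. A sketch reusing the paths from run.sh; a single reducer conventionally writes its output to a file named part-00000:

/home/hadoop/hadoop/bin/hadoop fs -cat /home/output/part-00000
# Or copy the whole output directory to the local filesystem:
/home/hadoop/hadoop/bin/hadoop fs -get /home/output ./output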
Enter the following command to count the records output by the reduce phase:
cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l
The command line outputs: 43
Enter master:50030 in the browser to view the details of the task (the JobTracker web UI):
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      2           0         0         2          0        0 / 0
reduce   100.00%      1           0         0         1          0        0 / 0
Under Map-Reduce Framework we see this:
Counter                  Map   Reduce   Total
Reduce output records    0     0        43
This matches the 43 records we counted locally, proving that the entire process succeeded. The development of our first Hadoop program is complete.