
Python development MapReduce series WordCount Demo

ringa_lee
2017-09-17 09:28:38

MapReduce is the core of Hadoop (the elephant): in Hadoop, data processing is built on the MapReduce programming model. A MapReduce job usually splits the input data set into independent blocks, which map tasks process fully in parallel. The framework sorts the map output and then feeds it to the reduce tasks. The input and output of a job are typically stored in a file system. As a result, most of our programming effort goes into the mapper stage and the reducer stage.
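To make that flow concrete, here is a minimal in-process sketch of the three stages (map, sort, reduce) in plain Python. It does not use Hadoop at all; the function names `map_phase`, `reduce_phase`, and `word_count` are made up for illustration, and a real job would run the map and reduce steps as separate tasks on different nodes.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(sorted_pairs):
    """Sum the counts of adjacent pairs that share the same word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map, then sort by key (the framework's job), then reduce."""
    pairs = sorted(map_phase(lines), key=itemgetter(0))
    return list(reduce_phase(pairs))


if __name__ == "__main__":
    sample = ["ni hao ni hao", "hello ni"]
    for word, count in word_count(sample):
        print("%s\t%d" % (word, count))
```

The sort between the two phases is what lets the reduce step see all pairs for one word next to each other, which is the property the reducer later in this article depends on.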

Let's develop a MapReduce program from scratch and run it on a Hadoop cluster.
Mapper code, map.py:

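    import sys

    # Read input lines from stdin and emit "word<TAB>1" for every word.
    for line in sys.stdin:
        word_list = line.strip().split(' ')
        for word in word_list:
            # print(...) with one argument works in both Python 2 and Python 3.
            print('\t'.join([word.strip(), str(1)]))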



Reducer code, reduce.py:

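    import sys

    cur_word = None  # word whose counts are currently being accumulated
    total = 0        # running count for cur_word

    # Input arrives sorted by word, so all lines for one word are adjacent.
    for line in sys.stdin:
        ss = line.strip().split('\t')
        if len(ss) < 2:
            continue  # skip malformed lines

        word = ss[0].strip()
        count = ss[1].strip()

        if cur_word is None:
            cur_word = word
        if cur_word != word:
            # The word changed: emit the finished total and start a new one.
            print('\t'.join([cur_word, str(total)]))
            cur_word = word
            total = 0

        total += int(count)

    # Emit the last word, unless there was no input at all.
    if cur_word is not None:
        print('\t'.join([cur_word, str(total)]))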
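The same reducer logic can also be written with `itertools.groupby`, which relies on the same assumption that identical words arrive on adjacent lines. This is an alternative sketch, not the code used in this article; `parse` and `reduce_lines` are hypothetical helper names.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) < 2:
            continue
        yield parts[0].strip(), int(parts[1].strip())


def reduce_lines(lines):
    """Yield 'word<TAB>total' for each run of adjacent identical words."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield '\t'.join([word, str(sum(count for _, count in group))])


if __name__ == "__main__":
    # Small in-memory demo; the lines are already sorted by word.
    demo = ["hao\t1\n", "hao\t1\n", "ni\t1\n"]
    for out in reduce_lines(demo):
        print(out)
```

To use it as a streaming reducer, the demo list would be replaced by `sys.stdin` (after `import sys`), since `reduce_lines` accepts any iterable of lines.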



Resource file src.txt (used for testing; remember to upload it to HDFS before running on the cluster):

    hello
    ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao
    Dad would get out his mandolin and play for the family
    Dad loved to play the mandolin for his family he knew we enjoyed singing
    I had to mature into a man and have children of my own before I realized how much he had sacrificed
    I had to,mature into a man and,have children of my own before.I realized how much he had sacrificed


First debug locally to see if the result is correct. Enter the following command:

cat src.txt | python map.py | sort -k 1 | python reduce.py

Output on the command line:

    a    2
    and    2
    and,have    1
    ao    1
    before    1
    before.I    1
    children    2
    Dad    2
    enjoyed    1
    family    2
    for    2
    get    1
    had    4
    hao    33
    haoao    1
    haoni    3
    have    1
    he    3
    hello    1
    his    2
    how    2
    I    3
    into    2
    knew    1
    loved    1
    man    2
    mandolin    2
    mature    1
    much    2
    my    2
    ni    34
    of    2
    out    1
    own    2
    play    2
    realized    2
    sacrificed    2
    singing    1
    the    2
    to    2
    to,mature    1
    we    1
    would    1
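The result contains tokens such as `and,have`, `before.I`, and `to,mature` because map.py splits lines on spaces only. If punctuation should also separate words, one option is to extract words with a regular expression. This is an assumption about the desired behavior, not part of the original program, and `tokenize` is a hypothetical helper name.

```python
import re

# A word is a run of letters, digits, or apostrophes (an illustrative choice).
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(line):
    """Return the words in a line, treating punctuation as a separator."""
    return WORD_RE.findall(line)


if __name__ == "__main__":
    print(tokenize("I had to,mature into a man and,have children before.I realized"))
```

Using such a tokenizer in the mapper would change the counts shown above, because the fused tokens would be counted under their component words instead.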


Local debugging shows that the code works, so we can run it on the cluster. For convenience, I wrote a script, run.sh, to save manual work.

    HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
    STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"

    INPUT_FILE_PATH="/home/input/src.txt"
    OUTPUT_PATH="/home/output"

    $HADOOP_CMD fs -rmr $OUTPUT_PATH

    $HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py

Let’s analyze the script below:

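    HADOOP_CMD: path to the hadoop command in Hadoop's bin directory
    STREAM_JAR_PATH: path to the streaming jar
    INPUT_FILE_PATH: input path of the resource file on the Hadoop cluster
    OUTPUT_PATH: output path for the results on the Hadoop cluster. (Note: this directory must not already exist, so the script deletes it first. Important: on the first run the directory does not exist yet, so an error is reported. You can manually create a new output directory first.)
    $HADOOP_CMD fs -rmr $OUTPUT_PATH

    $HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py
    # This is a fixed format: it specifies the input and output paths and the mapper and reducer commands,
    # and distributes the user-written mapper and reducer code files, because the other nodes in the cluster do not yet have the mapper and reducer executables.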
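For readers who prefer to drive the job from Python, the same streaming command can be assembled as an argument list and handed to `subprocess`. This is a sketch that reuses only the paths and options from run.sh; `build_streaming_command` is a hypothetical helper, and the call that would actually launch Hadoop is left commented out because it needs a working cluster.

```python
import subprocess

HADOOP_CMD = "/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH = "/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"


def build_streaming_command(input_path, output_path):
    """Return the streaming job command from run.sh as an argument list."""
    return [
        HADOOP_CMD, "jar", STREAM_JAR_PATH,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python map.py",
        "-reducer", "python reduce.py",
        "-file", "./map.py",
        "-file", "./reduce.py",
    ]


if __name__ == "__main__":
    cmd = build_streaming_command("/home/input/src.txt", "/home/output")
    print(" ".join(cmd))
    # On a machine with Hadoop installed, the job could be launched with:
    # subprocess.call(cmd)
```

Passing the arguments as a list avoids shell quoting issues for values such as `python map.py`, which must reach Hadoop as a single argument.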


Enter the following command to count the records output by the reduce phase locally:

cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l
Output on the command line: 43

Enter master:50030 in a browser to view the details of the job.

Kind      % Complete    Num Tasks    Pending    Running    Complete    Killed    Failed/Killed Task Attempts
map       100.00%       2            0          0          2           0         0 / 0
reduce    100.00%       1            0          0          1           0         0 / 0

The Map-Reduce Framework counters on the same page include:

Counter                  Map    Reduce    Total
Reduce output records    0      0         43


The total of 43 reduce output records matches the 43 lines produced by the local run, which shows that the entire process succeeded. The development of the first Hadoop program is complete.
