Writing Hadoop MapReduce programs using PHP and Shell

Hadoop Streaming enables any executable program that supports standard I/O (stdin, stdout) to act as a Hadoop mapper or reducer. For example:

hadoop jar hadoop-streaming.jar -input SOME_INPUT_DIR_OR_FILE -output SOME_OUTPUT_DIR -mapper /bin/cat -reducer /usr/bin/wc

In this example, the cat and wc tools that come with Unix/Linux are used as mapper/reducer. Isn’t it amazing?
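To get a feel for what this command does, the data flow can be imitated on a single machine as "mapper, then sort, then reducer". The sketch below is only a rough local stand-in, not how Hadoop itself is implemented; the function name simulate_streaming is hypothetical, and a plain sort stands in for the sort-by-key step that Hadoop performs between the map and reduce phases.

```shell
# Hypothetical helper: imitate a streaming job locally as "mapper | sort | reducer".
# $1 is the mapper command, $2 is the reducer command; input comes from stdin.
simulate_streaming() {
    mapper="$1"
    reducer="$2"
    # Left unquoted on purpose so that a command such as "wc -l" splits into words.
    $mapper | LC_ALL=C sort | $reducer
}

# Same mapper/reducer pair as in the hadoop command: cat and wc.
printf 'hello world\nhello hadoop\n' | simulate_streaming cat wc
```

Running a mapper and reducer this way before submitting a job is a cheap check that both programs read stdin and write stdout as expected.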

If you are used to a dynamic language, you can write MapReduce programs in it just as you would write any other program; Hadoop is simply the framework that runs them. Below I demonstrate how to implement a word-count MapReduce job in PHP.

1. Find the Streaming jar

The Hadoop root directory does not contain hadoop-streaming.jar. Because Streaming is a contrib module, the jar lives under the contrib directory. Taking hadoop-0.20.2 as an example, it is here:

$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar

2. Write Mapper

Create a new wc_mapper.php and write the following code:

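#!/usr/bin/php
<?php
// Read the input text line by line from stdin
$in = fopen("php://stdin", "r");
$results = array();
while (($line = fgets($in, 4096)) !== false)
{
    // Split on non-word characters and drop empty tokens
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
        $results[] = $word;
}
fclose($in);
// Emit one "word<TAB>1" line per word
foreach ($results as $key => $value)
{
    print "$value\t1\n";
}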

In short, this code finds the words in each line of input text and outputs them, one per line, with the word and the count 1 separated by a tab:
hello 1
world 1
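Because a streaming mapper is just a filter from stdin to stdout, the same mapping step can also be sketched in shell. This is an illustrative equivalent rather than part of the original article; the function name wc_map is hypothetical, and the character class A-Za-z0-9_ is used as an approximation of what \W excludes for ASCII input.

```shell
# Hypothetical shell counterpart of wc_mapper.php:
# split stdin into words and emit "word<TAB>1" for each one.
wc_map() {
    # tr turns every run of non-word characters into a single newline;
    # awk skips any empty line and appends the tab-separated count.
    tr -cs 'A-Za-z0-9_' '\n' | awk 'length($0) > 0 { printf "%s\t1\n", $0 }'
}

printf 'hello world\n' | wc_map
# prints:
# hello	1
# world	1
```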

This is basically no different from the PHP you have written before, right? Two things may look a little unfamiliar:

PHP as an executable program

The "#!/usr/bin/php" on the first line tells Linux to use the program /usr/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this convention: scripts typically begin with a line such as #!/bin/bash or #!/usr/bin/python.

With this line in place, once the file is saved and marked executable (for example with chmod +x wc_mapper.php), you can run wc_mapper.php directly, just like cat or grep: ./wc_mapper.php
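The shebang mechanism is independent of PHP, so it can be tried with any interpreter. The sketch below is a minimal illustration under that assumption: it writes a tiny /bin/sh script to a temporary file in the current directory, marks it executable, and runs it by path. The function name run_shebang_demo and the file name pattern are hypothetical.

```shell
# Hypothetical demo: create a script with a shebang line, make it executable, run it by path.
run_shebang_demo() {
    script="$(mktemp ./shebang_demo.XXXXXX)"
    # The first line names the interpreter; the second line is the program itself.
    printf '#!/bin/sh\necho "ran via shebang"\n' > "$script"
    chmod +x "$script"
    "$script"
    rm -f "$script"
}

run_shebang_demo
```

Without the chmod step the kernel refuses to execute the file, which is a common reason a freshly written mapper fails to start.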

Use stdin to receive input

PHP supports several ways of receiving input. The most familiar is the $_GET and $_POST superglobals, which hold parameters passed over the Web; the second is $_SERVER['argv'], which holds parameters passed on the command line. Here, standard input (stdin) is used instead.

The effect of its use is:

Enter ./wc_mapper.php in the linux console

wc_mapper.php runs, and the console enters the state of waiting for user keyboard input

User enters text via keyboard

The user presses Ctrl + D to terminate the input, wc_mapper.php starts executing the real business logic and outputs the execution results

So where is stdout? print already writes to stdout, exactly as it did in the web programs and CLI scripts we wrote before.
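In a streaming job nobody types at a keyboard: Hadoop feeds the input through a pipe, and the end of the data plays the role of Ctrl + D. The following shell sketch illustrates the same read-until-EOF pattern, assuming a hypothetical function named count_lines that consumes stdin and only then writes its result to stdout.

```shell
# Hypothetical filter: read stdin until end-of-file, then print how many lines were seen.
count_lines() {
    n=0
    # read returns non-zero at end-of-file, which ends the loop
    # (the same moment Ctrl + D signals in an interactive console).
    while IFS= read -r line; do
        n=$((n + 1))
    done
    echo "$n"
}

# A pipe replaces the keyboard; the end of the data replaces Ctrl + D.
printf 'a\nb\nc\n' | count_lines
# prints: 3
```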

3. Write Reducer

Create a new wc_reducer.php and write the following code:

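#!/usr/bin/php
<?php
// Read the mapper output line by line from stdin
$in = fopen("php://stdin", "r");
$results = array();
while (($line = fgets($in, 4096)) !== false)
{
    // Each input line has the form "word<TAB>count"
    list($key, $value) = preg_split("/\t/", trim($line), 2);
    if (!isset($results[$key]))
        $results[$key] = 0;
    $results[$key] += (int) $value;
}
fclose($in);
// Sort by word, then emit "word<TAB>total"
ksort($results);
foreach ($results as $key => $value)
{
    print "$key\t$value\n";
}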

In short, this code counts how many times each word appears and outputs the totals, one word per line, with the word and its count separated by a tab:
hello 2
world 1
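The reduce step can likewise be sketched in shell, which is handy for checking expected totals on a small sample. This is an illustrative equivalent, not the article's own code; the function name wc_reduce is hypothetical, and it assumes every input line has the form word, tab, count.

```shell
# Hypothetical shell counterpart of wc_reducer.php:
# sum the counts per word and print "word<TAB>total", sorted by word.
wc_reduce() {
    awk -F '\t' '
        { counts[$1] += $2 }
        END { for (w in counts) printf "%s\t%d\n", w, counts[w] }
    ' | LC_ALL=C sort
}

printf 'hello\t1\nworld\t1\nhello\t1\n' | wc_reduce
# prints:
# hello	2
# world	1
```

Like the PHP version, this sketch keeps all totals in memory until the input ends, so it does not rely on the input arriving sorted.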

4. Use Hadoop to run

Upload the sample text to be counted

hadoop fs -put *.TXT /tmp/input

Execute PHP mapreduce program in Streaming mode

hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper absolute path to wc_mapper.php -reducer absolute path to wc_reducer.php

Note:

The input and output directories are paths on HDFS

The mapper and reducer are paths on the local machine. Be sure to use absolute paths rather than relative ones; otherwise Hadoop reports an error saying that the MapReduce program cannot be found.

View results

hadoop fs -cat /tmp/output/part-00000

5. Shell version of Hadoop MapReduce program
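The script in this section loads its settings from config.sh, which the article does not show. The sketch below is a guess at what such a file could contain: the variable names are the ones the script uses, but every value is a hypothetical placeholder. Note that the script concatenates hadoop_home and mapreduce_output_dir directly with what follows, so both are assumed to end with a slash.

```shell
# config.sh (hypothetical values; replace every path with one from your own cluster)
hadoop_home="/usr/local/hadoop/"                 # must end with "/": used as ${hadoop_home}bin/hadoop
streaming_jar_path="${hadoop_home}contrib/streaming/hadoop-0.20.2-streaming.jar"
log_file_dir_in_hdfs="/logs/"                    # HDFS directory that holds the dated log files
mapreduce_output_dir="/tmp/output/"              # must end with "/": the date and bucket are appended
mapper="/path/to/mapper.php"                     # local absolute path to the mapper
reducer="/path/to/reducer.php"                   # local absolute path to the reducer
bucket_list=(bucket_a bucket_b)                  # one streaming job is run per bucket
```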

#!/bin/bash -

# Load configuration file
source './config.sh'

# Process command line parameters
while getopts "d:" arg
do
    case $arg in
        d)
            date=$OPTARG
            ;;
        ?)
            echo "unknown argument"
            exit 1
            ;;
    esac
done

# The default processing date is yesterday
# (BSD/macOS syntax; with GNU date use: date -d yesterday +%Y-%m-%d)
default_date=`date -v-1d +%Y-%m-%d`

# Final processing date. If the date format is incorrect, exit execution
date=${date:-${default_date}}
if ! [[ "$date" =~ ^[12][0-9]{3}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$ ]]
then
    echo "invalid date(yyyy-mm-dd): $date"
    exit 1
fi

# Files to be processed
log_files=$(${hadoop_home}bin/hadoop fs -ls ${log_file_dir_in_hdfs} | awk '{print $8}' | grep $date)

# If the number of files to be processed is zero, exit execution
# (count words, not lines: echo always prints at least one line)
log_files_amount=$(($(echo $log_files | wc -w) + 0))
if [ $log_files_amount -lt 1 ]
then
    echo "no log files found"
    exit 0
fi

# Input file list: one -input option per file
for f in $log_files
do
    input_files_list="${input_files_list} -input $f"
done

function map_reduce () {
    if ${hadoop_home}bin/hadoop jar ${streaming_jar_path} ${input_files_list} -output ${mapreduce_output_dir}${date}/${1}/ -mapper "${mapper} ${1}" -reducer "${reducer}" -file "${mapper}"
    then
        echo "streaming job done!"
    else
        exit 1
    fi
}

# Loop through each bucket
for bucket in ${bucket_list[@]}
do
    map_reduce $bucket
done
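The date handling in the script can be tried on its own without a Hadoop cluster. The sketch below is a portable variation rather than the script's exact code: the helper names is_valid_date and yesterday are hypothetical, grep -E replaces the bash-only [[ =~ ]] test, and the yesterday helper tries the BSD form of date first and falls back to the GNU form.

```shell
# Hypothetical helper: succeed only for a well-formed yyyy-mm-dd string.
# It checks the shape of the string, not the calendar (2014-02-31 still passes).
is_valid_date() {
    printf '%s\n' "$1" | grep -Eq '^[12][0-9]{3}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
}

# Hypothetical helper: print yesterday's date (BSD date first, then GNU date).
yesterday() {
    date -v-1d +%Y-%m-%d 2>/dev/null || date -d yesterday +%Y-%m-%d
}

for candidate in 2014-10-05 2014-13-05 "$(yesterday)"; do
    if is_valid_date "$candidate"; then
        echo "$candidate ok"
    else
        echo "$candidate invalid"
    fi
done
```

Anchoring the pattern with ^ and $ matters: without the anchors, a longer string that merely contains a date-like fragment would also be accepted.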



