Implementation of batch processing in PHP
What should you do if a feature in your web application takes more than 1 or 2 seconds to complete? Some kind of offline processing solution is needed. Learn several ways to serve long-running jobs offline in PHP applications.
Large chain stores have a big problem. Every day, thousands of transactions occur in every store. Company executives want to mine this data. Which products sell well? Which sell poorly? Where do organic products sell well? How are ice cream sales going?
In order to capture this data, organizations must load all transactional data into a data model that is more suitable for generating the types of reports the company requires. However, this takes time, and as the chain grows, it can take more than a day to process a day's worth of data. So, this is a big problem.
Your web application may not need to process this much data, but any site is likely to have some operation that takes longer than your customers are willing to wait. Generally speaking, customers are willing to wait about 200 milliseconds; beyond that, the process feels "slow". This number is based on desktop applications, and the Web has made us more patient. Even so, you shouldn't make your customers wait longer than a few seconds. Therefore, you need some strategies for handling batch jobs in PHP.
Decentralized approach with cron
On UNIX® machines, the core program for executing batch processing is the cron daemon. The daemon reads a configuration file that tells it which command lines to run and how often. The daemon then executes them as configured. When an error is encountered, it can even send error output to a specified email address to help debug the problem.
I know some engineers who strongly advocate the use of threading technology. "Threads! Threads are the real way to do background processing. The cron daemon is outdated."
I don't think so.
I have used both methods. I think cron has the advantage of following the "Keep It Simple, Stupid" (KISS) principle. It keeps background processing simple. Instead of writing a multi-threaded job-processing application that runs all the time (and therefore must never leak memory), you let cron start a simple batch script. This script determines whether there is a job to process, executes the job, and then exits. No need to worry about memory leaks. There's also no need to worry about threads stalling or getting stuck in infinite loops.
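The "check for work, do it, exit" shape of such a script can be sketched as follows. This is only an illustration: run_pending_jobs() and the two callables are hypothetical names, and the in-memory job list stands in for whatever storage a real application would use.

```php
<?php
// Hypothetical helper: run every pending job once, then return so the
// script can exit. $fetchJobs and $handleJob are stand-ins for whatever
// storage and work your application actually uses.
function run_pending_jobs(callable $fetchJobs, callable $handleJob): int
{
    $processed = 0;
    foreach ($fetchJobs() as $job) {
        $handleJob($job);
        $processed++;
    }
    return $processed;
}

// Stand-in job source and handler, for illustration only.
$pending = array('report', 'cleanup');
$done = array();

$count = run_pending_jobs(
    function () use ($pending) { return $pending; },
    function ($job) use (&$done) { $done[] = $job; }
);

echo "Processed $count job(s)\n";
// The script ends here; cron starts a fresh process on the next run.
```

Because every run is a new process, any memory the script allocates is released when it exits.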
So, how does cron work? That depends on your system environment. I'll discuss only the plain old UNIX command-line version of cron; ask your system administrator how to set it up for your own web application.
Here is a simple cron configuration that runs a PHP script at 11pm every night:
0 23 * * * jack /usr/bin/php /users/home/jack/myscript.php
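The first five fields define when the script should be started. Next comes the user name under which the script should run. The rest of the line is the command line to execute. The time fields are minute, hour, day of month, month, and day of week. Note that the user-name field belongs to the system-wide crontab format (for example, /etc/crontab); per-user crontabs edited with crontab -e omit it. Here are a few examples.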
Command:
15 * * * * jack /usr/bin/php /users/home/jack/myscript.php
Runs the script at minute 15 of every hour.
Command:
15,45 * * * * jack /usr/bin/php /users/home/jack/myscript.php
Runs the script at minutes 15 and 45 of every hour.
Command:
*/1 3-23 * * * jack /usr/bin/php /users/home/jack/myscript.php
Runs the script every minute from 3:00 a.m. through 11:59 p.m.
Command:
30 23 * * 6 jack /usr/bin/php /users/home/jack/myscript.php
Runs the script every Saturday night at 11:30 (Saturday is specified by 6).
As you can see, the number of combinations is unlimited. You can control when the script is run as needed. You can also specify multiple scripts to run, so that some scripts can be run every minute, while other scripts (such as backup scripts) can be run only once a day.
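To make the field layout concrete, here is a small sketch that splits one of the lines above into its parts. The function name parse_cron_line() is made up for this illustration; it only separates the fields and does not validate or evaluate the schedule.

```php
<?php
// Split one system-crontab line into its seven parts. The limit of 7 keeps
// the command (which itself contains spaces) together in the last element.
function parse_cron_line(string $line): array
{
    $parts = preg_split('/\s+/', trim($line), 7);
    if (count($parts) < 7) {
        throw new InvalidArgumentException("Expected 7 fields: $line");
    }
    return array(
        'minute'       => $parts[0],
        'hour'         => $parts[1],
        'day_of_month' => $parts[2],
        'month'        => $parts[3],
        'day_of_week'  => $parts[4],
        'user'         => $parts[5],
        'command'      => $parts[6],
    );
}

$entry = parse_cron_line('30 23 * * 6 jack /usr/bin/php /users/home/jack/myscript.php');
echo $entry['minute'], ' ', $entry['hour'], ' ', $entry['day_of_week'], "\n";
echo $entry['user'], ': ', $entry['command'], "\n";
```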
In order to specify which email address to send reported errors to, you can use the MAILTO directive, as follows:
MAILTO=jherr@pobox.com
Note: For Microsoft® Windows® users, there is an equivalent Scheduled Tasks system that can be used to launch command line processes (such as PHP scripts) at regular intervals.
Basics of Batch Architecture
Batch processing is fairly simple. In most cases, one of two workflows is used. The first workflow is for reporting: the script runs once a day, generates the report, and sends it to a group of users. The second workflow is a batch job created in response to some kind of request. For example, I log in to the web application and ask it to send a message to all users registered in the system, telling them about a new feature. This operation must be batched because there are 10,000 users in the system. PHP takes a while to complete such a task, so it must be performed by a job outside the browser request.
In the second workflow, the web application simply puts the information somewhere the batch application can pick it up. This information specifies the nature of the operation (for example, "Send this e-mail to all the people on the system"). The batch program runs the job and then deletes it. Alternatively, the handler marks the job as completed. Regardless of the method used, the job should be recorded as completed so that it is not run again.
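The "mark as completed" variant can be sketched as follows. The status field, the job names, and process_pending() are assumptions made for this illustration, and the in-memory array stands in for a database table or a set of files.

```php
<?php
// Illustrative in-memory job list; a real system would keep this in a
// database table or in files. The 'status' field is an assumption.
$jobs = array(
    array('id' => 1, 'task' => 'send_announcement', 'status' => 'pending'),
    array('id' => 2, 'task' => 'build_report',      'status' => 'done'),
);

// Run every pending job once and mark it done so a later run skips it.
function process_pending(array &$jobs, callable $runJob): int
{
    $ran = 0;
    foreach ($jobs as &$job) {
        if ($job['status'] !== 'pending') {
            continue;
        }
        $runJob($job);
        $job['status'] = 'done';
        $ran++;
    }
    unset($job); // break the reference left by foreach
    return $ran;
}

$log = array();
$runJob = function (array $job) use (&$log) { $log[] = $job['task']; };

$first  = process_pending($jobs, $runJob);
$second = process_pending($jobs, $runJob);
echo "first run: $first, second run: $second\n";
```

The second call finds nothing left to do, which is the property that matters: running the batch again does not repeat finished work.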
The remainder of this article demonstrates various methods of sharing data between a web application front end and a batch backend.
Mail Queue
The first method is to use a dedicated mail queue system. In this model, a table in the database contains the email messages that should be sent to various users. The web interface uses the Mailouts class to add emails to the queue. The email handler uses the same class to retrieve unprocessed emails and then to remove each one from the queue once it has been sent.
This model first requires a MySQL schema.
Listing 1. mailout.sql
DROP TABLE IF EXISTS mailouts;
CREATE TABLE mailouts (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  from_address TEXT NOT NULL,
  to_address TEXT NOT NULL,
  subject TEXT NOT NULL,
  content TEXT NOT NULL,
  PRIMARY KEY (id)
);
This schema is very simple. Each row has a from address and a to address, as well as the subject and content of the email.
The PHP Mailouts class wraps the mailouts table in the database.
Listing 2. mailouts.php
getMessage()); }
    return $db;
  }

  public static function delete( $id )
  {
    $db = Mailouts::get_db();
    $sth = $db->prepare( 'DELETE FROM mailouts WHERE id=?' );
    $db->execute( $sth, $id );
    return true;
  }

  public static function add( $from, $to, $subject, $content )
  {
    $db = Mailouts::get_db();
    $sth = $db->prepare( 'INSERT INTO mailouts VALUES (null,?,?,?,?)' );
    $db->execute( $sth, array( $from, $to, $subject, $content ) );
    return true;
  }

  public static function get_all()
  {
    $db = Mailouts::get_db();
    $res = $db->query( "SELECT * FROM mailouts" );
    $rows = array();
    while( $res->fetchInto( $row ) ) { $rows []= $row; }
    return $rows;
  }
}
?>
This script includes the Pear::DB database access class and then defines the Mailouts class, which contains three main static functions: add(), delete(), and get_all(). The add() method adds an email to the queue; the front end uses it. The get_all() method returns all rows from the table. The delete() method deletes a single email.
You may ask why I don't just call a delete_all() method at the end of the script. There are two reasons. First, if you delete each message right after it is sent, a message is unlikely to be sent twice even if the script has to be re-run after a failure. Second, new messages may be added between the start and the completion of the batch job, and a blanket delete would discard them unsent.
The next step is to write a simple test script that adds an entry to the queue.
Listing 3. mailout_test_add.php
In this example, I add a mailout addressed to Molly at a company, with the subject "Test Subject" and a short email body. You can run this script on the command line: php mailout_test_add.php.
To send the email, another script is required. This script acts as the job handler.
Listing 4. mailout_send.php
This script uses the get_all() method to retrieve all email messages and then uses PHP's mail() function to send them one by one. After each email is sent successfully, the delete() method is called to remove the corresponding record from the queue.
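The send-then-delete loop described here can be sketched without a database. The following is not the original listing: QueueSketch is an in-memory stand-in for the Mailouts class, send_queued() is a hypothetical name, the example.com addresses are placeholders, and the sender is passed in as a callable so the sketch runs without a mail server.

```php
<?php
// In-memory stand-in for the mailouts table, for illustration only.
class QueueSketch
{
    private static $rows = array();
    private static $nextId = 1;

    public static function add($from, $to, $subject, $content)
    {
        $id = self::$nextId++;
        self::$rows[$id] = array($id, $from, $to, $subject, $content);
        return true;
    }

    public static function get_all()
    {
        return array_values(self::$rows);
    }

    public static function delete($id)
    {
        unset(self::$rows[$id]);
        return true;
    }
}

// Send every queued message, deleting each one only after the sender
// reports success. $send is a callable so that mail() can be swapped out.
function send_queued(callable $send)
{
    $sent = 0;
    foreach (QueueSketch::get_all() as $row) {
        list($id, $from, $to, $subject, $content) = $row;
        if ($send($to, $subject, $content, "From: $from")) {
            QueueSketch::delete($id);
            $sent++;
        }
    }
    return $sent;
}

QueueSketch::add('sender@example.com', 'ok@example.com', 'Test Subject', 'Hello');
QueueSketch::add('sender@example.com', 'fail@example.com', 'Test Subject', 'Hello');

// Fake sender that fails for one address; a real handler could pass 'mail'.
$fakeSend = function ($to, $subject, $body, $headers) {
    return $to !== 'fail@example.com';
};

$sent = send_queued($fakeSend);
$left = count(QueueSketch::get_all());
echo "sent: $sent, still queued: $left\n";
```

The message that failed stays in the queue, so the next cron run gets another chance to send it.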
Use the cron daemon to run this script periodically. How often you run this script depends on the needs of your application.
Note: PEAR (the PHP Extension and Application Repository) contains an excellent mail queue implementation, Mail_Queue, that is free to download.
A more general approach
Specialized solutions for sending emails are great, but is there a more general approach? We need to be able to send emails, generate reports, or perform other time-consuming processing without having to wait in the browser for the processing to complete.
For this, you can take advantage of the fact that PHP is an interpreted language. PHP code can be stored in a queue in the database and executed later. This requires two tables, see Listing 5.
Listing 5. generic.sql
DROP TABLE IF EXISTS processing_items;
CREATE TABLE processing_items (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  function TEXT NOT NULL,
  PRIMARY KEY (id)
);

DROP TABLE IF EXISTS processing_args;
CREATE TABLE processing_args (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  item_id MEDIUMINT NOT NULL,
  key_name TEXT NOT NULL,
  value TEXT NOT NULL,
  PRIMARY KEY (id)
);
The first table, processing_items, contains the name of the function that the job handler calls. The second table, processing_args, contains the arguments to be sent to the function, in the form of a hash table of key/value pairs.
Like the mailouts table, these two tables are also wrapped by a PHP class called ProcessingItems.
Listing 6. generic.php
prepare( 'DELETE FROM processing_args WHERE item_id=?' );
    $db->execute( $sth, $id );
    $sth = $db->prepare( 'DELETE FROM processing_items WHERE id=?' );
    $db->execute( $sth, $id );
    return true;
  }

  public static function add( $function, $args )
  {
    $db = ProcessingItems::get_db();
    $sth = $db->prepare( 'INSERT INTO processing_items VALUES (null,?)' );
    $db->execute( $sth, array( $function ) );
    $res = $db->query( "SELECT last_insert_id()" );
    $id = null;
    while( $res->fetchInto( $row ) ) { $id = $row[0]; }
    foreach( $args as $key => $value )
    {
      $sth = $db->prepare( 'INSERT INTO processing_args VALUES (null,?,?,?)' );
      $db->execute( $sth, array( $id, $key, $value ) );
    }
    return true;
  }

  public static function get_all()
  {
    $db = ProcessingItems::get_db();
    $res = $db->query( "SELECT * FROM processing_items" );
    $rows = array();
    while( $res->fetchInto( $row ) )
    {
      $item = array();
      $item['id'] = $row[0];
      $item['function'] = $row[1];
      $item['args'] = array();
      $ares = $db->query( "SELECT key_name, value FROM processing_args WHERE item_id=?", $item['id'] );
      while( $ares->fetchInto( $arow ) )
        $item['args'][ $arow[0] ] = $arow[1];
      $rows []= $item;
    }
    return $rows;
  }
}
?>
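This class contains three important methods: add(), get_all(), and delete(). As with the mailouts system, the front end uses add(), and the processing engine uses get_all() and delete().
The test script shown in Listing 7 adds an entry to the processing queue.
Listing 7. generic_test_add.php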
<?php
require_once 'generic.php';
ProcessingItems::add( 'printvalue', array( 'value' => 'foo' ) );
?>
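In this example, a call to the printvalue function is added, with the value argument set to foo. I run this script with the PHP command-line interpreter, which puts the function call into the queue. The following processing script then runs it.
Listing 8. generic_process.php
This script is very simple. It takes the processing items returned by get_all() and then uses call_user_func_array (a PHP built-in function) to call the function dynamically with the given arguments. In this example, the local printvalue function is called.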
To demonstrate this functionality, let’s look at what happens on the command line:
% php generic_test_add.php
% php generic_process.php
Printing: foo
%
The output isn't much, but you can see the gist. Through this mechanism, the processing of any PHP function can be deferred.
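The dispatch step can be sketched on its own, without the database. This is an illustration and not the original processing script: the item array is built by hand in the shape that get_all() returns, and the is_callable() check and the array_values() call are additions. The latter matters on PHP 8 and later, where call_user_func_array() interprets string keys as parameter names.

```php
<?php
// The job function named in the queue entry.
function printvalue($value)
{
    echo "Printing: $value\n";
}

// Shape of one item as get_all() returns it; built by hand here so the
// sketch runs without a database.
$items = array(
    array('id' => 1, 'function' => 'printvalue', 'args' => array('value' => 'foo')),
);

foreach ($items as $item) {
    if (!is_callable($item['function'])) {
        continue; // unknown function name: skip it
    }
    // array_values() drops the string keys so arguments are passed by
    // position; PHP 8 would otherwise treat the keys as parameter names.
    call_user_func_array($item['function'], array_values($item['args']));
}
```

Passing arguments by position means the order of rows in processing_args becomes significant, which is worth keeping in mind if a function takes more than one argument.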
If you don't like putting PHP function names and parameters into the database, another approach is to keep a mapping in the PHP code between a "processing job type" name stored in the database and the actual PHP function that handles it. This way, if you later decide to change the PHP back end, the system still works as long as the "processing job type" string matches.
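A minimal sketch of such a mapping follows. The job type names, the handler functions, and dispatch() are all invented for this illustration; the point is only that the queue stores a type string while the code decides which function runs.

```php
<?php
// Handlers live in code; the queue stores only the job type string.
function send_welcome(array $args)
{
    return 'welcome sent to ' . $args['to'];
}

function build_report(array $args)
{
    return 'report built for ' . $args['month'];
}

// Hypothetical job type names mapped to the functions that handle them.
$handlers = array(
    'welcome_email'  => 'send_welcome',
    'monthly_report' => 'build_report',
);

function dispatch(array $handlers, $type, array $args)
{
    if (!isset($handlers[$type])) {
        throw new InvalidArgumentException("Unknown job type: $type");
    }
    return call_user_func($handlers[$type], $args);
}

echo dispatch($handlers, 'monthly_report', array('month' => '2024-01')), "\n";
```

A side benefit of this design is that only functions listed in the map can ever be invoked, whatever ends up in the queue.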
Ditch the database
Finally, I demonstrate a slightly different solution that stores batch jobs as files in a directory instead of using a database. I am not suggesting that you adopt this method in place of a database; it is simply an alternative, and it is up to you to decide whether to use it.
Obviously, there is no schema in this solution since we are not using a database. So first write a class that contains add(), get_all(), and delete() methods similar to the previous example.
Listing 9. batch_by_file.php
    foreach( $args as $k => $v )
    {
      // fwrite() instead of fprintf(): the data must not be used as a format string
      fwrite( $fh, $k.":".$v."\n" );
    }
    fclose( $fh );
    return true;
  }

  public static function get_all()
  {
    $rows = array();
    if (is_dir(BATCH_DIRECTORY)) {
      if ($dh = opendir(BATCH_DIRECTORY)) {
        while (($file = readdir($dh)) !== false) {
          $path = BATCH_DIRECTORY.$file;
          if ( is_dir( $path ) == false ) {
            $item = array();
            $item['id'] = $path;
            $fh = fopen( $path, 'r' );
            if ( $fh ) {
              $item['function'] = trim(fgets( $fh ));
              $item['args'] = array();
              while( ( $line = fgets( $fh ) ) !== false ) {
                // explode() replaces split(), which was removed in PHP 7;
                // the limit of 2 keeps any ':' inside the value intact
                $args = explode( ':', trim($line), 2 );
                $item['args'][$args[0]] = $args[1];
              }
              $rows []= $item;
              fclose( $fh );
            }
          }
        }
        closedir($dh);
      }
    }
    return $rows;
  }
}
?>
The BatchFiles class has three main methods: add(), get_all(), and delete(). This class does not access a database; instead, it reads and writes files in the batch_items directory.
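For reference, here is a compact, self-contained sketch of the same idea with all three methods present. It is not the original BatchFiles class: FileQueueSketch, its instance methods, the temporary directory, and the one-file-per-job naming scheme are assumptions chosen so the example runs anywhere.

```php
<?php
// Illustrative file-backed queue: one file per job, the function name on the
// first line and one "key:value" argument per following line.
class FileQueueSketch
{
    private $dir;

    public function __construct($dir)
    {
        $this->dir = rtrim($dir, '/') . '/';
        if (!is_dir($this->dir)) {
            mkdir($this->dir, 0700, true);
        }
    }

    public function add($function, array $args)
    {
        $lines = array($function);
        foreach ($args as $k => $v) {
            $lines[] = $k . ':' . $v;
        }
        $path = $this->dir . uniqid('job_', true);
        return file_put_contents($path, implode("\n", $lines) . "\n") !== false;
    }

    public function get_all()
    {
        $rows = array();
        foreach (scandir($this->dir) as $file) {
            $path = $this->dir . $file;
            if (is_dir($path)) {
                continue; // skips "." and ".."
            }
            $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            $item = array('id' => $path, 'function' => array_shift($lines), 'args' => array());
            foreach ($lines as $line) {
                list($k, $v) = explode(':', $line, 2);
                $item['args'][$k] = $v;
            }
            $rows[] = $item;
        }
        return $rows;
    }

    public function delete($id)
    {
        return unlink($id);
    }
}

$dir = sys_get_temp_dir() . '/batch_items_' . uniqid();
$queue = new FileQueueSketch($dir);
$queue->add('printvalue', array('value' => 'foo'));

$items = $queue->get_all();
echo $items[0]['function'], ' ', $items[0]['args']['value'], "\n";

foreach ($items as $item) {
    $queue->delete($item['id']);
}
$remaining = count($queue->get_all());
rmdir($dir); // clean up the temporary directory
```

Using the file path as the job id, as the article's class does, keeps delete() trivial: removing the file removes the job.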
Use the following test code to add a new batch entry.
Listing 10. batch_by_file_test_add.php
<?php
require_once 'batch_by_file.php';
BatchFiles::add( 'printvalue', array( 'value' => 'foo' ) );
?>
One thing to note: apart from the class name (BatchFiles), there is actually no indication of how the jobs are stored. Therefore, it is easy to change it to database-style storage in the future without modifying the interface.
Finally, here is the handler code.
Listing 11. batch_by_file_processor.php
This code is almost identical to the database version, except that the file name and class name have been modified.
Conclusion
As mentioned earlier, the server provides a lot of support for threads and can perform background batch processing. In some cases, it's definitely easier to use a worker thread to handle small jobs. However, batch jobs can also be created in PHP applications using traditional tools (cron, MySQL, standard object-oriented PHP and Pear::DB), which are easy to implement, deploy and maintain.
References
Learning
You can refer to the original English text of this article on the developerWorks global site.
About the author
Jack D. Herrington is a senior software engineer with more than 20 years of work experience. He is the author of three books: Code Generation in Action, Podcasting Hacks, and PHP Hacks, and more than 30 articles.