PHP and shell large file data statistics and sorting methods
This section covers how to sort and count big data using shell and PHP. A typical problem: given a 4 GB file, how do you find the most frequently occurring numbers on a machine with only 1 GB of memory, assuming each line holds one number (a QQ number, for example)?

If the file were only a few megabytes, or even a few dozen megabytes, the simplest approach would be to read it directly and run the statistics. But this is a 4 GB file, and in practice it might be tens or even hundreds of gigabytes, so reading it in one pass is out of the question, and PHP alone cannot cope with a file of that size either.

My idea is that, however big the file is, it must first be cut into small files the machine can tolerate; those small files can then be analyzed and counted in batches or one after another, and the partial results are finally merged to produce the answer. This resembles the popular MapReduce model, whose core ideas are "Map" and "Reduce" plus distributed file processing; of course, the only part I really understand and use here is the Reduce-style summarization.

Suppose there is a file with 1 billion lines, each line a QQ number of 6 to 10 digits. The task is to find the 10 most frequently repeated numbers among those 1 billion QQ numbers. A PHP script is used to generate the test file. Purely random numbers may well contain few duplicates, but assume for this exercise that duplicates do occur.
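Such a generator only takes a few lines of PHP. A minimal sketch, assuming the output file is named qq.txt and lines are flushed in batches to keep memory low (not necessarily the exact script the article used), might look like this:

    <?php
    // Minimal sketch of a generator (output file name, line count and batch
    // size are assumptions): write random 6-to-10-digit numbers, one per
    // line, flushing in batches so memory use stays low.
    $fp = fopen('qq.txt', 'w');
    $total = 1000000000;   // 1 billion lines
    $batch = 100000;       // lines buffered before each write
    $buf = '';
    for ($i = 1; $i <= $total; $i++) {
        $len = mt_rand(6, 10);              // number length: 6 to 10 digits
        $num = (string) mt_rand(1, 9);      // leading digit, never zero
        for ($d = 1; $d < $len; $d++) {
            $num .= mt_rand(0, 9);
        }
        $buf .= $num . "\n";
        if ($i % $batch === 0) {
            fwrite($fp, $buf);
            $buf = '';
        }
    }
    fwrite($fp, $buf);
    fclose($fp);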
Generating the file takes quite a long time; running the PHP script directly with the PHP command-line client under Linux saves time, and of course other methods can be used to generate the data. The generated file is about 11 GB.

Then use the Linux split command to cut the file, at 1 million rows of data per file:

    split -l 1000000 -a 3 qq.txt qqfile

qq.txt is divided into 1000 files named qqfileaaa through qqfilebml, each about 11 MB in size. Any processing method becomes fairly simple at this point. PHP can then be used for the analysis and statistics.
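A minimal sketch of that per-file counting step, assuming the split chunks sit in the current directory under the qqfile prefix and each chunk's top 10 is appended to a hypothetical intermediate file top10.txt:

    <?php
    // Minimal sketch of the per-file counting step (file names and the
    // intermediate result file top10.txt are assumptions): count how often
    // each number appears in every split chunk and keep each chunk's top 10.
    $out = fopen('top10.txt', 'a');
    foreach (glob('qqfile*') as $file) {
        $count = array();
        $fp = fopen($file, 'r');
        while (($line = fgets($fp)) !== false) {
            $qq = trim($line);
            if ($qq === '') {
                continue;
            }
            if (!isset($count[$qq])) {
                $count[$qq] = 0;
            }
            $count[$qq]++;
        }
        fclose($fp);
        arsort($count);   // highest frequency first
        foreach (array_slice($count, 0, 10, true) as $qq => $n) {
            fwrite($out, $qq . "\t" . $n . "\n");
        }
    }
    fclose($out);

Each chunk holds only a million lines, so the counting array stays comfortably within the 1 GB memory budget.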
This takes the top 10 of each chunk and finally puts them together for a combined count. Note the flaw: a number that ranks, say, 11th in every chunk could still belong in the overall top 10, so the follow-up statistical step needs to be improved (one possible fix is sketched at the end of this article).

Some people may suggest doing the sorting with the Linux awk and sort commands. I tried it: that works for a small file, but for an 11 GB file neither the memory nor the time is bearable. The awk + sort one-liner:

    awk -F '\@' '{name[$1]++ } END {for (count in name) print name[count],count}' qq.txt | sort -n > 123.txt

Whether it is large-file processing or full-blown big data, the demand is huge.
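One common way to close the gap mentioned above, which is not the split -l approach used in this article but a natural refinement, is to partition by the value itself rather than by line count, so that every occurrence of a given QQ number lands in the same chunk and the per-chunk top-10 lists can be merged safely. A rough PHP sketch, where the bucket count of 1000 and the part_*.txt file names are assumptions:

    <?php
    // Rough sketch of value-based partitioning (an alternative to split -l):
    // hashing each number means all occurrences of the same QQ number end up
    // in the same bucket file, so per-bucket top-10 lists merge safely.
    // The bucket count and part_*.txt names are assumptions.
    $buckets = 1000;       // keep each bucket small enough to count in memory
    $handles = array();
    $in = fopen('qq.txt', 'r');
    while (($line = fgets($in)) !== false) {
        $qq = trim($line);
        if ($qq === '') {
            continue;
        }
        $b = crc32($qq) % $buckets;   // crc32() is non-negative on 64-bit PHP
        if (!isset($handles[$b])) {
            // Note: 1000 open handles may require raising the open-file limit.
            $handles[$b] = fopen(sprintf('part_%03d.txt', $b), 'a');
        }
        fwrite($handles[$b], $qq . "\n");
    }
    fclose($in);
    foreach ($handles as $fp) {
        fclose($fp);
    }

The counting script shown earlier can then be run over the part_*.txt files instead of the qqfile chunks, and the overall answer is simply the top 10 of the combined per-bucket results.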