


(Recommended tutorial: PHP video tutorial)
1. Introduction
Given two a and b There are files with x and y rows of data respectively. (x and y are both greater than 1 billion). The machine memory is limited to 100M. How to find the same records?
2. Idea
- The difficulty in dealing with this problem is mainly that it is impossible to read this massive data into the memory at one time.
- If it cannot be read into the memory at one time, can it be considered multiple times? If it is possible, how can we calculate the same value after reading it multiple times?
- We can use divide and conquer thinking to reduce the big to the small. If the values of the same string are equal after hashing, then we can consider using hash modulo to disperse the records into n files. How to get this n? PHP has 100M memory, and the array can store approximately 1 million data. So considering that records a and b only have 1 billion rows, n must be at least greater than 200.
- There are 200 files at this time. The same records must be in the same file, and each file can be read into the memory. Then you can find the same records in these 200 files in sequence, and then output them to the same file. The final result is the same records in the two files a and b.
- It is very simple to find the same record in a small file. Just use each row of records as the key of the hash table and count the number of occurrences of the key >= 2.
3. Practical operation
1 billion files are too big. Practical operation is a waste of time. Just achieve the practical purpose.
The problem size is reduced to: 1M memory limit, a and b each have 100,000 rows of records. The memory limit can be limited by PHP's ini_set('memory_limit', '1M');
.
4. Generate test file
Generate random numbers to fill the file:
/** * 生成随机数填充文件 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $filename 输出文件名 * @param int $batch 按多少批次生成数据 * @param int $batchSize 每批数据的大小 */ function generate(string $filename, int $batch=1000, int $batchSize=10000) { for ($i=0; $i<$batch; $i++) { $str = ''; for ($j=0; $j<$batchSize; $j++) { $str .= rand($batch, $batchSize) . PHP_EOL; // 生成随机数 } file_put_contents($filename, $str, FILE_APPEND); // 追加模式写入文件 } } generate('a.txt', 10); generate('b.txt', 10);
5. Split the file
Change a.txt
, b.txt
Split into n files by hash modulus.
/** * 用hash取模方式将文件分散到n个文件中 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $filename 输入文件名 * @param int $mod 按mod取模 * @param string $dir 文件输出目录 */ function spiltFile(string $filename, int $mod=20, string $dir='files') { if (!is_dir($dir)){ mkdir($dir); } $fp = fopen($filename, 'r'); while (!feof($fp)){ $line = fgets($fp); $n = crc32(hash('md5', $line)) % $mod; // hash取模 $filepath = $dir . '/' . $n . '.txt'; // 文件输出路径 file_put_contents($filepath, $line, FILE_APPEND); // 追加模式写入文件 } fclose($fp); } spiltFile('a.txt'); spiltFile('b.txt');
Execute the splitFile
function and get the following picture files
20 files in the directory.
6. Find duplicate records
Now we need to find the same records in 20 files. In fact, we need to find the same records in one file and operate each 20 times.
Find the same record in a file:
/** * 查找一个文件中相同的记录输出到指定文件中 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $inputFilename 输入文件路径 * @param string $outputFilename 输出文件路径 */ function search(string $inputFilename, $outputFilename='output.txt') { $table = []; $fp = fopen($inputFilename, 'r'); while (!feof($fp)) { $line = fgets($fp); !isset($table[$line]) ? $table[$line] = 1 : $table[$line]++; // 未设置的值设1,否则自增 } fclose($fp); foreach ($table as $line => $count) { if ($count >= 2){ // 出现大于2次的则是相同的记录,输出到指定文件中 file_put_contents($outputFilename, $line, FILE_APPEND); } } }
Find the same record in all files:
/** * 从给定目录下文件中分别找出相同记录输出到指定文件中 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $dirs 指定目录 * @param string $outputFilename 输出文件路径 */ function searchAll($dirs='files', $outputFilename='output.txt') { $files = scandir($dirs); foreach ($files as $file) { $filepath = $dirs . '/' . $file; if (is_file($filepath)){ search($filepath, $outputFilename); } } }
The space problem of large file processing has been solved here, so the time problem How to deal with it? A single machine can handle it by utilizing the multi-core of the CPU. If it is not enough, it can be handled by multiple servers.
7. Complete code
(recommended tutorial: PHP video tutorial)
The above is the detailed content of Detailed example of how PHP finds the same records in two large files. For more information, please follow other related articles on the PHP Chinese website!

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.

PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.

PHP remains important in the modernization process because it supports a large number of websites and applications and adapts to development needs through frameworks. 1.PHP7 improves performance and introduces new features. 2. Modern frameworks such as Laravel, Symfony and CodeIgniter simplify development and improve code quality. 3. Performance optimization and best practices further improve application efficiency.

PHPhassignificantlyimpactedwebdevelopmentandextendsbeyondit.1)ItpowersmajorplatformslikeWordPressandexcelsindatabaseinteractions.2)PHP'sadaptabilityallowsittoscaleforlargeapplicationsusingframeworkslikeLaravel.3)Beyondweb,PHPisusedincommand-linescrip

PHP type prompts to improve code quality and readability. 1) Scalar type tips: Since PHP7.0, basic data types are allowed to be specified in function parameters, such as int, float, etc. 2) Return type prompt: Ensure the consistency of the function return value type. 3) Union type prompt: Since PHP8.0, multiple types are allowed to be specified in function parameters or return values. 4) Nullable type prompt: Allows to include null values and handle functions that may return null values.

In PHP, use the clone keyword to create a copy of the object and customize the cloning behavior through the \_\_clone magic method. 1. Use the clone keyword to make a shallow copy, cloning the object's properties but not the object's properties. 2. The \_\_clone method can deeply copy nested objects to avoid shallow copying problems. 3. Pay attention to avoid circular references and performance problems in cloning, and optimize cloning operations to improve efficiency.

PHP is suitable for web development and content management systems, and Python is suitable for data science, machine learning and automation scripts. 1.PHP performs well in building fast and scalable websites and applications and is commonly used in CMS such as WordPress. 2. Python has performed outstandingly in the fields of data science and machine learning, with rich libraries such as NumPy and TensorFlow.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SublimeText3 English version
Recommended: Win version, supports code prompts!

SublimeText3 Chinese version
Chinese version, very easy to use

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool