search
HomeBackend DevelopmentPHP TutorialDetailed example of how PHP finds the same records in two large files

(Recommended tutorial: PHP video tutorial)

1. Introduction

Given two a and b There are files with x and y rows of data respectively. (x and y are both greater than 1 billion). The machine memory is limited to 100M. How to find the same records?

2. Idea

  • The difficulty in dealing with this problem is mainly that it is impossible to read this massive data into the memory at one time.
  • If it cannot be read into the memory at one time, can it be considered multiple times? If it is possible, how can we calculate the same value after reading it multiple times?
  • We can use divide and conquer thinking to reduce the big to the small. If the values ​​of the same string are equal after hashing, then we can consider using hash modulo to disperse the records into n files. How to get this n? PHP has 100M memory, and the array can store approximately 1 million data. So considering that records a and b only have 1 billion rows, n must be at least greater than 200.
  • There are 200 files at this time. The same records must be in the same file, and each file can be read into the memory. Then you can find the same records in these 200 files in sequence, and then output them to the same file. The final result is the same records in the two files a and b.
  • It is very simple to find the same record in a small file. Just use each row of records as the key of the hash table and count the number of occurrences of the key >= 2.

3. Practical operation

1 billion files are too big. Practical operation is a waste of time. Just achieve the practical purpose.

The problem size is reduced to: 1M memory limit, a and b each have 100,000 rows of records. The memory limit can be limited by PHP's ini_set('memory_limit', '1M');.

4. Generate test file

Generate random numbers to fill the file:

/**
 * 生成随机数填充文件
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $filename 输出文件名
 * @param int $batch 按多少批次生成数据
 * @param int $batchSize 每批数据的大小
 */
function generate(string $filename, int $batch=1000, int $batchSize=10000)
{
    for ($i=0; $i<$batch; $i++) {
        $str = &#39;&#39;;
        for ($j=0; $j<$batchSize; $j++) {
            $str .= rand($batch, $batchSize) . PHP_EOL; // 生成随机数
        }
        file_put_contents($filename, $str, FILE_APPEND);  // 追加模式写入文件
    }
}

generate(&#39;a.txt&#39;, 10);
generate(&#39;b.txt&#39;, 10);

5. Split the file

Change a.txt, b.txt Split into n files by hash modulus.

/**
 * 用hash取模方式将文件分散到n个文件中
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $filename 输入文件名
 * @param int $mod 按mod取模
 * @param string $dir 文件输出目录
 */
function spiltFile(string $filename, int $mod=20, string $dir=&#39;files&#39;)
{
    if (!is_dir($dir)){
        mkdir($dir);
    }

    $fp = fopen($filename, &#39;r&#39;);

    while (!feof($fp)){
        $line = fgets($fp);
        $n = crc32(hash(&#39;md5&#39;, $line)) % $mod; // hash取模
        $filepath = $dir . &#39;/&#39; . $n . &#39;.txt&#39;;  // 文件输出路径
        file_put_contents($filepath, $line, FILE_APPEND); // 追加模式写入文件
    }

    fclose($fp);
}

spiltFile(&#39;a.txt&#39;);
spiltFile(&#39;b.txt&#39;);

Execute the splitFile function and get the following picture files 20 files in the directory.

6. Find duplicate records

Now we need to find the same records in 20 files. In fact, we need to find the same records in one file and operate each 20 times.

Find the same record in a file:

/**
 * 查找一个文件中相同的记录输出到指定文件中
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $inputFilename 输入文件路径
 * @param string $outputFilename 输出文件路径
 */
function search(string $inputFilename, $outputFilename=&#39;output.txt&#39;)
{
    $table = [];
    $fp = fopen($inputFilename, &#39;r&#39;);

    while (!feof($fp))
    {
        $line = fgets($fp);
        !isset($table[$line]) ? $table[$line] = 1 : $table[$line]++; // 未设置的值设1,否则自增
    }

    fclose($fp);

    foreach ($table as $line => $count)
    {
        if ($count >= 2){ // 出现大于2次的则是相同的记录,输出到指定文件中
            file_put_contents($outputFilename, $line, FILE_APPEND);
        }
    }
}

Find the same record in all files:

/**
 * 从给定目录下文件中分别找出相同记录输出到指定文件中
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $dirs 指定目录
 * @param string $outputFilename 输出文件路径
 */
function searchAll($dirs=&#39;files&#39;, $outputFilename=&#39;output.txt&#39;)
{
    $files = scandir($dirs);

    foreach ($files as $file)
    {
        $filepath = $dirs . &#39;/&#39; . $file;
        if (is_file($filepath)){
            search($filepath, $outputFilename);
        }
    }
}

The space problem of large file processing has been solved here, so the time problem How to deal with it? A single machine can handle it by utilizing the multi-core of the CPU. If it is not enough, it can be handled by multiple servers.

7. Complete code

(recommended tutorial: PHP video tutorial)

The above is the detailed content of Detailed example of how PHP finds the same records in two large files. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:脚本之家. If there is any infringement, please contact admin@php.cn delete
PHP and Python: Different Paradigms ExplainedPHP and Python: Different Paradigms ExplainedApr 18, 2025 am 12:26 AM

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

PHP and Python: A Deep Dive into Their HistoryPHP and Python: A Deep Dive into Their HistoryApr 18, 2025 am 12:25 AM

PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.

Choosing Between PHP and Python: A GuideChoosing Between PHP and Python: A GuideApr 18, 2025 am 12:24 AM

PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.

PHP and Frameworks: Modernizing the LanguagePHP and Frameworks: Modernizing the LanguageApr 18, 2025 am 12:14 AM

PHP remains important in the modernization process because it supports a large number of websites and applications and adapts to development needs through frameworks. 1.PHP7 improves performance and introduces new features. 2. Modern frameworks such as Laravel, Symfony and CodeIgniter simplify development and improve code quality. 3. Performance optimization and best practices further improve application efficiency.

PHP's Impact: Web Development and BeyondPHP's Impact: Web Development and BeyondApr 18, 2025 am 12:10 AM

PHPhassignificantlyimpactedwebdevelopmentandextendsbeyondit.1)ItpowersmajorplatformslikeWordPressandexcelsindatabaseinteractions.2)PHP'sadaptabilityallowsittoscaleforlargeapplicationsusingframeworkslikeLaravel.3)Beyondweb,PHPisusedincommand-linescrip

How does PHP type hinting work, including scalar types, return types, union types, and nullable types?How does PHP type hinting work, including scalar types, return types, union types, and nullable types?Apr 17, 2025 am 12:25 AM

PHP type prompts to improve code quality and readability. 1) Scalar type tips: Since PHP7.0, basic data types are allowed to be specified in function parameters, such as int, float, etc. 2) Return type prompt: Ensure the consistency of the function return value type. 3) Union type prompt: Since PHP8.0, multiple types are allowed to be specified in function parameters or return values. 4) Nullable type prompt: Allows to include null values ​​and handle functions that may return null values.

How does PHP handle object cloning (clone keyword) and the __clone magic method?How does PHP handle object cloning (clone keyword) and the __clone magic method?Apr 17, 2025 am 12:24 AM

In PHP, use the clone keyword to create a copy of the object and customize the cloning behavior through the \_\_clone magic method. 1. Use the clone keyword to make a shallow copy, cloning the object's properties but not the object's properties. 2. The \_\_clone method can deeply copy nested objects to avoid shallow copying problems. 3. Pay attention to avoid circular references and performance problems in cloning, and optimize cloning operations to improve efficiency.

PHP vs. Python: Use Cases and ApplicationsPHP vs. Python: Use Cases and ApplicationsApr 17, 2025 am 12:23 AM

PHP is suitable for web development and content management systems, and Python is suitable for data science, machine learning and automation scripts. 1.PHP performs well in building fast and scalable websites and applications and is commonly used in CMS such as WordPress. 2. Python has performed outstandingly in the fields of data science and machine learning, with rich libraries such as NumPy and TensorFlow.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
Will R.E.P.O. Have Crossplay?
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SublimeText3 English version

SublimeText3 English version

Recommended: Win version, supports code prompts!

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

PhpStorm Mac version

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool