


Simple implementation of sorting and deduplication of large files
有一道校招生的面试题,是要给一个很大的文件(不能全部放内存,比如1T)按行来排序和去重。
一种简单解决方案就是分而治之,先打大文件分词大小均匀的若干个小文件,然后对小文件排好序,最后再Merge所有的小文件,在Merge的过程中去掉重复的内容。
在Linux下实现这个逻辑甚至不用自己写代码,只要用shell内置的一些命令: split, sort就足够了。我们把这个流程用脚本串起来,写到shell脚本文件里。文件名叫sort_uniq.sh.
#!/bin/bash lines=$(wc -l $1 | sed 's/ .*//g') lines_per_file=`expr $lines / 20` split -d -l $lines_per_file $1 __part_$1 for file in __part_* do { sort $file > sort_$file } & done wait sort -smu sort_* > $2 rm -f __part_* rm -f sort_*
使用方法:./sort_uniq.sh file_to_be_sort file_sorted
这段代码把大文件分词20或21个小文件,后台并行排序各个小文件,最后合并结果并去重。
如果只要去重,不需要排序,还有另外一种思路:对文件的每一行计算hash值,按照hash值把该行内容放到某个小文件中,假设需要分词100个小文件,则可以按照(hash % 100)来分发文件内容,然后在小文件中实现去重就可以了。
The above is the detailed content of Simple implementation of sorting and deduplication of large files. For more information, please follow other related articles on the PHP Chinese website!

The five core components of the Linux operating system are: 1. Kernel, 2. System libraries, 3. System tools, 4. System services, 5. File system. These components work together to ensure the stable and efficient operation of the system, and together form a powerful and flexible operating system.

The five core elements of Linux are: 1. Kernel, 2. Command line interface, 3. File system, 4. Package management, 5. Community and open source. Together, these elements define the nature and functionality of Linux.

Linux user management and security can be achieved through the following steps: 1. Create users and groups, using commands such as sudouseradd-m-gdevelopers-s/bin/bashjohn. 2. Bulkly create users and set password policies, using the for loop and chpasswd commands. 3. Check and fix common errors, home directory and shell settings. 4. Implement best practices such as strong cryptographic policies, regular audits and the principle of minimum authority. 5. Optimize performance, use sudo and adjust PAM module configuration. Through these methods, users can be effectively managed and system security can be improved.

The core operations of Linux file system and process management include file system management and process control. 1) File system operations include creating, deleting, copying and moving files or directories, using commands such as mkdir, rmdir, cp and mv. 2) Process management involves starting, monitoring and killing processes, using commands such as ./my_script.sh&, top and kill.

Shell scripts are powerful tools for automated execution of commands in Linux systems. 1) The shell script executes commands line by line through the interpreter to process variable substitution and conditional judgment. 2) The basic usage includes backup operations, such as using the tar command to back up the directory. 3) Advanced usage involves the use of functions and case statements to manage services. 4) Debugging skills include using set-x to enable debugging mode and set-e to exit when the command fails. 5) Performance optimization is recommended to avoid subshells, use arrays and optimization loops.

Linux is a Unix-based multi-user, multi-tasking operating system that emphasizes simplicity, modularity and openness. Its core functions include: file system: organized in a tree structure, supports multiple file systems such as ext4, XFS, Btrfs, and use df-T to view file system types. Process management: View the process through the ps command, manage the process using PID, involving priority settings and signal processing. Network configuration: Flexible setting of IP addresses and managing network services, and use sudoipaddradd to configure IP. These features are applied in real-life operations through basic commands and advanced script automation, improving efficiency and reducing errors.

The methods to enter Linux maintenance mode include: 1. Edit the GRUB configuration file, add "single" or "1" parameters and update the GRUB configuration; 2. Edit the startup parameters in the GRUB menu, add "single" or "1". Exit maintenance mode only requires restarting the system. With these steps, you can quickly enter maintenance mode when needed and exit safely, ensuring system stability and security.

The core components of Linux include kernel, shell, file system, process management and memory management. 1) Kernel management system resources, 2) shell provides user interaction interface, 3) file system supports multiple formats, 4) Process management is implemented through system calls such as fork, and 5) memory management uses virtual memory technology.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Dreamweaver CS6
Visual web development tools

WebStorm Mac version
Useful JavaScript development tools

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool

mPDF
mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),
