Find duplicate files using Linux-Linux Operation and Maintenance-php.cn

Home

Operation and Maintenance

Linux Operation and Maintenance

Find duplicate files using Linux

Linux中文社区

Aug 03, 2023 pm 03:51 PM

linux

#Method 1: Use the Find command

This section is an extended usage description of the powerful function of find. On the basis of find, we can combine it with other basic Linux commands (such as the xargs command) to create unlimited command line functions. For example, we can quickly find files in a Linux folder and its subfolders. List of duplicate files. The process to implement this function is relatively simple. Just search and traverse all the files, and then use the command to compare the MD5 of each file.

It sounds abstract, but in fact there is only one command:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

find -not -empty -type f -printf "%sn" means use find The command searches out all non-empty files and prints out their sizes.
sort -rn Needless to say, this command is based on file size. Reverse sort
##uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 means only duplicate lines are printed , used here means to print out files with the same file name
Method 2: Use dupeGuru tool

DupeGuru is a cross-platform application, available in Linux, Windows and Mac OS X versions. It can pass file size, Various standards such as MD5 and file names are used to help users find duplicate files in Linux. Ubuntu users can install it directly by adding the following PPA source:

sudo add-apt-repository ppa:hsoft/ppasudo apt-get updatesudo apt-get install dupeguru*

方法三：使用Find命令解析

在工作生活当中，我们很可能会遇到查找重复文件的问题。比如从某游戏提取的游戏文本有重复的，我们希望找出所有重复的文本，让翻译只翻译其中一份，而其他的直接替换。那么这个问题该怎么做呢？当然方法多种多样，而且无论那种方法应该都不会太难，但笔者第一次遇到这个问题的时候第一反应是是用Linux的Shell脚本，所以文本介绍这种方式。

先上代码：

find -not -empty -type f -printf "%sn" | sort -rn |uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -b 36-

大家先cd到自己想要查找重复文件的文件夹，然后copy上面代码就可以了，系统会对当前文件夹及子文件夹内的所有文件进行查重。

下面分析一下上面的命令。

首先看第一句：

find -not -empty -type f -printf "%sn"

find是查找命令；-not -empty是要寻找非空文件；-type f是指寻找常规文件；-printf “%sn”比较具有迷惑性，这里的%s并非C语言中的输出字符串，它实际表示的是文件的大小，单位为bytes（不懂就man，man一下find，就可以看到了），n是换行符。所以这句话的意思是输出所有非空文件的大小。

搜索公众号GitHub猿后台回复“UML”，获取一份惊喜礼包。

通过管道，上面的结果被传到第二句：

sort -rn

sort是排序，-n是指按大小排序，-r是指从大到小排序（逆序reverse）。

第三句：

uniq -d

uniq是把重复的只输出一次，而-d指只输出重复的部分（如9出现了5次，那么就输出1个9，而2只出现了1次，并非重复出现的数字，故不输出）。

第四句：

xargs -I{} -n1 find -type f -size {}c -print0

这一部分分两部分看，第一部分是xargs -I{} -n1，xargs命令将之前的结果转化为参数，供后面的find调用，其中-I{}是指把参数写成{}，而-n1是指将之前的结果一个一个输入给下一个命令（-n8就是8个8个输入给下一句，不写-n就是把之前的结果一股脑的给下一句）。后半部分是find -type f -size {}c -print0，find指令我们前面见过，-size{}是指找出大小为{}bytes的文件，而-print0则是为了防止文件名里带空格而写的参数。

第五句：

xargs -0 md5sum

xargs我们之前说过，是将前面的结果转化为输入，那么这个-0又是什么意思？man一下xargs，我们看到-0表示读取参数的时候以null为分隔符读取，这也不难理解，毕竟null的二进制表示就是00。后面的md5sum是指计算输入的md5值。

第六句：sort是排序，这个我们前面也见过。

第七句：

uniq -w32 --all-repeated=separate

uniq -w32是指寻找前32个字符相同的行，原因在于md5值一定是32位的，而后面的--all-repeated=separate是指将重复的部分放在一类，分类输出。

第八句：

cut -b 36-

由于我们的结果带着md5值，不是很好看，所以我们截取md5值后面的部分，cut是文本处理函数，这里-b 36-是指只要每行36个字符之后的部分。

我们将上述每个命令用管道链接起来，存入result.txt：

find -not -empty -type f -printf "%sn" | sort -rn |uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -b 36- >result.txt

虽然结果很好看，但是有一个问题，这是在Linux下很好看，实际上如果有朋友把输出文件放到Windows上，就会发现换行全没了，这是由于Linux下的换行是n，而windows要求nr，为了解决这个问题，我们最后执行一条指令，将n转换为nr：

cat result.txt | cut -c 36- | tr -s &#39;n&#39;

The above is the detailed content of Find duplicate files using Linux. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:Linux中文社区. If there is any infringement, please contact admin@php.cn delete

Linux: A Look at Its Fundamental StructureApr 16, 2025 am 12:01 AM

The basic structure of Linux includes the kernel, file system, and shell. 1) Kernel management hardware resources and use uname-r to view the version. 2) The EXT4 file system supports large files and logs and is created using mkfs.ext4. 3) Shell provides command line interaction such as Bash, and lists files using ls-l.

Linux Operations: System Administration and MaintenanceApr 15, 2025 am 12:10 AM

The key steps in Linux system management and maintenance include: 1) Master the basic knowledge, such as file system structure and user management; 2) Carry out system monitoring and resource management, use top, htop and other tools; 3) Use system logs to troubleshoot, use journalctl and other tools; 4) Write automated scripts and task scheduling, use cron tools; 5) implement security management and protection, configure firewalls through iptables; 6) Carry out performance optimization and best practices, adjust kernel parameters and develop good habits.

Understanding Linux's Maintenance Mode: The EssentialsApr 14, 2025 am 12:04 AM

Linux maintenance mode is entered by adding init=/bin/bash or single parameters at startup. 1. Enter maintenance mode: Edit the GRUB menu and add startup parameters. 2. Remount the file system to read and write mode: mount-oremount,rw/. 3. Repair the file system: Use the fsck command, such as fsck/dev/sda1. 4. Back up the data and operate with caution to avoid data loss.

How Debian improves Hadoop data processing speedApr 13, 2025 am 11:54 AM

This article discusses how to improve Hadoop data processing efficiency on Debian systems. Optimization strategies cover hardware upgrades, operating system parameter adjustments, Hadoop configuration modifications, and the use of efficient algorithms and tools. 1. Hardware resource strengthening ensures that all nodes have consistent hardware configurations, especially paying attention to CPU, memory and network equipment performance. Choosing high-performance hardware components is essential to improve overall processing speed. 2. Operating system tunes file descriptors and network connections: Modify the /etc/security/limits.conf file to increase the upper limit of file descriptors and network connections allowed to be opened at the same time by the system. JVM parameter adjustment: Adjust in hadoop-env.sh file

How to learn Debian syslogApr 13, 2025 am 11:51 AM

This guide will guide you to learn how to use Syslog in Debian systems. Syslog is a key service in Linux systems for logging system and application log messages. It helps administrators monitor and analyze system activity to quickly identify and resolve problems. 1. Basic knowledge of Syslog The core functions of Syslog include: centrally collecting and managing log messages; supporting multiple log output formats and target locations (such as files or networks); providing real-time log viewing and filtering functions. 2. Install and configure Syslog (using Rsyslog) The Debian system uses Rsyslog by default. You can install it with the following command: sudoaptupdatesud

How to choose Hadoop version in DebianApr 13, 2025 am 11:48 AM

When choosing a Hadoop version suitable for Debian system, the following key factors need to be considered: 1. Stability and long-term support: For users who pursue stability and security, it is recommended to choose a Debian stable version, such as Debian11 (Bullseye). This version has been fully tested and has a support cycle of up to five years, which can ensure the stable operation of the system. 2. Package update speed: If you need to use the latest Hadoop features and features, you can consider Debian's unstable version (Sid). However, it should be noted that unstable versions may have compatibility issues and stability risks. 3. Community support and resources: Debian has huge community support, which can provide rich documentation and

TigerVNC share file method on DebianApr 13, 2025 am 11:45 AM

This article describes how to use TigerVNC to share files on Debian systems. You need to install the TigerVNC server first and then configure it. 1. Install the TigerVNC server and open the terminal. Update the software package list: sudoaptupdate to install TigerVNC server: sudoaptinstalltigervnc-standalone-servertigervnc-common 2. Configure TigerVNC server to set VNC server password: vncpasswd Start VNC server: vncserver:1-localhostno

Debian mail server firewall configuration tipsApr 13, 2025 am 11:42 AM

Configuring a Debian mail server's firewall is an important step in ensuring server security. The following are several commonly used firewall configuration methods, including the use of iptables and firewalld. Use iptables to configure firewall to install iptables (if not already installed): sudoapt-getupdatesudoapt-getinstalliptablesView current iptables rules: sudoiptables-L configuration

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

2 weeks agoByDDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

1 months agoBy尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Chat Commands and How to Use Them

1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Atom editor mac version download

The most popular open source editor

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.