使用 Python 递归合并 PDF-Python教程-PHP中文网

首页

后端开发

Python教程

使用 Python 递归合并 PDF

Susan Sarandon

Dec 29, 2024 pm 09:16 PM

Merge PDFs Recursively Using Python

介绍

将多个 PDF 文件合并到一个文档中可能是一项繁琐的任务，尤其是当文件分布在多个目录中时。使用 Python，这项任务变得无缝且自动化。在本教程中，我们将使用 PyPDF2 创建一个命令行界面 (CLI) 工具，然后单击合并目录（包括其子目录）中的所有 PDF 文件，同时排除 .venv 和 .git 等特定目录。

先决条件

开始之前，请确保您具备以下条件：

Python：版本 3.7 或更高版本。
pip：Python 的包管理器。
所需的库：
- 安装 PyPDF2 进行 PDF 操作：
```
 pip install PyPDF2
```

安装单击以创建 CLI：
```
 pip install click
```

代码演练

这是我们的 CLI 工具的完整代码：

import click
from pathlib import Path
from PyPDF2 import PdfMerger
import os

EXCLUDED_DIRS = {".venv", ".git"}

@click.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False, path_type=Path))
@click.argument("output_file", type=click.Path(dir_okay=False, writable=True, path_type=Path))
def merge_pdfs(directory: Path, output_file: Path):
    """
    Merge all PDF files from DIRECTORY and its subdirectories into OUTPUT_FILE,
    excluding specified directories like .venv and .git.
    """
    # Initialize the PdfMerger
    merger = PdfMerger()

    # Walk through the directory tree, including the base directory
    for root, dirs, files in os.walk(directory):
        # Exclude specific directories
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]

        # Convert the root to a Path object
        current_dir = Path(root)

        click.echo(f"Processing directory: {current_dir}")

        # Collect PDF files in the current directory
        pdf_files = sorted(current_dir.glob("*.pdf"))

        if not pdf_files:
            click.echo(f"No PDF files found in {current_dir}")
            continue

        # Add PDF files from the current directory
        for pdf in pdf_files:
            click.echo(f"Adding {pdf}...")
            merger.append(str(pdf))

    # Write the merged output file
    output_file.parent.mkdir(parents=True, exist_ok=True)
    merger.write(str(output_file))
    merger.close()

    click.echo(f"All PDFs merged into {output_file}")

if __name__ == "__main__":
    merge_pdfs()

它是如何运作的

目录遍历:
- os.walk()函数递归遍历指定目录。
- 使用目录过滤器排除特定目录（例如 .venv、.git）。
PDF 文件集合:
- current_dir.glob("*.pdf") 收集当前目录下的所有 PDF 文件。
合并 PDF:
- PyPDF2 中的 PdfMerger 用于附加所有 PDF。
- 合并后的输出写入指定文件。
CLI 集成:
- 点击库可以轻松提供目录和输出文件路径作为参数。

运行工具

将代码保存到文件中，例如 merge_pdfs.py。从终端运行它，如下所示：

python merge_pdfs.py /path/to/directory /path/to/output.pdf

例子

假设您有以下目录结构：

/documents
├── file1.pdf
├── subdir1
│   ├── file2.pdf
├── subdir2
│   ├── file3.pdf
├── .git
│   ├── ignored_file.pdf

按如下方式运行该工具：

python merge_pdfs.py /documents /merged.pdf

这会将 file1.pdf、file2.pdf 和 file3.pdf 合并为 merged.pdf，跳过 .git。

特征

递归合并:
- 该工具自动包含所有子目录中的 PDF。
目录排除:
- 排除 .venv 和 .git 等目录以避免不相关的文件。
排序合并:
- 确保 PDF 按排序顺序添加以获得一致的结果。
CLI 简单性:
- 为用户提供直观的界面来指定输入和输出路径。

注意事项和限制

大文件：
- 合并大量 PDF 可能会消耗大量内存。首先使用较小的数据集进行测试。
PDF 兼容性:
- 确保所有输入的 PDF 有效且未损坏。
自定义排除：
- 修改 EXCLUDED_DIRS 设置以根据需要排除其他目录。

结论

本教程演示如何使用 Python 自动合并目录结构中的 PDF。提供的 CLI 工具非常灵活，可以适应更复杂的工作流程。尝试一下，让我们知道它如何为您服务！

编码愉快！？

以上是使用 Python 递归合并 PDF的详细内容。更多信息请关注PHP中文网其他相关文章！

声明

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系admin@php.cn

python中两个列表的串联替代方案是什么？May 09, 2025 am 12:16 AM

可以使用多种方法在Python中连接两个列表：1.使用操作符，简单但在大列表中效率低；2.使用extend方法，效率高但会修改原列表；3.使用 =操作符，兼具效率和可读性；4.使用itertools.chain函数，内存效率高但需额外导入；5.使用列表解析，优雅但可能过于复杂。选择方法应根据代码上下文和需求。

Python：合并两个列表的有效方法May 09, 2025 am 12:15 AM

有多种方法可以合并Python列表：1.使用操作符，简单但对大列表不内存高效；2.使用extend方法，内存高效但会修改原列表；3.使用itertools.chain，适用于大数据集；4.使用*操作符，一行代码合并小到中型列表；5.使用numpy.concatenate，适用于大数据集和性能要求高的场景；6.使用append方法，适用于小列表但效率低。选择方法时需考虑列表大小和应用场景。

编译的与解释的语言：优点和缺点May 09, 2025 am 12:06 AM

CompiledLanguagesOffersPeedAndSecurity，而interneterpretledlanguages provideeaseafuseanDoctability.1）commiledlanguageslikec arefasterandSecureButhOnderDevevelmendeclementCyclesclesclesclesclesclesclesclesclesclesclesclesclesclesclesclesclesclesandentency.2）cransportedeplatectentysenty

Python：对于循环，最完整的指南May 09, 2025 am 12:05 AM

Python中，for循环用于遍历可迭代对象，while循环用于条件满足时重复执行操作。1）for循环示例：遍历列表并打印元素。2）while循环示例：猜数字游戏，直到猜对为止。掌握循环原理和优化技巧可提高代码效率和可靠性。

python concatenate列表到一个字符串中May 09, 2025 am 12:02 AM

要将列表连接成字符串，Python中使用join()方法是最佳选择。1)使用join()方法将列表元素连接成字符串，如''.join(my_list)。2)对于包含数字的列表，先用map(str,numbers)转换为字符串再连接。3)可以使用生成器表达式进行复杂格式化，如','.join(f'({fruit})'forfruitinfruits)。4)处理混合数据类型时，使用map(str,mixed_list)确保所有元素可转换为字符串。5)对于大型列表，使用''.join(large_li

Python的混合方法：编译和解释合并May 08, 2025 am 12:16 AM

pythonuseshybridapprace，ComminingCompilationTobyTecoDeAndInterpretation.1）codeiscompiledtoplatform-Indepententbybytecode.2）bytecodeisisterpretedbybythepbybythepythonvirtualmachine，增强效率和通用性。

了解python的' for”和' then”循环之间的差异May 08, 2025 am 12:11 AM

theKeyDifferencesBetnewpython's“ for”和“ for”和“ loopsare：1）” for“ loopsareIdealForiteringSequenceSquencesSorkNowniterations，而2）”，而“ loopsareBetterforConterContinuingUntilacTientInditionIntionismetismetistismetistwithOutpredefinedInedIterations.un