Home  >  Article  >  Backend Development  >  How Can You Extract Images from PDFs Using Python While Preserving Their Original Resolution?

How Can You Extract Images from PDFs Using Python While Preserving Their Original Resolution?

DDD
DDDOriginal
2024-10-22 07:52:30574browse

How Can You Extract Images from PDFs Using Python While Preserving Their Original Resolution?

Extracting Images from PDFs without Resampling Using Python

To efficiently extract all images from a PDF document while preserving their native resolution and format without resampling, you can utilize the PyMuPDF module. This module provides an effective solution for image extraction, outputting images as PNG files.

Using PyMuPDF:

<code class="python">import fitz

# Open the PDF document
doc = fitz.open("file.pdf")

# Iterate through the pages
for i in range(len(doc)):
    # Extract images from the current page
    for img in doc.getPageImageList(i):
        # Retrieve the image's XREF and create a Pixmap
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)

        # Check if the image is grayscale or RGB
        if pix.n < 5:
            # Save the image in PNG format
            pix.writePNG("p%s-%s.png" % (i, xref))

        # If the image is CMYK, convert it to RGB and save
        else:
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None

        # Release the Pixmaps
        pix = None</code>

Enhancements:

For an updated version of the script that supports fitz 1.19.6:

<code class="python">import os
import fitz
from tqdm import tqdm

# Specify the work directory
workdir = "your_folder"

# Iterate through the PDFs in the directory
for each_path in os.listdir(workdir):
    if ".pdf" in each_path:
        # Open the PDF document
        doc = fitz.Document(os.path.join(workdir, each_path))

        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                # Extract the image and save as PNG
                xref = img[0]
                image = doc.extract_image(xref)
                pix = fitz.Pixmap(doc, xref)
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (each_path[:-4], i, xref)))</code>

This enhanced script provides progress bars for added visibility and saves the extracted images with consistent file naming conventions.

The above is the detailed content of How Can You Extract Images from PDFs Using Python While Preserving Their Original Resolution?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn