How to Extract Text and Images from PDFs using iTextSharp in .NET?

By DDD, 2025-01-06 07:51:41

Extracting PDF Content with iTextSharp in .NET

iTextSharp, the .NET port of the iText 5 library, can read existing PDF documents as well as create them. This article shows how to use it to extract both text and images from a PDF.

Reading Plain Text from PDFs

To read plain text from a PDF using iTextSharp, you can leverage the following code:

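using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;
using System.Text; // required for StringBuilder

public string ReadPdfText(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // PDF page numbers are 1-based
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from earlier pages is not repeated
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                // AppendLine keeps the last word of one page from running into the next page
                text.AppendLine(currentText);
            }
        }
        finally
        {
            // Release the file even if extraction throws
            pdfReader.Close();
        }
    }
    return text.ToString();
}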

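In this example, the ReadPdfText method opens the PDF with a PdfReader, walks the pages from 1 to NumberOfPages, and accumulates the text of each page in a StringBuilder. SimpleTextExtractionStrategy extracts the text of each page in the order in which it appears in the page's content stream. If the file does not exist, the method returns an empty string.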
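Content-stream order does not always match the visual reading order of a page. For such documents, iTextSharp's LocationTextExtractionStrategy, which orders text chunks by their position on the page, may give better results. The following is a minimal sketch under that assumption; the class name PdfTextSketch and the method name ReadPdfPages are hypothetical and chosen only for illustration. It also returns one string per page instead of a single concatenated string.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Hypothetical helper (not part of iTextSharp): returns the text of each page separately.
    public static List<string> ReadPdfPages(string fileName)
    {
        List<string> pages = new List<string>();
        if (!File.Exists(fileName))
        {
            return pages; // empty list when the file is missing
        }

        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Orders text chunks by their position on the page
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return pages;
    }
}
```

Which strategy works better depends on how the PDF was produced, so it is worth comparing the output of both on a representative file.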

Handling Images in PDFs

While the above code focuses on extracting text, iTextSharp also enables you to extract images from PDFs. You can use the following approach:

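using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Drawing;
using System.IO;

public void ReadPdfImages(string fileName)
{
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // One parser can be reused for every page of the same reader
            PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);

            // ImageRenderListener is a class you write yourself that implements IRenderListener
            ImageRenderListener listener = new ImageRenderListener();

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // ProcessContent returns the listener it was given, not a string
                parser.ProcessContent(page, listener);
            }
        }
        finally
        {
            // Release the file even if parsing throws
            pdfReader.Close();
        }
    }
}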

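In this code, a PdfReaderContentParser parses the content of each page and reports what it finds to the listener passed to ProcessContent. Note that ImageRenderListener is not a class that ships with iTextSharp; it stands for a class you write yourself that implements the IRenderListener interface. The parser calls the listener's RenderImage method for each image it encounters, and from the ImageRenderInfo argument the listener can obtain the image, for example as a System.Drawing.Image, which can then be processed further or saved.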
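Since the listener class is not shown above, here is one way it might look. This is a minimal sketch, not the original author's implementation: it assumes iTextSharp 5.x, and the Images property and the choice to silently skip undecodable images are illustrative decisions. It uses a parameterless constructor so that it matches the new ImageRenderListener() call in ReadPdfImages.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using iTextSharp.text.pdf.parser;

// Illustrative listener: collects every image the parser reports.
public class ImageRenderListener : IRenderListener
{
    private readonly List<Image> images = new List<Image>();

    // Images collected so far, in the order the parser encountered them
    public List<Image> Images
    {
        get { return images; }
    }

    // Text callbacks are required by IRenderListener but are not needed here
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    // Called by the parser for each image found on the page
    public void RenderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject imageObject = renderInfo.GetImage();
            if (imageObject == null)
            {
                return;
            }

            Image drawingImage = imageObject.GetDrawingImage();
            if (drawingImage != null)
            {
                images.Add(drawingImage);
            }
        }
        catch (Exception)
        {
            // Assumption: some image encodings cannot be decoded; this sketch skips them
        }
    }
}
```

After the loop over the pages, a caller that keeps a reference to the listener can iterate over listener.Images and, for example, call Save on each one to write it to disk. If the raw bytes are preferred over a decoded bitmap, PdfImageObject also exposes GetImageAsBytes.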
