How to Extract Text with Formatting from PDFs Using iTextSharp?

Mary-Kate Olsen
2025-01-11 10:46:41

Extract formatted text using iTextSharp

Introduction:

iTextSharp is a powerful library for generating and manipulating PDF documents, but its built-in text extraction returns plain text and discards formatting such as font name and font size. This article shows one way to extract text together with its formatting information from a PDF using iTextSharp.

Custom extraction strategy:

To extract formatted text, you can create a custom ITextExtractionStrategy implementation. The strategy defines how each piece of text rendering information is handled as the page is parsed.
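As a rough sketch of what the interface asks for, assuming iTextSharp 5.x, the smallest possible strategy simply concatenates the text of every chunk it receives. `PlainTextStrategy` is a hypothetical name chosen for illustration and is not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical minimal strategy: collects text and ignores all formatting.
public class PlainTextStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    // Called for every run of text the parser encounters on the page.
    public void RenderText(TextRenderInfo renderInfo)
    {
        buffer.Append(renderInfo.GetText());
    }

    // Called once by the extractor to obtain the final result.
    public string GetResultantText()
    {
        return buffer.ToString();
    }

    // Required by the interface but not needed for plain text collection.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

A formatting-aware strategy keeps the same shape and does its extra work inside RenderText.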

Code snippet:

The following code defines a custom strategy that tracks changes in baseline, font name, and font size and generates HTML with appropriate styling:

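<code>public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
    // ... (omitted here; the method below relies on the fields result, lastBaseLine,
    // lastFont and lastFontSize, and on a TextRenderMode enum, being declared in this part)

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
    {
        // Determine the font attributes
        string curFont = renderInfo.GetFont().PostscriptFontName;
        // Fill-then-stroke rendering is treated as simulated bold
        if (renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText)
        {
            curFont += "-Bold";
        }

        // Check for a change in baseline, font, or font size
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        // The font size is approximated as the distance from the baseline to the ascent line
        Single curFontSize = renderInfo.GetAscentLine().GetEndPoint()[Vector.I2] - curBaseline[Vector.I2];
        if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) ||
            (curFontSize != lastFontSize) || (curFont != lastFont))
        {
            // Generate an HTML span with the updated style.
            // The source is truncated from this point on; the rest of the method is an
            // assumed reconstruction based on the description in this article.
            if (this.lastBaseLine != null) result.Append("</span>");
            result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
        }

        // Append the text and remember the current properties for the next call
        result.Append(renderInfo.GetText());
        lastBaseLine = curBaseline;
        lastFontSize = curFontSize;
        lastFont = curFont;
    }

    // ... (the remaining ITextExtractionStrategy members, GetResultantText, BeginTextBlock,
    // EndTextBlock and RenderImage, must also be implemented)
}</code>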

Usage:

To use a custom strategy, you can specify it when extracting text:

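<code>// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader("MyDocument.pdf");
TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
// Page numbers are 1-based
string textWithFormatting = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();</code>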

Output:

The textWithFormatting variable will contain the extracted text with HTML tags reflecting the formatting information, including font and font size.
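To make the shape of that output concrete, here is a minimal, library-free sketch of the same change-tracking idea. `Chunk` and `SpanBuilder` are hypothetical names invented for illustration; they stand in for the values the strategy reads from TextRenderInfo. The line break on a baseline change and the HTML-encoding of the text are assumptions added here and are not taken from the article's code.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical stand-in for one run of text and its rendering properties.
public sealed class Chunk
{
    public readonly string Text;
    public readonly string Font;
    public readonly float Size;
    public readonly float BaselineY;

    public Chunk(string text, string font, float size, float baselineY)
    {
        Text = text; Font = font; Size = size; BaselineY = baselineY;
    }
}

public static class SpanBuilder
{
    // Opens a new span whenever the baseline, font, or size differs from the previous chunk.
    public static string ToHtml(IEnumerable<Chunk> chunks)
    {
        var html = new StringBuilder();
        Chunk last = null;
        foreach (Chunk c in chunks)
        {
            bool changed = last == null || c.BaselineY != last.BaselineY
                           || c.Size != last.Size || c.Font != last.Font;
            if (changed)
            {
                if (last != null) html.Append("</span>");
                // Assumption: a different baseline means a new line of text.
                if (last != null && c.BaselineY != last.BaselineY) html.Append("<br />");
                html.AppendFormat(CultureInfo.InvariantCulture,
                    "<span style=\"font-family:{0};font-size:{1}\">", c.Font, c.Size);
            }
            // Encode the text so characters such as '<' do not break the markup.
            html.Append(WebUtility.HtmlEncode(c.Text));
            last = c;
        }
        if (last != null) html.Append("</span>");
        return html.ToString();
    }
}
```

Passing two chunks on the same baseline with different fonts yields two adjacent spans, while a chunk on a new baseline is preceded by a line break.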

Conclusion:

This custom extraction strategy extracts PDF text together with its font name and font size and emits it as styled HTML, which lets you approximate the text styling of the original PDF document.
