How Can I Retrieve Text Formatting (Font, Size, Style) from a PDF Using iTextSharp?

Barbara Streisand
2025-01-11 10:56:42

How to extract text formatting using iTextSharp

iTextSharp's built-in text extraction is efficient, but it returns plain text and discards formatting details such as fonts, sizes, and colors. To retain some of that information, you can supply a custom text extraction strategy.

Customized text extraction strategy

The custom TextWithFontExtractionStategy class implements the ITextExtractionStrategy interface to capture formatting information. In its RenderText method, which iTextSharp calls for each chunk of text on the page:

  • It tracks the font name, faux-bold rendering, baseline position, and font size of each text chunk.
  • If any of these attributes change, it closes the current HTML span tag and opens a new one with the corresponding styles.
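
The behaviour described above can be sketched roughly as follows. This is a minimal illustration that assumes iTextSharp 5.x; the field names, the "-Bold" suffix for faux bold, the "pt" unit, and the HTML encoding are choices made for this sketch rather than details confirmed by the article.

```csharp
using System.Globalization;
using System.Net;
using System.Text;
using iTextSharp.text.pdf.parser;

// Sketch: wraps each run of same-styled text in an HTML span.
public class TextWithFontExtractionStategy : ITextExtractionStrategy
{
    // PDF text render mode 2 is "fill, then stroke", often used to fake bold.
    private const int FillThenStrokeText = 2;

    private readonly StringBuilder result = new StringBuilder();

    // Properties of the previous chunk; lastBaseline stays null until the first chunk.
    private Vector lastBaseline;
    private string lastFont;
    private float lastFontSize;

    public void RenderText(TextRenderInfo renderInfo)
    {
        string curFont = renderInfo.GetFont().PostscriptFontName;
        if (renderInfo.GetTextRenderMode() == FillThenStrokeText)
        {
            curFont += "-Bold"; // assumption: report faux bold as a name suffix
        }

        // Approximate the font size as the distance from baseline to ascent line.
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
        float curFontSize = topRight[Vector.I2] - curBaseline[Vector.I2];

        bool first = lastBaseline == null;
        // Exact float comparison: any change in baseline Y is treated as a new line.
        bool newLine = !first && curBaseline[Vector.I2] != lastBaseline[Vector.I2];

        if (first || newLine || curFontSize != lastFontSize || curFont != lastFont)
        {
            if (!first) result.AppendLine("</span>");
            if (newLine) result.AppendLine("<br />");
            // Assumption: sizes are in default PDF user-space units (points).
            result.AppendFormat(CultureInfo.InvariantCulture,
                "<span style=\"font-family:{0};font-size:{1}pt\">",
                WebUtility.HtmlEncode(curFont), curFontSize);
        }

        result.Append(WebUtility.HtmlEncode(renderInfo.GetText()));

        lastBaseline = curBaseline;
        lastFont = curFont;
        lastFontSize = curFontSize;
    }

    public string GetResultantText()
    {
        // The last span is still open whenever any text was written.
        return result.Length > 0 ? result.ToString() + "</span>" : string.Empty;
    }

    // Not needed for this strategy.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

Measuring the size from the baseline and ascent line, rather than reading a nominal font size, reflects what is actually drawn on the page, but it can give misleading values for rotated text.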

Example usage

The following C# code demonstrates how to extract text and font-related formatting from a PDF:

<code class="language-csharp">// Assumes: using System; using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
string path = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf");
PdfReader reader = new PdfReader(path);
// Run the custom strategy over page 1 (pages are 1-based); the result is an HTML fragment.
TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
string html = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();
Console.WriteLine(html);</code>

The generated HTML wraps each run of text in a span tag whose styles carry the font family and font size. Font style, such as bold, is typically reflected in the font name itself.

Other considerations

  • PostscriptFontName may carry a prefix before the actual font name (six uppercase letters followed by a plus sign), which typically indicates that the font was subsetted.
  • The example code assumes that a change in the baseline means a new line, and represents it as a line break in the HTML.
  • The extraction process does not currently capture color information, although it may be possible to add this manually.
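
For the first point, a small helper can normalize font names before they are written to the HTML. The sketch below assumes the extra characters are a standard PDF subset tag (six uppercase letters and a plus sign); the FontNameHelper class and StripSubsetPrefix method are hypothetical names introduced here, not part of iTextSharp.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class FontNameHelper
{
    // Removes a leading subset tag such as "ABCDEF+" and leaves other names untouched.
    public static string StripSubsetPrefix(string postscriptFontName)
    {
        if (postscriptFontName == null)
        {
            throw new ArgumentNullException(nameof(postscriptFontName));
        }

        // A subset tag is exactly six uppercase letters followed by '+'.
        if (postscriptFontName.Length > 7 && postscriptFontName[6] == '+')
        {
            for (int i = 0; i < 6; i++)
            {
                char c = postscriptFontName[i];
                if (c < 'A' || c > 'Z')
                {
                    return postscriptFontName;
                }
            }
            return postscriptFontName.Substring(7);
        }

        return postscriptFontName;
    }
}
```

With this helper, a name such as "ABCDEF+Arial-BoldMT" becomes "Arial-BoldMT", while a name without a subset tag is returned unchanged.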
