Extract formatted text using iTextSharp
Introduction:
iTextSharp is a powerful library for generating and manipulating PDF documents, but its built-in text extraction returns plain text, so formatting such as font and font size is lost. This article shows one way to extract both the text and its formatting information from a PDF using iTextSharp.
Custom extraction strategy:
To extract formatted text, you can create a custom ITextExtractionStrategy implementation. The strategy defines how each piece of text rendering information is handled as the page is parsed.
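As a rough illustration of what such an implementation involves, the sketch below shows the members that ITextExtractionStrategy requires in iTextSharp 5.x. The class name PlainTextStrategy is hypothetical, and the class only collects raw text; it is meant as a starting skeleton, not as the strategy described in this article.

```csharp
using System.Text;
using iTextSharp.text.pdf.parser;

// Hypothetical minimal strategy: collects raw text only, with no formatting.
public class PlainTextStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    // Called for each chunk of text the parser encounters on the page.
    public void RenderText(TextRenderInfo renderInfo)
    {
        buffer.Append(renderInfo.GetText());
    }

    // Returns everything collected so far; PdfTextExtractor calls this
    // after the page has been processed.
    public string GetResultantText()
    {
        return buffer.ToString();
    }

    // Required by the interface but not needed for this sketch.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

A formatting-aware strategy keeps the same shape and does its extra work inside RenderText.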
Code snippet:
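The following code defines a custom strategy that tracks changes in baseline, font name, and font size, and generates HTML with matching styles. The font size is approximated as the distance between the ascent line and the baseline, and text rendered in fill-then-stroke mode is treated as bold: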
<code>public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
    // ... (omitted here)

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
    {
        // Determine the font properties
        string curFont = renderInfo.GetFont().PostscriptFontName;
        if (renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText)
        {
            curFont += "-Bold";
        }

        // Check for a change in baseline, font, or font size
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        Single curFontSize = renderInfo.GetAscentLine().GetEndPoint()[Vector.I2] - curBaseline[Vector.I2];
        if ((this.lastBaseLine == null)
            || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
            || (curFontSize != lastFontSize)
            || (curFont != lastFont))
        {
            // Generate an HTML span with the updated styles
            result.AppendFormat("</code>
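The change-tracking idea can be tried out without iTextSharp. The sketch below is an assumption-laden illustration, not the article's original code: TextChunk and SpanBuilder are hypothetical names, TextChunk stands in for the values the strategy reads from TextRenderInfo, and the choices to emit a line break on a baseline change, to append a pt unit, and to HTML-encode the text are this sketch's own.

```csharp
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical stand-in for the values read from TextRenderInfo.
public sealed class TextChunk
{
    public string Text;
    public string Font;
    public float FontSize;
    public float BaselineY;
}

// Hypothetical helper: opens a new span whenever baseline, font, or size changes.
public sealed class SpanBuilder
{
    private readonly StringBuilder result = new StringBuilder();
    private bool hasPrevious;
    private string lastFont;
    private float lastFontSize;
    private float lastBaselineY;

    public void Add(TextChunk chunk)
    {
        // A different baseline is assumed to mean a new line of text.
        bool newLine = hasPrevious && chunk.BaselineY != lastBaselineY;
        bool changed = !hasPrevious || newLine
            || chunk.FontSize != lastFontSize
            || chunk.Font != lastFont;

        if (changed)
        {
            // Close the span opened for the previous run, if any.
            if (hasPrevious) result.Append("</span>");
            if (newLine) result.Append("<br />");
            result.AppendFormat(CultureInfo.InvariantCulture,
                "<span style=\"font-family:{0};font-size:{1}pt\">",
                chunk.Font, chunk.FontSize);
        }

        // Encode the text so characters such as '<' do not break the HTML.
        result.Append(WebUtility.HtmlEncode(chunk.Text));

        hasPrevious = true;
        lastFont = chunk.Font;
        lastFontSize = chunk.FontSize;
        lastBaselineY = chunk.BaselineY;
    }

    public string Build()
    {
        // The last span is still open once any chunk has been added.
        return hasPrevious ? result.ToString() + "</span>" : string.Empty;
    }
}
```

Consecutive chunks that share a baseline, font, and size end up in the same span, so the generated markup grows with the number of style changes rather than with the number of chunks.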
Usage:
To use the custom strategy, pass an instance of it to PdfTextExtractor.GetTextFromPage when extracting text:
<code>using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader("MyDocument.pdf");
TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
// Page numbers are 1-based
string textWithFormatting = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);</code>
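The call above handles a single page. A possible way to cover a whole document is sketched below; FormattedPdfText and ExtractAllPages are hypothetical names, and the code assumes the TextWithFontExtractionStategy class from this article is available in the same project. A new strategy instance is created for each page because the strategy keeps state between calls.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class FormattedPdfText
{
    // Hypothetical helper: returns the generated HTML for every page of a PDF.
    public static string ExtractAllPages(string path)
    {
        StringBuilder html = new StringBuilder();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Fresh strategy per page so that baseline, font, and buffer
                // state from earlier pages is not carried over.
                TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
                html.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            // Release the underlying file handle.
            reader.Close();
        }
        return html.ToString();
    }
}
```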
Output:
The textWithFormatting variable will contain the extracted text wrapped in HTML tags that reflect the formatting information, including font name and font size.
Conclusion:
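This custom extraction strategy lets you extract PDF text together with its font and font-size information. Because the font size and bold state are inferred from rendering details (ascent height and text render mode), the result approximates the styling of the original document rather than reproducing it exactly.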