How to Extract Text Formatting Information from PDFs Using iTextSharp?

DDD | Original | 2025-01-11 11:13:44 | 361 views

Getting text formatting information with iTextSharp

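iTextSharp provides a simple text extraction system that can handle some basic markup. It does not process color information, but you can implement that yourself. The following modified snippet, assembled from several questions and answers, extracts text as HTML while capturing font information such as size and bold.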

<code class="language-csharp">using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
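            //Open the PDF and extract page 1 using the custom strategy
            PdfReader reader = new PdfReader("Document.pdf");
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
            Console.WriteLine(F);
            //Release the file handle once extraction is done
            reader.Close();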

            this.Close();
        }

        public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
        {
            //HTML buffer
            private StringBuilder result = new StringBuilder();

            //Stores the most recently used properties
            private Vector lastBaseLine;
            private string lastFont;
            private float lastFontSize;

            //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
            private enum TextRenderMode
            {
                FillText = 0,
                StrokeText = 1,
                FillThenStrokeText = 2,
                Invisible = 3,
                FillTextAndAddToPathForClipping = 4,
                StrokeTextAndAddToPathForClipping = 5,
                FillThenStrokeTextAndAddToPathForClipping = 6,
                AddTextToPathForClipping = 7
            }



            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
            {
                string curFont = renderInfo.GetFont().PostscriptFontName;
                //Check whether faux bold is in use (fill then stroke)
                if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                {
                    curFont += "-Bold";
                }

                //This code assumes that a change in baseline means a line break
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Single curFontSize = rect.Height;

                //See whether anything has changed: baseline, font, or font size
                if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                {
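                    //If we have already written a span tag, close it
                    if ((this.lastBaseLine != null))
                    {
                        this.result.AppendLine("</span>");
                    }
                    //If the baseline has changed, insert a line break
                    //(&& short-circuits, so lastBaseLine is never dereferenced while null)
                    if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]))
                    {
                        this.result.AppendLine("<br />");
                    }
                    //Create an HTML tag with the appropriate styles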
                    this.result.AppendFormat("</code>

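With this code you can extract text from a PDF document while also capturing font attributes such as font family, size, and bold. The snippet is incomplete: to run it, you still need to write the opening <span> tag, close it, and append the text content.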
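The missing span handling can be sketched separately from iTextSharp. The class below is a minimal illustration, not the original author's code: `SpanBuilder`, `AddRun`, and `GetResult` are hypothetical names, and the `font-family`/`font-size` style string is an assumption about what the truncated `AppendFormat` call was meant to emit. It opens a `<span>` whenever the font, size, or baseline changes, closes the previous one, and appends the HTML-encoded text.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical helper, not part of iTextSharp: collects text runs and wraps
// each group of runs sharing font, size, and baseline in one <span>.
public class SpanBuilder
{
    private readonly StringBuilder result = new StringBuilder();
    private bool hasOpenSpan;
    private string lastFont;
    private float lastFontSize;
    private float lastBaselineY;

    public void AddRun(string text, string font, float fontSize, float baselineY)
    {
        // A new span is needed for the first run or when any property changes.
        bool changed = !hasOpenSpan
            || baselineY != lastBaselineY
            || fontSize != lastFontSize
            || font != lastFont;

        if (changed)
        {
            if (hasOpenSpan)
            {
                // Close the span opened for the previous run.
                result.AppendLine("</span>");
                // A different baseline is treated as a line break.
                if (baselineY != lastBaselineY)
                {
                    result.AppendLine("<br />");
                }
            }
            // Assumed style format; invariant culture keeps "12.5" from becoming "12,5".
            result.AppendFormat(
                CultureInfo.InvariantCulture,
                "<span style=\"font-family:{0};font-size:{1}\">",
                WebUtility.HtmlEncode(font), fontSize);
            hasOpenSpan = true;
        }

        // Encode the text so characters such as '<' do not break the HTML.
        result.Append(WebUtility.HtmlEncode(text));

        // Remember the properties of this run for the next comparison.
        lastFont = font;
        lastFontSize = fontSize;
        lastBaselineY = baselineY;
    }

    public string GetResult()
    {
        // The last span is still open; close it without mutating the buffer.
        return hasOpenSpan ? result.ToString() + "</span>" : result.ToString();
    }
}
```

Inside the extraction strategy, `RenderText` could then call `AddRun(renderInfo.GetText(), curFont, curFontSize, curBaseline[Vector.I2])`, and `GetResultantText` could return `GetResult()`. The remaining `ITextExtractionStrategy` members (`BeginTextBlock`, `EndTextBlock`, and `RenderImage`) can be left empty for this purpose. Comparing floats with `!=` mirrors the original snippet; a small tolerance may work better for PDFs whose baselines drift slightly.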
