search
HomeBackend DevelopmentC#.Net TutorialC# implements the function of converting PDF to text

Update

February 27, 2014: This article originally only described using PDFBox to parse PDF files. It has now been extended to include routines for using IFilter and iTextSharp.

 This article and the corresponding Visual Studio project have been updated to the latest PDFBox version (1.8.4). The complete project including all dependencies can be downloaded from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/ (removing dependencies is a bit tricky).

 How to parse PDF files

  Several main methods to extract text from PDF files in .NET are:

Microsoft’s IFilter interface and Adobe’s IFilter implementation;

iTextSharp;

PDFBox.

 Unfortunately, none of these PDF parsing solutions are perfect. We discuss these methods below.

 Adobe PDF IFilter

 To use the IFilter interface to parse PDF files, you need:

Windows 2000 or later

Adobe Acrobat or Reader 7.0.5+ (or standalone Adobe PDF IFilter [adobe.com])

IFilter COM encapsulation class [dotlucene.net]

Sample code:

using IFilter;
 
// ...
 
public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path); 
}

Disadvantages:

Uses unreliable COM interop to handle the IFilter interface (and combining IFilter COM and Adobe PDF IFilter is particularly troublesome).

Requires Adobe IFilter to be installed separately on the target system. It's a pain if you need to publish an indexable solution to others.

iTextSharp

iTextSharp (http://sourceforge.net/projects/itextsharp/) is a Java PDF operation library iText (http://itextpdf.com/) .NET output. It's primarily focused on editing PDFs rather than reading them, but it certainly supports extracting text from PDFs as well (although it's a bit overkill).

  Routine:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
 
// ...
  
public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
    StringBuilder text = new StringBuilder();
 
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }
 
    return text.ToString();
  }
}

Credit: Member number 10364982

Disadvantages:

Requires a license (if you don’t like AGPL license)

PDFBox

PDFBox is another Java PDF class library. It can also be used with original Java Lucene (see LucenePDFDocument).

Fortunately, PDFBox has a .NET version developed using IKVM.NET (just visit the PDFBox download page).

 To use PDFBox in .NET, you need to quote:

IKVM.OpenJDK.Core.dll

IKVM.OpenJDK.SwingAWT.dll

pdfbox-1.8.4.dll

 And copy the following files to the bin folder :

commons-logging.dll

fontbox-1.8.4.dll

IKVM.OpenJDK.Util.dll

IKVM.Runtime.dll

It is very simple to use PDFBox to parse PDF:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
 
// ...
 
private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    doc = PDDocument.load(path)
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    if (doc != null) {
      doc.close();
    }
  }
}

The compiled size increases It's almost 18MB in total:

IKVM.OpenJDK.Core.dll (4 MB)

IKVM.OpenJDK.SwingAWT.dll (6 MB)

pdfbox-1.8.4.dll (4 MB)

commons-logging. dll (82 kB)

fontbox-1.8.4.dll (180 kB)

IKVM.OpenJDK.Util.dll (2 MB)

IKVM.Runtime.dll (1 MB)

  Speed ​​is OK: parsing U.S. Copyright Act PDF (5.1 MB) file took 13 seconds.

 Thanks bobrien100 for the improvement suggestions.

 Disadvantages:

IKVM.NET dependency (18 MB)

Speed ​​(especially the startup time of IKVM.NET)


Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
The Future of C# .NET: Trends and OpportunitiesThe Future of C# .NET: Trends and OpportunitiesApr 29, 2025 am 12:02 AM

The future trends of C#.NET are mainly focused on three aspects: cloud computing, microservices, AI and machine learning integration, and cross-platform development. 1) Cloud computing and microservices: C#.NET optimizes cloud environment performance through the Azure platform and supports the construction of an efficient microservice architecture. 2) Integration of AI and machine learning: With the help of the ML.NET library, C# developers can embed machine learning models in their applications to promote the development of intelligent applications. 3) Cross-platform development: Through .NETCore and .NET5, C# applications can run on Windows, Linux and macOS, expanding the deployment scope.

C# .NET Development Today: Trends and Best PracticesC# .NET Development Today: Trends and Best PracticesApr 28, 2025 am 12:25 AM

The latest developments and best practices in C#.NET development include: 1. Asynchronous programming improves application responsiveness, and simplifies non-blocking code using async and await keywords; 2. LINQ provides powerful query functions, efficiently manipulating data through delayed execution and expression trees; 3. Performance optimization suggestions include using asynchronous programming, optimizing LINQ queries, rationally managing memory, improving code readability and maintenance, and writing unit tests.

C# .NET: Building Applications with the .NET EcosystemC# .NET: Building Applications with the .NET EcosystemApr 27, 2025 am 12:12 AM

How to build applications using .NET? Building applications using .NET can be achieved through the following steps: 1) Understand the basics of .NET, including C# language and cross-platform development support; 2) Learn core concepts such as components and working principles of the .NET ecosystem; 3) Master basic and advanced usage, from simple console applications to complex WebAPIs and database operations; 4) Be familiar with common errors and debugging techniques, such as configuration and database connection issues; 5) Application performance optimization and best practices, such as asynchronous programming and caching.

C# as a Versatile .NET Language: Applications and ExamplesC# as a Versatile .NET Language: Applications and ExamplesApr 26, 2025 am 12:26 AM

C# is widely used in enterprise-level applications, game development, mobile applications and web development. 1) In enterprise-level applications, C# is often used for ASP.NETCore to develop WebAPI. 2) In game development, C# is combined with the Unity engine to realize role control and other functions. 3) C# supports polymorphism and asynchronous programming to improve code flexibility and application performance.

C# .NET for Web, Desktop, and Mobile DevelopmentC# .NET for Web, Desktop, and Mobile DevelopmentApr 25, 2025 am 12:01 AM

C# and .NET are suitable for web, desktop and mobile development. 1) In web development, ASP.NETCore supports cross-platform development. 2) Desktop development uses WPF and WinForms, which are suitable for different needs. 3) Mobile development realizes cross-platform applications through Xamarin.

C# .NET Ecosystem: Frameworks, Libraries, and ToolsC# .NET Ecosystem: Frameworks, Libraries, and ToolsApr 24, 2025 am 12:02 AM

The C#.NET ecosystem provides rich frameworks and libraries to help developers build applications efficiently. 1.ASP.NETCore is used to build high-performance web applications, 2.EntityFrameworkCore is used for database operations. By understanding the use and best practices of these tools, developers can improve the quality and performance of their applications.

Deploying C# .NET Applications to Azure/AWS: A Step-by-Step GuideDeploying C# .NET Applications to Azure/AWS: A Step-by-Step GuideApr 23, 2025 am 12:06 AM

How to deploy a C# .NET app to Azure or AWS? The answer is to use AzureAppService and AWSElasticBeanstalk. 1. On Azure, automate deployment using AzureAppService and AzurePipelines. 2. On AWS, use Amazon ElasticBeanstalk and AWSLambda to implement deployment and serverless compute.

C# .NET: An Introduction to the Powerful Programming LanguageC# .NET: An Introduction to the Powerful Programming LanguageApr 22, 2025 am 12:04 AM

The combination of C# and .NET provides developers with a powerful programming environment. 1) C# supports polymorphism and asynchronous programming, 2) .NET provides cross-platform capabilities and concurrent processing mechanisms, which makes them widely used in desktop, web and mobile application development.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

SublimeText3 Linux new version

SublimeText3 Linux new version

SublimeText3 Linux latest version

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

MantisBT

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools