Update
February 27, 2014: This article originally described only how to use PDFBox to parse PDF files. It has now been extended with samples for IFilter and iTextSharp.
This article and the corresponding Visual Studio project have been updated to the latest PDFBox version (1.8.4). The complete project including all dependencies can be downloaded from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/ (removing dependencies is a bit tricky).
How to parse PDF files
The main ways to extract text from PDF files in .NET are:
Microsoft’s IFilter interface and Adobe’s IFilter implementation;
iTextSharp;
PDFBox.
Unfortunately, none of these PDF parsing solutions is perfect. Each is discussed below.
Adobe PDF IFilter
To use the IFilter interface to parse PDF files, you need:
Windows 2000 or later
Adobe Acrobat or Reader 7.0.5+ (or standalone Adobe PDF IFilter [adobe.com])
IFilter COM wrapper class [dotlucene.net]
Sample code:
    using IFilter;

    // ...

    public static string ExtractTextFromPdf(string path)
    {
        return DefaultParser.Extract(path);
    }
Disadvantages:
Uses unreliable COM interop to handle the IFilter interface (and combining IFilter COM and Adobe PDF IFilter is particularly troublesome).
Requires Adobe IFilter to be installed separately on the target system. This is a pain if you need to distribute your indexing solution to others.
iTextSharp
iTextSharp (http://sourceforge.net/projects/itextsharp/) is a .NET port of iText (http://itextpdf.com/), a Java PDF manipulation library. It is primarily focused on creating and editing PDFs rather than reading them, but it also supports extracting text from PDFs (although it is a bit overkill for that purpose alone).
Sample code:
    using System.Text; // StringBuilder
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // ...

    public static string ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();

            // iTextSharp page numbers are 1-based.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }

            return text.ToString();
        }
    }
Credit: Member number 10364982
Disadvantages:
Requires a commercial license if the AGPL license does not suit you
PDFBox
PDFBox is another Java PDF class library. It can also be used with original Java Lucene (see LucenePDFDocument).
Fortunately, PDFBox has a .NET version developed using IKVM.NET (just visit the PDFBox download page).
To use PDFBox in .NET, you need to reference:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.8.4.dll
And copy the following files to the bin folder:
commons-logging.dll
fontbox-1.8.4.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
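Because a missing assembly typically shows up only at run time, it can help to check the output folder up front. The following is a minimal sketch, not part of the original sample project: the class name PdfBoxDependencyCheck and the method FindMissing are hypothetical, and the file names are simply the ones from the two lists above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: reports which of the files listed above are
// missing from a given directory (for example the bin folder).
public static class PdfBoxDependencyCheck
{
    // File names copied from the lists above (PDFBox 1.8.4 build).
    private static readonly string[] RequiredFiles =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.4.dll",
        "commons-logging.dll",
        "fontbox-1.8.4.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of the required files not found in the directory.
    public static List<string> FindMissing(string directory)
    {
        List<string> missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

A possible call at application startup is PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory), logging or failing early if the returned list is not empty.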
Using PDFBox to parse a PDF is fairly simple:
    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    // ...

    private static string ExtractTextFromPdf(string path)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
The required assemblies add almost 18 MB in total to the compiled output:
IKVM.OpenJDK.Core.dll (4 MB)
IKVM.OpenJDK.SwingAWT.dll (6 MB)
pdfbox-1.8.4.dll (4 MB)
commons-logging.dll (82 kB)
fontbox-1.8.4.dll (180 kB)
IKVM.OpenJDK.Util.dll (2 MB)
IKVM.Runtime.dll (1 MB)
Speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took 13 seconds.
Thanks to bobrien100 for the improvement suggestions.
Disadvantages:
IKVM.NET dependency (18 MB)
Speed (especially the startup time of IKVM.NET)
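To see how much of the elapsed time is one-off startup cost rather than parsing, the extraction can be timed twice in the same process. The sketch below is an illustration only: ExtractionTimer, Measure and Report are hypothetical names, and the extractor argument stands for any method with the same shape as the ExtractTextFromPdf(string path) samples in this article.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for comparing a first (cold) run with a second (warm) run.
public static class ExtractionTimer
{
    // Runs the extractor once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Func<string, string> extractor, string path, out string text)
    {
        Stopwatch watch = Stopwatch.StartNew();
        text = extractor(path);
        watch.Stop();
        return watch.Elapsed;
    }

    // Times two consecutive runs; a large gap between them suggests
    // that startup cost dominates the first run.
    public static void Report(Func<string, string> extractor, string path)
    {
        string text;
        TimeSpan first = Measure(extractor, path, out text);
        TimeSpan second = Measure(extractor, path, out text);
        Console.WriteLine("First run:  {0:F1} s ({1} characters)", first.TotalSeconds, text.Length);
        Console.WriteLine("Second run: {0:F1} s", second.TotalSeconds);
    }
}
```

Assuming the extraction method is accessible from the caller, a call such as ExtractionTimer.Report(ExtractTextFromPdf, path) prints both timings. The same helper works for the IFilter and iTextSharp samples, which makes a rough side-by-side comparison possible.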
