Update
February 27, 2014: This article originally only described using PDFBox to parse PDF files. It has now been extended to include routines for using IFilter and iTextSharp.
This article and the corresponding Visual Studio project have been updated to the latest PDFBox version (1.8.4). The complete project including all dependencies can be downloaded from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/ (removing dependencies is a bit tricky).
How to parse PDF files
Several main methods to extract text from PDF files in .NET are:
Microsoft’s IFilter interface and Adobe’s IFilter implementation;
iTextSharp;
PDFBox.
Unfortunately, none of these PDF parsing solutions are perfect. We discuss these methods below.
Adobe PDF IFilter
To use the IFilter interface to parse PDF files, you need:
Windows 2000 or later
Adobe Acrobat or Reader 7.0.5+ (or standalone Adobe PDF IFilter [adobe.com])
IFilter COM encapsulation class [dotlucene.net]
Sample code:
using IFilter; // ... public static string ExtractTextFromPdf(string path) { return DefaultParser.Extract(path); }
Disadvantages:
Uses unreliable COM interop to handle the IFilter interface (and combining IFilter COM and Adobe PDF IFilter is particularly troublesome).
Requires Adobe IFilter to be installed separately on the target system. It's a pain if you need to publish an indexable solution to others.
iTextSharp
iTextSharp (http://sourceforge.net/projects/itextsharp/) is a Java PDF operation library iText (http://itextpdf.com/) .NET output. It's primarily focused on editing PDFs rather than reading them, but it certainly supports extracting text from PDFs as well (although it's a bit overkill).
Routine:
using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser; // ... public static string ExtractTextFromPdf(string path) { using (PdfReader reader = new PdfReader(path)) { StringBuilder text = new StringBuilder(); for (int i = 1; i <= reader.NumberOfPages; i++) { text.Append(PdfTextExtractor.GetTextFromPage(reader, i)); } return text.ToString(); } }
Credit: Member number 10364982
Disadvantages:
Requires a license (if you don’t like AGPL license)
PDFBox
PDFBox is another Java PDF class library. It can also be used with original Java Lucene (see LucenePDFDocument).
Fortunately, PDFBox has a .NET version developed using IKVM.NET (just visit the PDFBox download page).
To use PDFBox in .NET, you need to quote:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.8.4.dll
And copy the following files to the bin folder :
commons-logging.dll
fontbox-1.8.4.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
It is very simple to use PDFBox to parse PDF:
using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util; // ... private static string ExtractTextFromPdf(string path) { PDDocument doc = null; try { doc = PDDocument.load(path) PDFTextStripper stripper = new PDFTextStripper(); return stripper.getText(doc); } finally { if (doc != null) { doc.close(); } } }
The compiled size increases It's almost 18MB in total:
IKVM.OpenJDK.Core.dll (4 MB)
IKVM.OpenJDK.SwingAWT.dll (6 MB)
pdfbox-1.8.4.dll (4 MB)
commons-logging. dll (82 kB)
fontbox-1.8.4.dll (180 kB)
IKVM.OpenJDK.Util.dll (2 MB)
IKVM.Runtime.dll (1 MB)
Speed is OK: parsing U.S. Copyright Act PDF (5.1 MB) file took 13 seconds.
Thanks bobrien100 for the improvement suggestions.
Disadvantages:
IKVM.NET dependency (18 MB)
Speed (especially the startup time of IKVM.NET)

如何使用C#编写时间序列预测算法时间序列预测是一种通过分析过去的数据来预测未来数据趋势的方法。它在很多领域,如金融、销售和天气预报中有广泛的应用。在本文中,我们将介绍如何使用C#编写时间序列预测算法,并附上具体的代码示例。数据准备在进行时间序列预测之前,首先需要准备好数据。一般来说,时间序列数据应该具有足够的长度,并且是按照时间顺序排列的。你可以从数据库或者

如何使用Redis和C#开发分布式事务功能引言分布式系统的开发中,事务处理是一项非常重要的功能。事务处理能够保证在分布式系统中的一系列操作要么全部成功,要么全部回滚。Redis是一种高性能的键值存储数据库,而C#是一种广泛应用于开发分布式系统的编程语言。本文将介绍如何使用Redis和C#来实现分布式事务功能,并提供具体代码示例。I.Redis事务Redis

如何实现C#中的人脸识别算法人脸识别算法是计算机视觉领域中的一个重要研究方向,它可以用于识别和验证人脸,广泛应用于安全监控、人脸支付、人脸解锁等领域。在本文中,我们将介绍如何使用C#来实现人脸识别算法,并提供具体的代码示例。实现人脸识别算法的第一步是获取图像数据。在C#中,我们可以使用EmguCV库(OpenCV的C#封装)来处理图像。首先,我们需要在项目

如何使用C#编写动态规划算法摘要:动态规划是求解最优化问题的一种常用算法,适用于多种场景。本文将介绍如何使用C#编写动态规划算法,并提供具体的代码示例。一、什么是动态规划算法动态规划(DynamicProgramming,简称DP)是一种用来求解具有重叠子问题和最优子结构性质的问题的算法思想。动态规划将问题分解成若干个子问题来求解,通过记录每个子问题的解,

Redis在C#开发中的应用:如何实现高效的缓存更新引言:在Web开发中,缓存是提高系统性能的常用手段之一。而Redis作为一款高性能的Key-Value存储系统,能够提供快速的缓存操作,为我们的应用带来了不少便利。本文将介绍如何在C#开发中使用Redis,实现高效的缓存更新。Redis的安装与配置在开始之前,我们需要先安装Redis并进行相应的配置。你可以

C#开发中如何处理跨域请求和安全性问题在现代的网络应用开发中,跨域请求和安全性问题是开发人员经常面临的挑战。为了提供更好的用户体验和功能,应用程序经常需要与其他域或服务器进行交互。然而,浏览器的同源策略导致了这些跨域请求被阻止,因此需要采取一些措施来处理跨域请求。同时,为了保证数据的安全性,开发人员还需要考虑一些安全性问题。本文将探讨C#开发中如何处理跨域请

如何实现C#中的图像压缩算法摘要:图像压缩是图像处理领域中的一个重要研究方向,本文将介绍在C#中实现图像压缩的算法,并给出相应的代码示例。引言:随着数字图像的广泛应用,图像压缩成为了图像处理中的重要环节。压缩能够减小存储空间和传输带宽,并能提高图像处理的效率。在C#语言中,我们可以通过使用各种图像压缩算法来实现对图像的压缩。本文将介绍两种常见的图像压缩算法:

如何在C#中实现遗传算法引言:遗传算法是一种模拟自然选择和基因遗传机制的优化算法,其主要思想是通过模拟生物进化的过程来搜索最优解。在计算机科学领域,遗传算法被广泛应用于优化问题的解决,例如机器学习、参数优化、组合优化等。本文将介绍如何在C#中实现遗传算法,并提供具体的代码示例。一、遗传算法的基本原理遗传算法通过使用编码表示解空间中的候选解,并利用选择、交叉和


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

EditPlus Chinese cracked version
Small size, syntax highlighting, does not support code prompt function

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft
