search
HomeBackend DevelopmentPython TutorialUnlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

Extracting text from a PDF is usually straightforward when it's in English and doesn't have embedded fonts. However, once those assumptions are removed, it becomes challenging to use basic python libraries like pdfminer or pdfplumber. Last month, I was tasked with extracting text from a Gujarati-language PDF and importing data fields such as name, address, city, etc., into JSON format.

If the font is embedded in the PDF itself, simple copy-pasting won't work, and using pdfplumber will return unreadable junk text. Therefore, I had to convert each PDF page to an image and then apply OCR using the pytesseract library to "scan" the page instead of just reading it. This tutorial will show you how to do just that.

Things you will need

  • pdfplumber (Python library)
  • pdf2image (Python library)
  • pytesseract (Python library)
  • tesseract-ocr

You can install the Python libraries using pip commands as shown below. For Tesseract-OCR, download and install the software from the official site. pytesseract is just a wrapper around the tesseract software.

pip install pdfplumber
pip install pdf2image
pip install pytesseract

Converting the PDF page to an image

The first step is to convert your PDF page to an image. This extract_text_from_pdf() function does exactly that-you pass the PDF path and the page_num (zero indexed) as parameters. Note that I'm converting the page to black and white first for clarity, this is optional.

# Extract text from a specific page of a PDF
def extract_text_from_pdf(pdf_path, page_num):
    # Use pdfplumber to open the PDF
    pdf = pdfplumber.open(pdf_path)
    print(f"extracting page {page_num}..")
    page = pdf.pages[page_num]
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1)
    image = images[0]
    # Convert to black and white
    bw_image = convert_to_bw(image)
    # Save the B&W image for debugging (optional)
    #bw_image.save("bw_page.png")
    # Perform OCR on the B&W image
    e_text = ocr_image(bw_image)
    open('out.txt', 'w', encoding='utf-8').write(e_text)
    #print("output written to file.")
    try:
        process_text(page_num, e_text)
    except Exception as e:
        print("Error occurred:", e)
    print("done..")

# Convert image to black and white
def convert_to_bw(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Apply threshold to convert to pure black and white
    bw = gray.point(lambda x: 0 if x 



<p>The ocr_image() function uses pytesseract to extract text from the image through OCR. The technical parameters like --oem and --psm control how the image is processed, and the -l guj eng parameter sets the languages to be read. Since this PDF contained occasional English text, I used guj eng.</p>

<h2>
  
  
  Processing the text
</h2>

<p>Once you've imported the text using OCR, you can parse it in the format you want. This works similarly to other PDF libraries like pdfplumber or pypdf2.<br>
</p>

<pre class="brush:php;toolbar:false">nums = ['0', '૧', '૨', '૩', '૪', '૫', '૬', '૭', '૮', '૯']

def process_text(page_num, e_text):
    obj = None
    last_surname = None
    last_kramank = None
    print(f"processing page {page_num}..")
    for line in e_text.splitlines():
        line = line.replace('|', '').replace('[', '').replace(']', '')
        parts = [word for word in line.split(' ') if word]
        if len(parts) == 0: continue
        new_rec = True
        for char in parts[0]:
            if char not in nums:
                new_rec = False
                break
        if len(parts) = 2: # numbered line
            if len(parts) 



<p>Every PDF has its own nuances that must be accounted for. In this case, a new serial number (like 0૧ or 0૨) in the first field signaled a new group when the subsequent field (surname) changed.</p>

<p>pytesseract is a testament to the evolution and advancement in IT technology. About a decade ago, reading or parsing a PDF image using OCR in a non-English language on a modestly configured PC or laptop would have been nearly impossible. This is truly progress! Happy coding, and let me know how it goes in the comments below.</p><h2>
  
  
  References
</h2>

  • Tesseract installation in windows
  • Use pytesseract OCR to recognize text from an image
  • How to configure pytesseract to support text detection for non English language in windows 10?

The above is the detailed content of Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Merging Lists in Python: Choosing the Right MethodMerging Lists in Python: Choosing the Right MethodMay 14, 2025 am 12:11 AM

TomergelistsinPython,youcanusethe operator,extendmethod,listcomprehension,oritertools.chain,eachwithspecificadvantages:1)The operatorissimplebutlessefficientforlargelists;2)extendismemory-efficientbutmodifiestheoriginallist;3)listcomprehensionoffersf

How to concatenate two lists in python 3?How to concatenate two lists in python 3?May 14, 2025 am 12:09 AM

In Python 3, two lists can be connected through a variety of methods: 1) Use operator, which is suitable for small lists, but is inefficient for large lists; 2) Use extend method, which is suitable for large lists, with high memory efficiency, but will modify the original list; 3) Use * operator, which is suitable for merging multiple lists, without modifying the original list; 4) Use itertools.chain, which is suitable for large data sets, with high memory efficiency.

Python concatenate list stringsPython concatenate list stringsMay 14, 2025 am 12:08 AM

Using the join() method is the most efficient way to connect strings from lists in Python. 1) Use the join() method to be efficient and easy to read. 2) The cycle uses operators inefficiently for large lists. 3) The combination of list comprehension and join() is suitable for scenarios that require conversion. 4) The reduce() method is suitable for other types of reductions, but is inefficient for string concatenation. The complete sentence ends.

Python execution, what is that?Python execution, what is that?May 14, 2025 am 12:06 AM

PythonexecutionistheprocessoftransformingPythoncodeintoexecutableinstructions.1)Theinterpreterreadsthecode,convertingitintobytecode,whichthePythonVirtualMachine(PVM)executes.2)TheGlobalInterpreterLock(GIL)managesthreadexecution,potentiallylimitingmul

Python: what are the key featuresPython: what are the key featuresMay 14, 2025 am 12:02 AM

Key features of Python include: 1. The syntax is concise and easy to understand, suitable for beginners; 2. Dynamic type system, improving development speed; 3. Rich standard library, supporting multiple tasks; 4. Strong community and ecosystem, providing extensive support; 5. Interpretation, suitable for scripting and rapid prototyping; 6. Multi-paradigm support, suitable for various programming styles.

Python: compiler or Interpreter?Python: compiler or Interpreter?May 13, 2025 am 12:10 AM

Python is an interpreted language, but it also includes the compilation process. 1) Python code is first compiled into bytecode. 2) Bytecode is interpreted and executed by Python virtual machine. 3) This hybrid mechanism makes Python both flexible and efficient, but not as fast as a fully compiled language.

Python For Loop vs While Loop: When to Use Which?Python For Loop vs While Loop: When to Use Which?May 13, 2025 am 12:07 AM

Useaforloopwheniteratingoverasequenceorforaspecificnumberoftimes;useawhileloopwhencontinuinguntilaconditionismet.Forloopsareidealforknownsequences,whilewhileloopssuitsituationswithundeterminediterations.

Python loops: The most common errorsPython loops: The most common errorsMay 13, 2025 am 12:07 AM

Pythonloopscanleadtoerrorslikeinfiniteloops,modifyinglistsduringiteration,off-by-oneerrors,zero-indexingissues,andnestedloopinefficiencies.Toavoidthese:1)Use'i

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),