Python verification code recognition tutorial: grayscale processing, binarization, noise reduction and tesserocr recognition

不言

Jun 04, 2018 am 11:30 AM

This article walks through the first stages of verification code (CAPTCHA) recognition in Python: grayscale processing, binarization, noise reduction, and recognition with tesserocr. It is shared here as a reference for anyone who needs it.

Preface

Verification codes are an unavoidable problem when writing crawlers. There are currently about four types:

Image
Slider
Click
Voice

Today, let's look at the image type. Most of these verification codes combine digits and letters, and sites in China also use Chinese characters. On top of that, noise, interference lines, deformation, overlapping characters, varying font colors and similar tricks are added to make recognition harder.

Correspondingly, verification code recognition can be roughly divided into the following steps:

Grayscale processing
Increase contrast (optional)
Binarization
Noise reduction
Tilt correction and character segmentation
Building a training dataset
Recognition
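
The optional contrast step is not implemented later in this article. As a minimal sketch of one common approach, linear (min-max) contrast stretching, the example below works on a grayscale image stored as a plain list of rows. The function name stretch_contrast and the sample values are hypothetical, and this is not necessarily the method the author had in mind.

```python
def stretch_contrast(gray):
    """Linearly rescale grayscale values so the darkest becomes 0 and the brightest 255.

    `gray` is a list of rows of ints in 0..255 (a stand-in for a grayscale image).
    """
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in gray]
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in gray]


washed_out = [[100, 120], [140, 160]]
print(stretch_contrast(washed_out))  # -> [[0, 85], [170, 255]]
```

On a real image, Pillow's ImageOps.autocontrast performs a comparable stretch.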

Because this is experimental in nature, the verification codes used in this article are all generated by a program rather than downloaded in bulk from real websites. The advantage is that a large dataset with known correct answers is easy to produce.

When you need data from a real environment, you can use one of the various captcha-solving (manual labeling) platforms to build a training dataset.

To generate the verification codes I use the Claptcha library. The Captcha library is also a good choice.
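
To get "a large dataset with known correct answers", the text of each verification code has to be chosen by the program. The sketch below only produces the label strings. The helper name random_code and the seed are assumptions for illustration, and the rendering call is left as a comment because it depends on the generator library and font path in use.

```python
import random
import string


def random_code(length=4, alphabet=string.digits, rng=random):
    """Return a random verification-code string drawn from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(42)  # seeded so the same dataset can be regenerated
labels = [random_code(4, rng=rng) for _ in range(5)]
print(labels)

# Each label could then be rendered with the generator, for example
# Claptcha(label, font_path).write(label + ".png"), so that the file name
# doubles as the ground-truth answer. Duplicate labels would overwrite
# each other under that naming scheme.
```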

To generate the simplest possible verification code, digits only with no interference, we first need to modify _drawLine, at line 285 of claptcha.py. I simply made this function return None, and then generated a verification code:

from claptcha import Claptcha

c = Claptcha("8069", "/usr/share/fonts/truetype/freefont/FreeMono.ttf")
t, _ = c.write('1.png')

Note that the font path here is the one used on Ubuntu; you can also download other fonts and use them. The generated verification code looks like this:

You can see that the characters are deformed. The simplest verification codes of this kind can be recognized directly with tesserocr, a Python wrapper around Tesseract, the open-source OCR engine long sponsored by Google.

First install:

apt-get install tesseract-ocr libtesseract-dev libleptonica-dev
pip install tesserocr

Then start recognition:

from PIL import Image
import tesserocr

p1 = Image.open('1.png')
tesserocr.image_to_text(p1)

'8069\n\n'

You can see that for this simple verification code the recognition rate is already very high with almost no extra work. Interested readers can test with more data; I won't go into details here.
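
For readers who do want to test with more data, a recognition rate can be computed along these lines. This is a sketch under assumptions: recognition_rate is a hypothetical helper, and a dictionary of canned OCR outputs stands in for the recognizer so that the example runs without Tesseract installed. In practice the callable passed in would be something like tesserocr.image_to_text applied to opened images.

```python
def recognition_rate(samples, recognize):
    """Fraction of samples whose recognized text equals the expected label.

    `samples` is a list of (image, expected_text) pairs and `recognize` is any
    callable that maps an image to raw OCR text.
    """
    if not samples:
        return 0.0
    hits = sum(1 for image, expected in samples
               if recognize(image).strip() == expected)
    return hits / len(samples)


# Stand-in recognizer: the "images" are just keys into canned OCR outputs.
canned = {"a.png": "8069\n\n", "b.png": "1234\n\n", "c.png": "55l7\n\n"}
samples = [("a.png", "8069"), ("b.png", "1234"), ("c.png", "5517")]
rate = recognition_rate(samples, canned.get)
print(rate)  # two of the three canned results match their labels
```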

Next, add noise to the background of the verification code and see what happens:

c = Claptcha("8069", "/usr/share/fonts/truetype/freefont/FreeMono.ttf", noise=0.4)
t, _ = c.write('2.png')

The generated verification code looks like this:

Recognition:

p2 = Image.open('2.png')
tesserocr.image_to_text(p2)
'8069\n\n'

The result is acceptable. Next, generate a combination of letters and digits:

c2 = Claptcha("A4oO0zZ2", "/usr/share/fonts/truetype/freefont/FreeMono.ttf")
t, _ = c2.write('3.png')

The generated verification code looks like this:

The third character is a lowercase letter o, the fourth an uppercase letter O, the fifth the digit 0, the sixth a lowercase letter z, the seventh an uppercase letter Z, and the last the digit 2. Even human eyes give up here! Fortunately, most verification codes today do not strictly distinguish upper and lower case. Let's see what automatic recognition makes of it:

p3 = Image.open('3.png')
tesserocr.image_to_text(p3)
'AMOOZW\n\n'
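
To see exactly where this result goes wrong while still ignoring case, the expected and recognized strings can be compared position by position. The helper below is a hypothetical sketch. A purely positional comparison overstates the damage when the OCR drops a character early on, so treat it as a rough diagnostic only.

```python
from itertools import zip_longest


def mismatched_positions(expected, predicted, ignore_case=True):
    """Return the indices where `predicted` disagrees with `expected`.

    Characters are compared position by position; a missing character
    (when the strings differ in length) counts as a mismatch.
    """
    if ignore_case:
        expected, predicted = expected.lower(), predicted.lower()
    return [i for i, (e, p) in enumerate(zip_longest(expected, predicted))
            if e != p]


print(mismatched_positions("A4oO0zZ2", "AMOOZW"))  # -> [1, 4, 5, 6, 7]
```

Only three of the eight positions match (the A and the two letter o characters), even though case is ignored.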

Of course, when even human eyes give up, the computer has no chance. However, for cases where the interference is small and the deformation is not severe, tesserocr is simple and convenient to use. Next, restore the modified _drawLine at line 285 of claptcha.py and see what happens once interference lines are added.

p4 = Image.open('4.png')
tesserocr.image_to_text(p4)
''

After adding an interference line, it cannot be recognized at all. So is there any way to remove the interference line?

Although the picture looks black and white, it still needs to be converted to grayscale. Otherwise load() returns an RGB tuple for each pixel instead of a single value. The processing is as follows:

def binarizing(img, threshold):
    """Convert the given image object to grayscale, then binarize it."""
    img = img.convert("L")  # convert to grayscale
    pixdata = img.load()
    w, h = img.size
    # Visit every pixel: below the threshold becomes black, otherwise white
    for y in range(h):
        for x in range(w):
            if pixdata[x, y] < threshold:
                pixdata[x, y] = 0
            else:
                pixdata[x, y] = 255
    return img
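
To make the "single value instead of an RGB tuple" point concrete: Pillow documents convert("L") as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula and the same threshold rule to plain tuples, without Pillow. The helper names are hypothetical, and Pillow's internal rounding may differ from this integer arithmetic by one gray level.

```python
def to_gray(rgb):
    """Approximate the luma transform Pillow documents for convert("L")."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


def to_black_or_white(rgb, threshold):
    """Same rule as binarizing(): below the threshold is black, otherwise white."""
    return 0 if to_gray(rgb) < threshold else 255


for pixel in [(255, 255, 255), (30, 30, 30), (255, 0, 0)]:
    print(pixel, to_gray(pixel), to_black_or_white(pixel, 128))
```

Pure red comes out at gray level 76, so with a threshold of 128 red strokes on a white background end up black, just like dark gray ones.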

The processed picture is as follows:

After processing, the picture is much sharper. Next, I tried to remove the interference lines with the common 4-neighborhood and 8-neighborhood algorithms. To picture an X-neighborhood algorithm, think of the nine-key (3x3) input keypad on a mobile phone: key 5 is the pixel being judged, the 4-neighborhood checks the keys above, below, left and right of it, and the 8-neighborhood checks all 8 surrounding keys. If the number of these 4 or 8 points with value 255 (white) exceeds a certain threshold, the pixel is judged to be noise. The threshold can be adjusted to suit the actual situation.

def depoint(img):
    """Denoise an image that has already been binarized."""
    pixdata = img.load()
    w, h = img.size
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            count = 0
            if pixdata[x, y - 1] > 245:  # above
                count += 1
            if pixdata[x, y + 1] > 245:  # below
                count += 1
            if pixdata[x - 1, y] > 245:  # left
                count += 1
            if pixdata[x + 1, y] > 245:  # right
                count += 1
            if pixdata[x - 1, y - 1] > 245:  # upper left
                count += 1
            if pixdata[x - 1, y + 1] > 245:  # lower left
                count += 1
            if pixdata[x + 1, y - 1] > 245:  # upper right
                count += 1
            if pixdata[x + 1, y + 1] > 245:  # lower right
                count += 1
            if count > 4:  # more than 4 of the 8 neighbors are white
                pixdata[x, y] = 255
    return img
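
The same idea can be tried on a small grid without any image file. The sketch below is a variant, not a copy of depoint: the neighborhood (4 or 8) and the threshold are parameters, and the result is written to a copy so that a pixel cleared earlier in the scan does not influence later decisions, whereas depoint edits the image in place. All names here are hypothetical.

```python
NEIGHBORS_4 = [(0, -1), (0, 1), (-1, 0), (1, 0)]
NEIGHBORS_8 = NEIGHBORS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def denoise(grid, neighbors=NEIGHBORS_8, min_white=5):
    """Turn a pixel white when at least `min_white` of its neighbors are white.

    `grid` is a list of rows holding 0 (black) or 255 (white).
    """
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, h - 1):  # border pixels are left untouched
        for x in range(1, w - 1):
            white = sum(1 for dx, dy in neighbors
                        if grid[y + dy][x + dx] == 255)
            if white >= min_white:
                out[y][x] = 255
    return out


image = [[255] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        image[y][x] = 0  # a solid 3x3 stroke
image[5][5] = 0          # an isolated noise pixel

cleaned = denoise(image)
print(cleaned[5][5], cleaned[2][2])  # -> 255 0
```

The isolated pixel disappears and the center of the stroke survives. The corners of the 3x3 stroke are whitened too, because each corner has five white neighbors, which shows how this filter also nibbles at real character edges.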

The processed pictures are as follows:

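It seems... to have made no difference at all?! That is indeed the case, because in this example the interference line is as wide as the strokes of the digits. It is different when the interference lines and the data have different pixel widths, as in verification codes generated by Captcha:

From left to right: the original image, the binarized image, and the image with interference lines removed. Overall, the noise reduction effect is fairly clear. Noise reduction can also be run more than once; for example, running it once more on the denoised result above gives the following:

Recognizing the result then gives: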

p7 = Image.open('7.png')
tesserocr.image_to_text(p7)
'8069 ,,\n\n'
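
The digits are right, but leftover specks were read as a space and commas. When the set of characters a verification code can contain is known, the raw OCR text can simply be filtered. A minimal sketch, with clean_ocr_text as a hypothetical helper:

```python
import string


def clean_ocr_text(raw, allowed=string.digits):
    """Keep only the characters that can legally appear in the verification code."""
    return "".join(ch for ch in raw if ch in allowed)


print(repr(clean_ocr_text('8069 ,,\n\n')))  # -> '8069'
```

Tesseract also exposes a tessedit_char_whitelist variable for restricting the character set during recognition; whether it takes effect depends on the Tesseract version and engine mode, so filtering afterwards is the more predictable option.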

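In addition, the image shows that the color of the actual data is clearly different from that of the noise and interference lines. This can be used to remove all the noise directly, which I won't expand on here.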
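
Since the article does not expand on this, here is one possible way to do it, as a sketch only: keep the pixels whose color is close to the color of the characters and turn everything else white. The function name, the tolerance and the sample colors are all assumptions, and the text color is taken as already known.

```python
def keep_color(pixels, target, tolerance=40):
    """Return a binary grid: 0 where a pixel is close to `target`, else 255.

    `pixels` is a list of rows of (r, g, b) tuples. Closeness is the largest
    per-channel difference, a simple choice rather than a perceptually
    accurate color distance.
    """
    def close(rgb):
        return max(abs(a - b) for a, b in zip(rgb, target)) <= tolerance

    return [[0 if close(rgb) else 255 for rgb in row] for row in pixels]


text_color = (200, 30, 30)  # assumed color of the characters
row = [(205, 25, 35), (120, 120, 120), (255, 255, 255)]
print(keep_color([row], text_color))  # -> [[0, 255, 255]]
```

This has to run on the original color image, before the grayscale conversion discards the color information.
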
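As the first article in the series, this one records how to apply grayscale processing, binarization and noise reduction to an image and combine them with tesserocr to recognize simple verification codes. The remaining steps will be covered in the next article.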
