


PHP OCR practice: reading text from images with Tesseract
Optical Character Recognition (OCR) is the process of converting printed text into a digital representation. It has a variety of practical applications – from digitizing printed books and creating electronic records of receipts, to license plate recognition and even cracking image-based captchas.
Tesseract is an open source OCR engine. It runs on *nix, Mac, and Windows systems, and with a wrapper library we can also use it from PHP projects. This tutorial shows you how.
Install
Preparation
To keep things simple and consistent, we will run the application in a virtual machine managed by Vagrant, which takes care of installing PHP and Nginx. We will then install Tesseract separately to demonstrate the process. If you want to install Tesseract yourself on an existing Debian-based system, you can skip the next section, or check out the README for installation instructions for other *nix systems, Mac, or Windows.
Configuring Vagrant
To configure Vagrant for this tutorial, complete the following steps, or simply get the code from GitHub.
Enter the following command to download the Homestead Improved Vagrant configuration into a folder named ocr:
git clone https://github.com/Swader/homestead_improved ocr
In the Homestead.yml configuration file, change the site mapping from:
<ol class="dp-c"><li class="alt"><span><span>sites: </span></span></li><li><span> - map: homestead.app </span></li><li class="alt"><span> to: /home/vagrant/Code/Project/<span class="keyword">public</span><span> </span></span></li></ol>
to:
<ol class="dp-c"><li class="alt"><span><span>sites: </span></span></li><li><span> - map: homestead.app </span></li><li class="alt"><span> to: /home/vagrant/Code/<span class="keyword">public</span><span> </span></span></li></ol>
Also add the following line to your hosts file:
<ol class="dp-c"><li class="alt"><span><span>192.168.10.10 homestead.app </span></span></li></ol>
Install Tesseract
The next step is to install Tesseract.
Because Homestead Improved uses Debian, we can install it with apt-get after logging into the virtual machine using vagrant ssh. Simply run the following command:
<ol class="dp-c"><li class="alt"><span><span>sudo apt-get install tesseract-ocr </span></span></li></ol>
As mentioned above, there are other operating system-specific tutorials in the README.
Test and customize the installation
We will use the PHP wrapper, but before that we can test Tesseract from the command line.
First, save this image as sign.png.
In the virtual machine, run the following command to read the text from the image:
<ol class="dp-c"><li class="alt"><span><span>tesseract sign.png out </span></span></li></ol>
This creates a file named out.txt in the current folder, which should contain the word CAUTION.
Now try sign2.jpg:
<ol class="dp-c"><li class="alt"><span><span>tesseract sign2.jpg out </span></span></li></ol>
This time it produces the word Einbahnstral’ie. That is close but not correct: although the text in the image is quite clear, Tesseract fails to recognize the character ß.
In order for Tesseract to read strings properly, we need to install some new language files - in this case, German.
There is a comprehensive list of available language files here, but let's just download the one we need:
<ol class="dp-j"><li class="alt"><span><span>wget https:</span><span class="comment">//tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.deu.tar.gz</span><span> </span></span></li></ol>
Extract the archive:
<ol class="dp-c"><li class="alt"><span><span>tar zxvf tesseract-ocr-3.02.deu.tar.gz </span></span></li></ol>
Then copy the file to the following directory:
<ol class="dp-c"><li class="alt"><span><span>/usr/share/tesseract-ocr/tessdata </span></span></li></ol>
For example
<ol class="dp-c"><li class="alt"><span><span>cp deu-frak.traineddata /usr/share/tesseract-ocr/tessdata </span></span></li><li><span>cp deu.traineddata /usr/share/tesseract-ocr/tessdata </span></li></ol>
Now run the original command again, this time with the -l flag:
<ol class="dp-j"><li class="alt"><span><span>tesseract sign2.jpg out -l deu </span></span></li></ol>
"deu" is the ISO 639-3 code for German.
This time, the text should be correct: Einbahnstraße.
You can add any other language by repeating the process above.
Configuring the Application
We will use the thiagoalessio/tesseract_ocr wrapper library to call Tesseract from PHP.
We will build a minimalist web application in which users upload an image and view the OCR results. We will use the Silex microframework for this. Don't worry if you're not familiar with it; the app itself is simple.
Remember that all the code for this tutorial is available on GitHub.
The first step is to install the dependencies with Composer:
<ol class="dp-c"><li class="alt"><span><span>composer </span><span class="keyword">require</span><span> silex/silex twig/twig thiagoalessio/tesseract_ocr:dev-master </span></span></li></ol>
Then create three folders:
<ol class="dp-c"><li class="alt"><span><span>- </span><span class="keyword">public</span><span> </span></span></li><li><span>- uploads </span></li><li class="alt"><span>- views </span></li></ol>
We need an upload form (views/index.twig):
<ol class="dp-c"><li class="alt"><span><span><html> </span></span></li><li><span> <head> </span></li><li class="alt"><span> <title>OCR</title> </span></li><li><span> </head> </span></li><li class="alt"><span> <body> </span></li><li><span> </span></li><li class="alt"><span> <form action=<span class="string">""</span><span> method=</span><span class="string">"post"</span><span> enctype=</span><span class="string">"multipart/form-data"</span><span>> </span></span></li><li><span> <input type=<span class="string">"file"</span><span> name=</span><span class="string">"upload"</span><span>> </span></span></li><li class="alt"><span> <input type=<span class="string">"submit"</span><span>> </span></span></li><li><span> </form> </span></li><li class="alt"><span> </span></li><li><span> </body> </span></li><li class="alt"><span></html> </span></li></ol>
We also need a page to display the results (views/results.twig):
<ol class="dp-c"><li class="alt"><span><span><html> </span></span></li><li><span> <head> </span></li><li class="alt"><span> <title>OCR</title> </span></li><li><span> </head> </span></li><li class="alt"><span> <body> </span></li><li><span> </span></li><li class="alt"><span> <h2 id="Results">Results</h2> </span></li><li><span> </span></li><li class="alt"><span> <textarea cols=<span class="string">"50"</span><span> rows=</span><span class="string">"10"</span><span>>{{ text }}</textarea> </span></span></li><li><span> </span></li><li class="alt"><span> <hr> </span></li><li><span> </span></li><li class="alt"><span> <a href=<span class="string">"/"</span><span>>← Go back</a> </span></span></li><li><span> </span></li><li class="alt"><span> </body> </span></li><li><span></html> </span></li></ol>
Now create the skeleton Silex app (public/index.php):
<ol class="dp-c"><li class="alt"><span><span><?php </span></span></li><li><span> </span></li><li class="alt"><span><span class="keyword">require</span><span> __DIR__.</span><span class="string">'/../vendor/autoload.php'</span><span>; </span></span></li><li><span> </span></li><li class="alt"><span><span class="keyword">use</span><span> Symfony\Component\HttpFoundation\Request; </span></span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span> = </span><span class="keyword">new</span><span> Silex\Application(); </span></span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->register(</span><span class="keyword">new</span><span> Silex\Provider\TwigServiceProvider(), [ </span></span></li><li><span> <span class="string">'twig.path'</span><span> => __DIR__.</span><span class="string">'/../views'</span><span>, </span></span></li><li class="alt"><span>]); </span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>[</span><span class="string">'debug'</span><span>] = true; </span></span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->get(</span><span class="string">'/'</span><span>, </span><span class="keyword">function</span><span>() </span><span class="keyword">use</span><span> (</span><span class="vars">$app</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">return</span><span> </span><span class="vars">$app</span><span>[</span><span class="string">'twig'</span><span>]->render(</span><span class="string">'index.twig'</span><span>); </span></span></li><li><span> </span></li><li class="alt"><span>}); </span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->post(</span><span class="string">'/'</span><span>, </span><span class="keyword">function</span><span>(Request </span><span class="vars">$request</span><span>) </span><span class="keyword">use</span><span> (</span><span class="vars">$app</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// TODO</span><span> </span></span></li><li><span> </span></li><li class="alt"><span>}); </span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->run(); </span></span></li></ol>
If you access the app in a browser, you should see a file upload form. If you are using Homestead Improved Vagrant, you can access the application through the link below.
<ol class="dp-c"><li class="alt"><span><span>http:</span><span class="comment">//homestead.app/</span><span> </span></span></li></ol>
The next step is to handle the file upload. Silex makes this simple: $request has a files component through which we can access any uploaded file:
<ol class="dp-c"><li class="alt"><span><span class="comment">// Grab the uploaded file</span><span> </span></span></li><li><span><span class="vars">$file</span><span> = </span><span class="vars">$request</span><span>->files->get(</span><span class="string">'upload'</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span><span class="comment">// Extract some information about the uploaded file</span><span> </span></span></li><li class="alt"><span><span class="vars">$info</span><span> = </span><span class="keyword">new</span><span> SplFileInfo(</span><span class="vars">$file</span><span>->getClientOriginalName()); </span></span></li><li><span> </span></li><li class="alt"><span><span class="comment">// Create a quasi-random filename</span><span> </span></span></li><li><span><span class="vars">$filename</span><span> = sprintf(</span><span class="string">'%d.%s'</span><span>, time(), </span><span class="vars">$info</span><span>->getExtension()); </span></span></li><li class="alt"><span> </span></li><li><span><span class="comment">// Copy the file</span><span> </span></span></li><li class="alt"><span><span class="vars">$file</span><span>->move(__DIR__.</span><span class="string">'/../uploads'</span><span>, </span><span class="vars">$filename</span><span>); </span></span></li></ol>
As you can see, we generate a quasi-random filename from the current timestamp to reduce the chance of filename conflicts; in this application, it doesn't really matter what the files are called. Once we have a local copy of the file, we can create an instance of the Tesseract wrapper for it:
<ol class="dp-c"><li class="alt"><span><span class="comment">// Instantiate the Tesseract library</span><span> </span></span></li><li><span><span class="vars">$tesseract</span><span> = </span><span class="keyword">new</span><span> TesseractOCR(__DIR__ . </span><span class="string">'/../uploads/'</span><span> . </span><span class="vars">$filename</span><span>); </span></span></li></ol>
Performing OCR on the image is straightforward; we only need to call the recognize() method:
<ol class="dp-c"><li class="alt"><span><span class="comment">// Perform OCR on the uploaded image</span><span> </span></span></li><li><span><span class="vars">$text</span><span> = </span><span class="vars">$tesseract</span><span>->recognize(); </span></span></li></ol>
Finally we display the results on the results page:
<ol class="dp-c"><li class="alt"><span><span class="keyword">return</span><span> </span><span class="vars">$app</span><span>[</span><span class="string">'twig'</span><span>]->render( </span></span></li><li><span> <span class="string">'results.twig'</span><span>, </span></span></li><li class="alt"><span> [ </span></li><li><span> <span class="string">'text'</span><span> => </span><span class="vars">$text</span><span>, </span></span></li><li class="alt"><span> ] </span></li><li><span>); </span></li></ol>
Try it on some images and see how it works. If you run into difficulties, you can refer to the code on GitHub.
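One caveat about the upload code above: a name built from time() alone is not really random, and two uploads arriving in the same second with the same extension would overwrite each other. Below is a minimal sketch of a more collision-resistant alternative, assuming PHP 7 or later for random_bytes(). The helper name uniqueUploadName() is hypothetical and is not part of Silex or the Tesseract wrapper.

```php
<?php

/**
 * Build a hard-to-guess filename for an uploaded file.
 * Hypothetical helper, for illustration only.
 */
function uniqueUploadName(string $originalName): string
{
    // Keep only the extension of the client-supplied name, lower-cased
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    // Strip anything that is not a lowercase letter or a digit
    $extension = preg_replace('/[^a-z0-9]/', '', $extension);

    // 8 random bytes become 16 hexadecimal characters
    $name = bin2hex(random_bytes(8));

    return $extension === '' ? $name : $name . '.' . $extension;
}

echo uniqueUploadName('Sign.PNG'), PHP_EOL;
```

In the route, this would take the place of the sprintf() call, for example `$filename = uniqueUploadName($file->getClientOriginalName());`.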
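A Practical Example
Let's look at a more practical use of OCR. In this example, we will try to find a formatted telephone number in an image.
Take a look at the following image and upload it to your application:
The result should look like this: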
<ol class="dp-c"><li class="alt"><span><span>:ii‘i </span></span></li><li><span>Customer Service Helplines </span></li><li class="alt"><span> </span></li><li><span>British Airways Helpline </span></li><li class="alt"><span> </span></li><li><span>09040 490 541 </span></li></ol>
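It hasn't picked out the body text, which is to be expected given the poor quality of the image. It has recognized the number, but there is also some "noise".
There are a few things we can do to extract the relevant information.
You can tell Tesseract to restrict its results to a certain set of characters, so we could ask it to return only digits, like this: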
<ol class="dp-c"><li class="alt"><span><span class="vars">$tesseract</span><span>->setWhitelist(range(0,9)); </span></span></li></ol>
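But there is a problem with this. Rather than ignoring non-numeric characters, Tesseract often interprets them as digits. For example, "Bob" might be read as the number "808".
So instead, we will use a two-step process.
First, we use a basic regular expression to extract strings that might be telephone numbers. Then we use Google's libphonenumber library to determine whether each candidate is a valid telephone number.
Note: I have previously written about Google's libphonenumber on SitePoint.
Let's add a PHP port of libphonenumber to the project. Edit composer.json and add: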
<ol class="dp-c"><li class="alt"><span><span class="string">"giggsey/libphonenumber-for-php"</span><span>: </span><span class="string">"~7.0"</span><span> </span></span></li></ol>
Don't forget to update:
<ol class="dp-c"><li class="alt"><span><span>composer update </span></span></li></ol>
Now we can write a function that takes a string and tries to extract a valid telephone number from it:
<ol class="dp-c"><li class="alt"><span><span class="comment">/**</span> </span></li><li><span><span class="comment">* Parse a string, trying to find a valid telephone number. As soon as it finds a</span> </span></li><li class="alt"><span><span class="comment">* valid number, it'll return it in E.164 format. If it can't find any, it'll</span> </span></li><li><span><span class="comment">* simply return NULL.</span> </span></li><li class="alt"><span><span class="comment">*</span> </span></li><li><span><span class="comment">* @param string $text The string to parse</span> </span></li><li class="alt"><span><span class="comment">* @param string $country_code The two-letter country code to use as a "hint"</span> </span></li><li><span><span class="comment">* @return string | NULL</span> </span></li><li class="alt"><span><span class="comment">*/</span><span> </span></span></li><li><span><span class="keyword">function</span><span> findPhoneNumber(</span><span class="vars">$text</span><span>, </span><span class="vars">$country_code</span><span> = </span><span class="string">'GB'</span><span>) { </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Get an instance of Google's libphonenumber</span><span> </span></span></li><li class="alt"><span> <span class="vars">$phoneUtil</span><span> = \libphonenumber\PhoneNumberUtil::getInstance(); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Use a simple regular expression to try and find candidate phone numbers</span><span> </span></span></li><li><span> preg_match_all(<span class="string">'/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/'</span><span>, </span><span class="vars">$text</span><span>, </span><span class="vars">$matches</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Iterate through the matches</span><span> </span></span></li><li class="alt"><span> <span class="keyword">foreach</span><span> (</span><span class="vars">$matches</span><span> </span><span class="keyword">as</span><span> </span><span class="vars">$match</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">foreach</span><span> (</span><span class="vars">$match</span><span> </span><span class="keyword">as</span><span> </span><span class="vars">$value</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> try { </span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Attempt to parse the number</span><span> </span></span></li><li><span> <span class="vars">$number</span><span> = </span><span class="vars">$phoneUtil</span><span>->parse(trim(</span><span class="vars">$value</span><span>), </span><span class="vars">$country_code</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Just because we parsed it successfully, doesn't make it valid - so check it</span><span> </span></span></li><li class="alt"><span> <span class="keyword">if</span><span> (</span><span class="vars">$phoneUtil</span><span>->isValidNumber(</span><span class="vars">$number</span><span>)) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// We've found a telephone number. Format using E.164, and exit</span><span> </span></span></li><li><span> <span class="keyword">return</span><span> </span><span class="vars">$phoneUtil</span><span>->format(</span><span class="vars">$number</span><span>, \libphonenumber\PhoneNumberFormat::E164); </span></span></li><li class="alt"><span> </span></li><li><span> } </span></li><li class="alt"><span> </span></li><li><span> } catch (\libphonenumber\NumberParseException <span class="vars">$e</span><span>) { </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Ignore silently; getting here simply means we found something that isn't a phone number</span><span> </span></span></li><li class="alt"><span> </span></li><li><span> } </span></li><li class="alt"><span> </span></li><li><span> } </span></li><li class="alt"><span> } </span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">return</span><span> null; </span></span></li><li><span> </span></li><li class="alt"><span>} </span></li></ol>
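Hopefully the comments explain what the function is doing. Note that if the library fails to parse a telephone number from a string, it throws an exception. That isn't a problem; we simply ignore it and move on to the next candidate.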
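To see the first step in isolation, the sketch below runs the same regular expression over the text recognized in the sample image and prints the candidates it finds. It uses only core PHP functions, and the helper name findCandidates() is hypothetical. Only index 0 of the preg_match_all() result is used here, because that entry holds the full matches while the remaining entries hold the individual capture groups.

```php
<?php

/**
 * Return the substrings of $text that look like they might be phone numbers.
 * Hypothetical helper, for illustration only; it does not validate anything.
 */
function findCandidates(string $text): array
{
    preg_match_all('/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/', $text, $matches);

    // $matches[0] holds the full matches; trim the surrounding whitespace
    return array_map('trim', $matches[0]);
}

$text = "Customer Service Helplines\n\nBritish Airways Helpline\n\n09040 490 541";

print_r(findCandidates($text));
```

Each candidate would then be handed to libphonenumber for parsing and validation, as the function above does.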
If we find a telephone number, we return it in E.164 format. This gives us an internationalized number that we can use to place a call or send an SMS.
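For reference, an E.164 number is a plus sign followed by at most 15 digits, beginning with the country code. The sketch below checks only that shape using core PHP. The function name looksLikeE164() is hypothetical, and the check is no substitute for isValidNumber(), which knows the numbering rules of each country.

```php
<?php

/**
 * Check whether a string has the general shape of an E.164 number:
 * "+", a non-zero first digit, and no more than 15 digits in total.
 * Hypothetical helper; it says nothing about whether the number exists.
 */
function looksLikeE164(string $number): bool
{
    // The D modifier stops "$" from matching before a trailing newline
    return preg_match('/^\+[1-9]\d{1,14}$/D', $number) === 1;
}

// International form of the sample number, assuming it is a GB number
var_dump(looksLikeE164('+449040490541'));

// National form with spaces, as it appears in the image
var_dump(looksLikeE164('09040 490 541'));
```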
Now we can use it as follows:
<ol class="dp-c"><li class="alt"><span><span class="vars">$text</span><span> = </span><span class="vars">$tesseract</span><span>->recognize(); </span></span></li><li><span><span class="vars">$number</span><span> = findPhoneNumber(</span><span class="vars">$text</span><span>, </span><span class="string">'GB'</span><span>); </span></span></li></ol>
We need to give libphonenumber a hint as to which country the number is from. You can change this to your own country.
Let's wrap all of this up in a new route:
<ol class="dp-c"><li class="alt"><span><span class="vars">$app</span><span>->post(</span><span class="string">'/identify-telephone-number'</span><span>, </span><span class="keyword">function</span><span>(Request </span><span class="vars">$request</span><span>) </span><span class="keyword">use</span><span> (</span><span class="vars">$app</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Grab the uploaded file</span><span> </span></span></li><li><span> <span class="vars">$file</span><span> = </span><span class="vars">$request</span><span>->files->get(</span><span class="string">'upload'</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Extract some information about the uploaded file</span><span> </span></span></li><li class="alt"><span> <span class="vars">$info</span><span> = </span><span class="keyword">new</span><span> SplFileInfo(</span><span class="vars">$file</span><span>->getClientOriginalName()); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Create a quasi-random filename</span><span> </span></span></li><li><span> <span class="vars">$filename</span><span> = sprintf(</span><span class="string">'%d.%s'</span><span>, time(), </span><span class="vars">$info</span><span>->getExtension()); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Copy the file</span><span> </span></span></li><li class="alt"><span> <span class="vars">$file</span><span>->move(__DIR__.</span><span class="string">'/../uploads'</span><span>, </span><span class="vars">$filename</span><span>); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Instantiate the Tesseract library</span><span> </span></span></li><li><span> <span class="vars">$tesseract</span><span> = </span><span class="keyword">new</span><span> TesseractOCR(__DIR__ . </span><span class="string">'/../uploads/'</span><span> . </span><span class="vars">$filename</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Perform OCR on the uploaded image</span><span> </span></span></li><li class="alt"><span> <span class="vars">$text</span><span> = </span><span class="vars">$tesseract</span><span>->recognize(); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="vars">$number</span><span> = findPhoneNumber(</span><span class="vars">$text</span><span>, </span><span class="string">'GB'</span><span>); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">return</span><span> </span><span class="vars">$app</span><span>->json( </span></span></li><li><span> [ </span></li><li class="alt"><span> <span class="string">'number'</span><span> => </span><span class="vars">$number</span><span>, </span></span></li><li><span> ] </span></li><li class="alt"><span> ); </span></li><li><span> </span></li><li class="alt"><span>}); </span></li></ol>
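We now have the basis of a simple API, hence the JSON response, which we could use as the back end of a simple mobile app for adding contacts or placing calls from a photo.
Summary
OCR has many applications, and it is easier to integrate into your own application than you might expect. In this article, we installed an open source OCR package and, using a wrapper library, integrated it into a very simple PHP application. We have only scratched the surface of what is possible; hopefully this has given you some ideas for how you might use OCR in your own applications.
Translation source: http://www.codeceo.com/article/php-ocr-tesseract-get-text.html
Original English article: OCR in PHP: Read Text from Images with Tesseract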
