The power of Scrapy: How to recognize and process verification codes?
Jun 22, 2023 pm 03:09 PM

Scrapy is a powerful Python framework that makes it easy to crawl data from websites. However, problems arise when the target website uses a verification code (CAPTCHA). CAPTCHAs exist to keep automated crawlers from abusing a website, so they tend to be complex and hard to solve automatically. In this post, we'll cover how to use Scrapy to recognize and process CAPTCHAs so that a crawler can get past them.

What is a verification code?

A CAPTCHA is a test used to check that the user is a real human being and not a machine. It is usually an obfuscated text string or a distorted image, and the user has to type in or select what is displayed. CAPTCHAs are designed to stop automated bots and scripts and so protect websites from malicious attacks and abuse.

There are usually three types of CAPTCHAs:

  1. Text CAPTCHA: The user is required to type a displayed string of text into the input box to prove they are a human and not a bot.
  2. Number verification code: The user is required to enter the displayed number in the input box.
  3. Image verification code: The user is required to enter the characters or numbers shown in an image. This is usually the most difficult type to crack because the characters or numbers in the image can be distorted, misplaced, or covered by other visual noise.

Why do you need to process verification codes?

Crawlers usually run automatically and at large scale, so they are easily identified as bots and blocked from obtaining data. CAPTCHAs were introduced for exactly this purpose. Once a request reaches the verification code stage, the Scrapy crawler cannot proceed without the expected input and stops collecting data, which reduces the efficiency and completeness of the crawl.

Therefore, we need a way to handle the verification code so that our crawler can automatically pass and continue its task. Usually we use third-party tools or APIs to complete the recognition of verification codes. These tools and APIs use machine learning and image processing algorithms to recognize images and characters, and return the results to our program.
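As a rough illustration of the round trip described above, the sketch below builds a JSON request body from image bytes and parses a solver's reply. The field names (`image`, `type`, `text`) and the response shape are assumptions made for illustration, not any particular provider's API; the real format comes from your provider's documentation.

```python
import base64
import json


def build_payload(image_bytes: bytes, captcha_type: str = "image") -> str:
    """Encode a CAPTCHA image as a JSON request body (hypothetical field names)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded, "type": captcha_type})


def parse_solution(response_body: str):
    """Return the recognized text, or None if the reply carries no usable answer."""
    try:
        data = json.loads(response_body)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    text = data.get("text")
    return text if isinstance(text, str) and text else None
```

Keeping the encoding and parsing in small functions like these makes it possible to test them without any network access.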

How to handle verification codes in Scrapy?

Open the project's settings.py file and modify the DOWNLOADER_MIDDLEWARES setting so that it registers the following middleware:

# Built-in middlewares live under scrapy.downloadermiddlewares in current
# Scrapy releases; the old scrapy.contrib paths were removed.
# ChunkedTransferMiddleware from older listings is omitted because recent
# Scrapy versions no longer ship it.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 350,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 400,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 900,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 800,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'tutorial.middlewares.CaptchaMiddleware': 999,
}
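Scrapy merges DOWNLOADER_MIDDLEWARES with its built-in defaults (DOWNLOADER_MIDDLEWARES_BASE), so a shorter variant that registers only the custom class should also work. This is a sketch, assuming the project module is named `tutorial` as in the listing above.

```python
# settings.py (minimal variant)
DOWNLOADER_MIDDLEWARES = {
    # 999 places the middleware closest to the downloader: its
    # process_request() runs last and its process_response() runs first.
    "tutorial.middlewares.CaptchaMiddleware": 999,
}
```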

In this example, we use CaptchaMiddleware to handle the verification code. CaptchaMiddleware is a custom middleware class that intercepts each download request, calls an API to recognize the verification code when needed, fills the code into the request, and then lets the request continue.

Code example:

# CaptchaClient stands for your solver SDK's client class; import it from that SDK.
# It is expected to provide solve(url) and check(url, code).

class CaptchaMiddleware:

    def __init__(self):
        self.client = CaptchaClient()
        self.max_attempts = 5

    def process_request(self, request, spider):
        if 'captcha' in request.meta:
            # A retried request already carries a freshly solved code
            captcha = request.meta.pop('captcha')
        else:
            # No code yet, so ask the solver for one
            captcha = self.get_captcha(request.url, logger=spider.logger)

        if captcha:
            # Attach the code to the request headers in place
            request.headers['Captcha-Code'] = captcha
            request.headers['Captcha-Type'] = 'math'
            spider.logger.debug(f'has captcha: {captcha}')

        # Returning None lets the request continue to the downloader;
        # returning a Request here would reschedule it instead of downloading it
        return None

    def process_response(self, request, response, spider):
        # Requests without a captcha header need no check
        if 'Captcha-Code' not in request.headers:
            return response

        # Stop retrying once the attempt limit is reached (a separate key
        # from the retry_times used by Scrapy's RetryMiddleware)
        retry_times = request.meta.get('captcha_retry_times', 0)
        if retry_times >= self.max_attempts:
            return response

        # Header values are stored as bytes, so decode before checking
        code = request.headers['Captcha-Code'].decode()
        if not self.client.check(request.url, code):
            # The check failed: retry with a newly solved code
            spider.logger.warning(f'Captcha check fail: {request.url}')
            meta = dict(request.meta)  # keep the existing meta entries
            meta['captcha'] = self.get_captcha(request.url, logger=spider.logger)
            meta['captcha_retry_times'] = retry_times + 1
            return request.replace(meta=meta, dont_filter=True)

        # The check succeeded, so pass the response on
        spider.logger.debug(f'Captcha check success: {request.url}')
        return response

    def get_captcha(self, url, logger=None):
        captcha = self.client.solve(url)
        if captcha:
            if logger:
                logger.debug(f'get captcha [0:4]: {captcha[0:4]}')
            return captcha

        return None

In this middleware, the CaptchaClient object acts as the CAPTCHA-solving backend. It stands for whichever solver client you use and is expected to expose solve() and check(). More than one solving backend can be used if needed.
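One possible way to combine several solving backends is a small wrapper that tries them in order. The sketch below is hypothetical: `FallbackSolver` and the `Solver` type are names introduced here for illustration, and each backend is assumed to be a callable that takes a URL and returns the recognized text or None.

```python
from typing import Callable, Optional, Sequence

# A solver takes a page URL and returns the recognized text, or None on failure.
Solver = Callable[[str], Optional[str]]


class FallbackSolver:
    """Try several solver backends in order and return the first answer."""

    def __init__(self, solvers: Sequence[Solver]):
        self.solvers = list(solvers)

    def solve(self, url: str) -> Optional[str]:
        for solver in self.solvers:
            try:
                answer = solver(url)
            except Exception:
                # A failing backend should not stop the others from being tried.
                continue
            if answer:
                return answer
        return None
```

The wrapper only covers solve(); a client used by the middleware would still need its own check() method.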

Notes

When implementing this middleware, please pay attention to the following points:

  1. Recognizing and processing verification codes relies on third-party tools or APIs. Make sure you hold a valid license and use them according to the provider's terms.
  2. Adding this kind of middleware makes the request flow more complex, so test and debug carefully to make sure the crawler still works as expected.
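One way to make that testing repeatable is to swap the real solver for a scripted fake, so the retry paths can be exercised without network calls. The class below is a hypothetical helper that mirrors the solve() and check() methods the middleware calls on CaptchaClient; it is not part of Scrapy or of any solver SDK.

```python
class FakeCaptchaClient:
    """Scripted stand-in for a solver client, for offline tests."""

    def __init__(self, answers, valid_code):
        self._answers = list(answers)  # codes handed out by solve(), in order
        self.valid_code = valid_code   # the only code that check() accepts
        self.solve_calls = []          # URLs passed to solve(), for assertions

    def solve(self, url):
        self.solve_calls.append(url)
        return self._answers.pop(0) if self._answers else None

    def check(self, url, code):
        return code == self.valid_code
```

In a test, assigning an instance to the middleware's `client` attribute (for example `FakeCaptchaClient(["0000", "1234"], valid_code="1234")`) simulates one failed attempt followed by a successful one.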

Conclusion

By combining Scrapy with a middleware for verification code recognition and processing, a crawler can get past verification code defenses and keep collecting data from the target website. This approach usually takes far less time and effort than entering verification codes by hand. However, be sure to read and comply with the license agreements and requirements of any third-party tools and APIs before using them.
