最近想用Python写个爬虫去抓取一些东西,但是碰到个问题,就是验证码不知道该如何处理。
现在验证码一般有两种,一种是简单的,比如下面这种纯字符型的:
另外一种就是出来一些特定字符,需要按顺序点击的:
我看有的人说可以获取浏览器cookies写到程序里就直接通过验证了,有的说这个涉及到机器学习方面的东西。由于我个人以前没接触过这方面东西,所以不知道从何处入手,想问下要处理这种验证码的话,一般该如何处理? 有没有这方面合适的书推荐下啊……
迷茫2017-04-18 10:35:47
This itself uses verification code technology to prevent network programs such as crawlers. What I know about cracking verification codes is to use artificial intelligence image recognition. It seems that there are similar functions available, but the accuracy is not too high.
黄舟2017-04-18 10:35:47
For verification code issues, firstly, you can turn to the API provided by professional service providers (they use machine learning or artificial intelligence), such as Youyoutu; secondly, you can write your own verification code recognition program and provide a project for reference: https://github.com /luyishisi/...
迷茫2017-04-18 10:35:47
One solution is to manually log in to the browser and then extract the cookies and directly include them in the crawler request and send them out.
PHPz2017-04-18 10:35:47
Picture 1 is easy to process, the verification code is just a picture, and the verification code can be obtained through picture processing (ocr technology);
Picture 2 is more troublesome. If you use the first method, its numbers will be overlaid on the text. After getting the picture The content is more difficult. I don’t have any good methods for the second method. I hope students with experience in this field can help me with the answer
天蓬老师2017-04-18 10:35:47
Verification code is used to counter machines and crawlers. If the verification code can be easily bypassed by your automated crawler, can it still be called a verification code? The author should first find out what the mechanism of the verification code is, and then see if it is true. You can imagine that it can be easily bypassed. In short, unless there are loopholes in the verification code implementation of other websites, you cannot bypass the verification code mechanism. You can only recognize the text on the verification code, such as OCR (Optical Character Recognition) technology It is used to solve this problem. OCR refers to the process in which an electronic device (such as a scanner) checks the characters printed on the paper. It determines its shape by detecting the dark/light pattern, and then uses character recognition methods to translate the shape into computer text.
Basic steps for verification code recognition:
1. Preprocessing
2. Grayscale
3. Binarization
4. Denoising
5. Segmentation
6. Recognition
In short, the verification code identification threshold is high and the cost is high, so it is unavoidable.
For example, in the picture below, the verification codes are staggered and overlapping, making it difficult to identify.
怪我咯2017-04-18 10:35:47
The easiest way is to take out the cookies and write them in the code, but cookies are time-sensitive
大家讲道理2017-04-18 10:35:47
To deal with complex verification codes, the more efficient and time-saving method should be to connect to the coding platform and let their manual processing.