Home > Article > Backend Development > Python crawler entry-level identification verification code
Preliminary situation: The content mentioned in this article was done by the blogger during the last summer vacation. I never settled down enough to write down my thoughts on paper. Fortunately, I have more free time during this holiday, so I thought I could I can write as much as I want, so this article is here.
I won’t say much about the introduction of verification codes. Various verification codes appear from time to time in people’s lives. As a student of Northeastern University, the blogger has the most daily contact with academic affairs. The system verification code is here.
Dongda’s verification code has been complained by students. It is too difficult to enter. It is not only case-sensitive, but sometimes you have entered it correctly, but an error message appears. At this time, prohibits your left-click copying
Maybe it's time to pop up.
(However, the Academic Affairs Office changed the content of the verification code in the 201Python crawler entry-level identification verification code-17 academic year to make it more convenient for humans to operate.)
It can be seen that the verification code of the Academic Affairs Office is very There are rules, and the size, position, shape, etc. of each letter and number are fixed, which is suitable for beginners with no foundation to identify verification codes.
Simulated login has complicated steps. Here, regardless of other operations, we are only responsible for returning an answer string based on an input verification code image.
We know that the verification code will make the picture colorful in order to create interference, and we first need to remove these interferences. This step requires continuous experimentation, enhancing the color of the picture, increasing the contrast, etc. can help.
After various operations on the pictures, I finally found a more perfect solution for removing interference. It can be seen that after removing the interference, under optimal circumstances, we will get a very pure black and white character picture. There are four characters in a picture. It is impossible to recognize all four characters at once. The picture needs to be cropped so that each small picture has only one character, and then each picture is recognized separately.
#The next step is to recognize the text, we First, convert the obtained small picture into a matrix represented by 01, each matrix represents a character.For example, the matrix of the number six
num_Python crawler entry-level identification verification codePython crawler entry-level identification verification code[ 0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,1,1,0,0,0,0,0,0, 0,0,0,0,1,1,1,0,0,0,0,0,0, 0,0,0,1,1,1,0,0,0,0,0,0,0, 0,0,0,1,1,0,0,0,0,0,0,0,0, 0,0,1,1,0,0,0,0,0,0,0,0,0, 0,0,1,1,0,0,0,0,0,0,0,0,0, 0,1,1,1,1,1,1,1,0,0,0,0,0, 0,1,1,1,1,1,1,1,1,0,0,0,0, 0,1,1,0,0,0,0,1,1,1,0,0,0, 0,1,1,0,0,0,0,0,1,1,0,0,0, 0,1,1,0,0,0,0,0,1,1,0,0,0, 0,1,1,1,0,0,0,1,1,1,0,0,0, 0,0,1,1,1,1,1,1,1,0,0,0,0, 0,0,0,1,1,1,1,1,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0, ]Looking at it from a distance, you can still distinguish it if you squint.
Because the verification code of the Academic Affairs Office of Dongda University is very regular, and the position of each number is fixed, so there is no need to involve any machine learning algorithm. It is just a simple matrix comparison. Just find the matrix with the highest similarity among all the implemented matrices. There are various comparison methods here. Anyway, as long as the data is simple and can be correctly identified.