


A brief discussion on crawlers and bypassing website anti-crawling mechanisms
What is a crawler? Put simply (and somewhat one-sidedly), a crawler is a tool that lets a computer automatically interact with a server to obtain data. At its most basic, a crawler fetches the source code of a web page; going a level deeper, it sends POST requests to the page and reads the data the server returns. In a word, crawlers automatically obtain source data; further data processing and so on is follow-up work. This article focuses on the data-acquisition part. When crawling, please pay attention to the website's robots.txt file, and do not let your crawler break the law or cause harm to the site.
A not-quite-proper example of the concepts of anti-crawling and counter-anti-crawling
For many reasons (server resources, data protection, and so on), many websites limit what crawlers can do.
Think about it: if a human plays the role of the crawler, how do we get a page's source code? The most common way is, of course, right-clicking and viewing the source.
What if the website blocks right-clicking?
Then take out the most useful tool we have in crawling: F12 (discussion welcome).
Press F12 to open the developer tools (funny), and the source code is right there!
When a person is treated as the crawler, blocking the right click is the anti-crawling strategy, and F12 is the counter-anti-crawling method.
Now let's talk about proper anti-crawling strategies.
In practice, when writing a crawler you will inevitably hit cases where no data comes back. Often the server is checking the UA header (User-Agent); this is very basic anti-crawling, and the fix is simply to add a UA header when sending the request. Isn't that easy? Adding all of the required request headers is a simple and crude method that works.
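As a minimal sketch of that idea (the User-Agent string and URL below are placeholders, and the third-party requests library is assumed for the actual call):

```python
# Minimal sketch of adding a User-Agent header. The UA string and URL are
# placeholders, not taken from any real site in this article.

def build_headers():
    """Headers that impersonate a common desktop browser. Real crawlers
    often copy the full header set shown in the browser's F12 Network tab."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        )
    }

# Actual use (requires network access and the requests library):
#   import requests
#   resp = requests.get("https://example.com", headers=build_headers())
#   print(resp.status_code, len(resp.text))
```
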
Have you noticed that a website's verification code is also an anti-crawling strategy? To make sure the site's users are real people, verification codes have truly made a great contribution, and along with them came verification-code recognition.
Speaking of which, I wonder which came first: verification-code recognition or image recognition?
Recognizing simple verification codes is very easy nowadays; there are plenty of tutorials online, including slightly more advanced concepts such as denoising, binarization, segmentation, and recombination. But website human-machine verification has become more and more formidable, for example:
Let's briefly talk about the concepts of denoising and binarization.
Binarization turns a verification-code image into just two tones: the picture is reduced to pure black and white. For a simple code this can be achieved with Image.convert("1") in the Python PIL library. However, if the image is more complex, you still need to think harder: applying the simple method directly to a noisy code leaves a result that is still hard to read.
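The basic binarization step can be sketched as follows. Pillow (PIL) is assumed to be installed, and a tiny synthetic grayscale image stands in for the captcha, since the article's sample images are not reproduced here:

```python
# Sketch of simple binarization with Pillow; `pip install Pillow` assumed.
from PIL import Image

# Build a tiny grayscale image: light background with one dark "stroke" row.
img = Image.new("L", (8, 8), color=230)
for x in range(8):
    img.putpixel((x, 4), 20)

# convert("1") reduces the image to pure black and white (mode "1").
bw = img.convert("1")
print(bw.mode)  # "1"
```

On a real captcha you would open the file with Image.open(path) instead of synthesizing pixels.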
Think about how to recognize such a verification code. This is where denoising comes in handy. Based on the characteristics of the code itself, you can work out the RGB values of the background and of everything other than the font, and map all of those values to a single color, leaving only the font. Sample code is below; it does nothing more than change colors:
import numpy as np
from PIL import Image

image = Image.open("captcha.png").convert("RGB")  # placeholder path
arr = np.array(image)                             # shape: (height, width, 3)
bg_color = [210, 210, 210]                        # placeholder: your captcha's background RGB

for y in range(0, image.size[1]):                 # numpy rows are the y axis
    for x in range(0, image.size[0]):
        pixel = arr[y][x].tolist()
        if pixel == bg_color:
            arr[y][x] = 0
        elif (pixel[0] in range(200, 256)
              and pixel[1] in range(200, 256)
              and pixel[2] in range(200, 256)):
            arr[y][x] = 0                         # near-white noise -> background color
        elif pixel == [0, 0, 0]:
            arr[y][x] = 0
        else:
            arr[y][x] = 255                       # everything else (the font) -> white
Here arr is the matrix of the image's RGB values obtained with numpy. Readers can try to improve the code and experiment for themselves.
After careful processing, the picture can become
The recognition rate is still very high.
As verification codes evolved, off-the-shelf libraries appeared online for fairly clear digits and letters and for simple arithmetic codes. For harder digits, letters, and Chinese characters you can also build your own recognizer (as above), but beyond that there is enough material for an entire artificial-intelligence project... (there are even jobs that consist of recognizing verification codes...)
Add a little tip: Some websites have verification codes on the PC side, but not on the mobile phone side...
Next topic!
One of the more common anti-crawling strategies is IP blocking: too many visits in a short period usually gets the IP banned. This is easy to handle: limit the access frequency or add an IP proxy pool. You can also go distributed... not used that often, but it works.
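Both ideas can be sketched in a few lines. The proxy addresses below are made-up placeholders; in real use the chosen proxy would be passed to something like requests.get(url, proxies={"http": proxy, "https": proxy}):

```python
# Sketch of a rotating proxy pool plus a simple request-rate limiter.
import itertools
import time

PROXIES = [
    "http://10.0.0.1:8080",  # placeholder proxies; fill in real ones
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
_rotation = itertools.cycle(PROXIES)

def next_proxy():
    """Return proxies in round-robin order, so no single IP gets hammered."""
    return next(_rotation)

def polite_delay(seconds=1.0):
    """Sleep between requests to keep the access frequency low."""
    time.sleep(seconds)

print(next_proxy())  # http://10.0.0.1:8080
print(next_proxy())  # http://10.0.0.2:8080
```
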
Another anti-crawling hurdle is asynchronously loaded data. As your crawling goes deeper (obviously it's the websites that keep updating!), asynchronous loading is a problem you will definitely run into, and the solution is still F12. Taking the NetEase Cloud Music site (kept anonymous, ha) as an example: after right-clicking to open the source, try searching for a comment you can see on the page.
Where is the data?! That is the effect of asynchronous loading, a feature that rose with JS and Ajax. But open F12, switch to the Network tab, refresh the page, and search carefully: there are no secrets.
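Once you find the XHR request in the Network tab, you can replay it with a normal HTTP client and parse its JSON body. The payload below is a made-up example merely shaped like a comments API response; the real endpoint and field names will differ:

```python
# Parsing an asynchronously loaded JSON payload (hypothetical sample data).
import json

raw = """
{
  "total": 2,
  "comments": [
    {"user": "alice", "content": "great song"},
    {"user": "bob",   "content": "on repeat all day"}
  ]
}
"""

data = json.loads(raw)
texts = [c["content"] for c in data["comments"]]
print(texts)  # ['great song', 'on repeat all day']
```

In practice you would fetch raw from the XHR URL (copying the request headers shown in F12), then parse it exactly like this.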
Oh, by the way, if you are listening to a song, you can click in and download it...
This is only to explain how the website is structured; please consciously resist piracy, protect copyright, and protect the interests of original creators.
What should you do if the website restricts even this? We have one last resort, an invincible combination: selenium + PhantomJS.
This combination is very powerful and can perfectly simulate browser behavior. Please refer to a search engine for specific usage. This method is not recommended, because it is very cumbersome; it is mentioned here only as background. (Note: PhantomJS itself is no longer maintained; headless Chrome or Firefox now plays the same role alongside selenium.)
The above is the detailed content of A brief discussion on crawlers and bypassing website anti-crawling mechanisms. For more information, please follow other related articles on the PHP Chinese website!
