
How to use Python regular expressions for crawling and anti-crawling

WBOY | Original | 2023-06-23 09:19:42

When crawling, we often run into anti-crawling mechanisms, and we need tools and techniques to work around them. Regular expressions are one of the most important of these tools: they let us match and process data inside a crawler. Below, we introduce how to use Python regular expressions for crawling and for handling anti-crawling measures.

  1. Understand regular expressions

A regular expression is a tool for describing text patterns: with a set of special symbols, it describes the specific pattern a target string must follow. In Python, we can use the re module to work with regular expressions.

For example, if we want to match a phone number (in the format xxx-xxxx-xxxx), then we can use the following regular expression:

import re

regex = re.compile(r'\d{3}-\d{4}-\d{4}')

In this regular expression, \d matches a digit, {3} matches exactly 3 digits, {4} matches exactly 4 digits, and - matches a literal hyphen. With this regular expression, we can match phone numbers that fit the pattern.
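The pattern above can be exercised with a short script (the sample numbers below are made up purely for illustration):

```python
import re

# Three digits, a hyphen, four digits, a hyphen, four digits
phone_regex = re.compile(r'\d{3}-\d{4}-\d{4}')

text = "Call 123-4567-8901 or 555-0000-9999; 12-34-56 does not match."
matches = phone_regex.findall(text)
print(matches)  # ['123-4567-8901', '555-0000-9999']
```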

  2. Crawling web content

Before dealing with anti-crawling measures, we first need to fetch the content of the target website. In Python, we can use the requests library to obtain web page content. For example, to fetch the Maoyan movie ranking page, we can use the following code:

import requests

url = 'https://maoyan.com/board'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers)

html = response.text

Here, the headers parameter is a forged request header, which helps us bypass some anti-crawler mechanisms, and response.text is the content of the fetched web page. We now have the source code of the target page.
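A common refinement, shown here only as a sketch, is to rotate the User-Agent between requests so that traffic looks less uniform. The strings in the pool below are illustrative values, not part of the original code:

```python
import random

# A small pool of plausible desktop User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

def random_headers():
    """Build a request-headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

headers = random_headers()
print(headers['User-Agent'] in USER_AGENTS)  # True
```

The resulting dict can be passed to requests.get(url, headers=headers) exactly as in the snippet above.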

  3. Use regular expressions for data processing

After getting the web page source code, we need to use regular expressions to extract the information we need. Taking the Maoyan movie rankings as an example, we want to get the names and release times of all movies in the rankings. By looking at the source code, we can find that this information is in the following HTML tags:

<dd>
    <div class="movie-item-info">
        <p class="name"><a href="/films/1211269" title="误杀" data-act="boarditem-click" data-val="{movieId:1211269}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,钟南山
        </p>
<p class="releasetime">上映时间:2020-12-04</p>    </div>
    <div class="movie-item-number score-num">
        <p class="score"><i class="integer">9.</i><i class="fraction">7</i></p>        
    </div>
</dd>

We can use the following regular expression to match the movie name and release time:

pattern = re.compile(r'<p class="name"><a href="/films/\d+" title="(.*?)" data-act="boarditem-click".*?<p class="releasetime">(.*?)</p>', re.S)

In this regular expression, .*? is a non-greedy match: it matches as little text as necessary. re.S makes . match any character, including newlines. We now have a regular expression that matches movie titles and release times.
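The difference between greedy and non-greedy matching, and the effect of re.S, can be seen on a tiny contrived example:

```python
import re

html = '<p class="name">A</p><p class="name">B</p>'

greedy = re.findall(r'<p class="name">(.*)</p>', html)   # .* grabs as much as possible
lazy = re.findall(r'<p class="name">(.*?)</p>', html)    # .*? stops at the first </p>

print(greedy)  # ['A</p><p class="name">B']
print(lazy)    # ['A', 'B']

# re.S lets . also match newline characters:
print(re.findall(r'A.B', 'A\nB'))        # []
print(re.findall(r'A.B', 'A\nB', re.S))  # ['A\nB']
```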

Next, we can use the findall method of regular expressions to extract the matching results:

movies = re.findall(pattern, html)

This call returns a list in which each element is a tuple containing the movie name and release time. We have now successfully crawled all the movie information from the Maoyan movie ranking page.
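Applying the pattern to the <dd> snippet quoted above (inlined here as test input so the sketch is self-contained) shows the shape of the result:

```python
import re

# The sample <dd> block quoted earlier, reused as test input
html = '''<dd>
    <div class="movie-item-info">
        <p class="name"><a href="/films/1211269" title="误杀" data-act="boarditem-click" data-val="{movieId:1211269}">误杀</a></p>
        <p class="star">
                主演:肖央,谭卓,钟南山
        </p>
<p class="releasetime">上映时间:2020-12-04</p>    </div>
</dd>'''

pattern = re.compile(
    r'<p class="name"><a href="/films/\d+" title="(.*?)" data-act="boarditem-click"'
    r'.*?<p class="releasetime">(.*?)</p>',
    re.S)

movies = pattern.findall(html)
print(movies)  # [('误杀', '上映时间:2020-12-04')]
```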

  4. Simulate user behavior

To deal with anti-crawling, we need to understand the common countermeasures websites use, such as access-frequency limits and IP bans. To avoid triggering these mechanisms, we should simulate the normal behavior of a user. For example, when crawling the Maoyan movie ranking page, we can insert a random delay to mimic a human browsing pace:

import random
import time

interval = random.uniform(0, 3)
time.sleep(interval)

In this snippet, random.uniform(0, 3) generates a random number between 0 and 3, and time.sleep(interval) pauses the program for that long.
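Wrapping the delay in a small helper makes it easy to call between consecutive requests. This is a sketch; the function name and bounds are our own choices, not part of the original code:

```python
import random
import time

def polite_sleep(min_s=0.0, max_s=3.0):
    """Pause for a random interval to mimic a human browsing pace."""
    interval = random.uniform(min_s, max_s)
    time.sleep(interval)
    return interval

# Called between consecutive page requests; a shorter bound here just for the demo
waited = polite_sleep(0.0, 0.5)
print(0.0 <= waited <= 0.5)  # True
```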

  5. Handle dynamically loaded pages

Some websites generate page content dynamically with JavaScript or other client-side scripts. If we fetch such a page directly with the requests library, we only get the static HTML and miss the dynamically generated content. In that case, we can use the Selenium library to drive a real browser so the page loads normally. For example, to get a Weibo comment page, we can use the following code:

import time

from selenium import webdriver

url = 'https://weibo.com/xxxxxx'

browser = webdriver.Firefox()
browser.get(url)

time.sleep(10)

html = browser.page_source

With the above code, we obtain the complete page content, including the dynamically loaded comment section.

Summary

This article introduces how to use Python regular expressions to crawl and anti-crawl. The main content includes:

  1. Understand regular expressions;
  2. Crawl web content;
  3. Use regular expressions for data matching;
  4. Simulate user behavior;
  5. Handle dynamically loaded pages.

I hope these tips can help you better crawl and anti-crawl and obtain more target data.

