How to use Python regular expressions for crawling and anti-crawling
When crawling websites, we often run into anti-crawling mechanisms, so we need tools and techniques to work around them. Regular expressions are one of the most important of these tools: they let us match and process data inside a crawler. Below, we introduce how to use Python regular expressions for crawling and for dealing with anti-crawling measures.
A regular expression is a tool for describing text patterns: it uses special symbols and characters to describe the shape of a target string. In Python, the re module provides regular expression support.
For example, if we want to match a phone number (in the format xxx-xxxx-xxxx), then we can use the following regular expression:
```python
import re

regex = re.compile(r'\d{3}-\d{4}-\d{4}')
```
In this regular expression, \d matches a digit, {3} matches exactly three of the preceding element, {4} matches exactly four, and - matches a literal hyphen. With this pattern we can match phone numbers in the xxx-xxxx-xxxx format.
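As a quick check, the compiled pattern can be exercised like this (the sample numbers are made up for illustration):

```python
import re

# The phone-number pattern from above; note the backslash in \d,
# which is required for "digit" (a bare d would match the letter d)
regex = re.compile(r'\d{3}-\d{4}-\d{4}')

print(bool(regex.fullmatch('138-1234-5678')))  # True: fits xxx-xxxx-xxxx
print(bool(regex.fullmatch('12-34-56')))       # False: wrong group lengths
```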
Before anti-crawling, we first need to crawl the content of the target website. In Python, we can use the requests library to obtain web page content. For example, if we want to get the ranking page of Maoyan movies, we can use the following code:
```python
import requests

url = 'https://maoyan.com/board'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
html = response.text
```
Here, the headers parameter carries a forged request header, which helps us bypass some anti-crawler checks, and response.text holds the content of the fetched page. We now have the source code of the target web page.
After getting the web page source code, we need to use regular expressions to extract the information we need. Taking the Maoyan movie rankings as an example, we want to get the names and release times of all movies in the rankings. By looking at the source code, we can find that this information is in the following HTML tags:
```html
<dd>
  <div class="movie-item-info">
    <p class="name"><a href="/films/1211269" title="误杀" data-act="boarditem-click" data-val="{movieId:1211269}">误杀</a></p>
    <p class="star">主演:肖央,谭卓,钟南山</p>
    <p class="releasetime">上映时间:2020-12-04</p>
  </div>
  <div class="movie-item-number score-num">
    <p class="score"><i class="integer">9.</i><i class="fraction">7</i></p>
  </div>
</dd>
```
We can use the following regular expression to match the movie name and release time:
```python
pattern = re.compile(
    r'<p class="name"><a href="/films/\d+" title="(.*?)" data-act="boarditem-click"'
    r'.*?<p class="releasetime">(.*?)</p>',
    re.S
)
```
In this regular expression, .*? is a non-greedy match, meaning it matches only as much text as necessary, and the re.S flag makes . match any character, including newlines. We now have a regular expression that matches movie titles and release times.
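A tiny self-contained demonstration of what re.S changes (the sample text is invented):

```python
import re

text = 'first line\nsecond line'

# Without re.S, '.' stops at newlines, so the pattern cannot span both lines
print(re.findall(r'first.*second', text))        # []

# With re.S (DOTALL), '.' also matches '\n', so the pattern crosses the newline
print(re.findall(r'first.*second', text, re.S))  # ['first line\nsecond']
```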
Next, we can use the re.findall function to extract all matches:
```python
movies = re.findall(pattern, html)
```
This returns a list in which each element is a tuple of the movie name and the release time. We have now successfully crawled all of the movie information on the Maoyan ranking page.
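To make the raw tuples easier to work with, they can be turned into dictionaries. The snippet below runs the pattern against a hypothetical fragment shaped like the Maoyan markup; the fragment and the dictionary keys are assumptions for illustration:

```python
import re

# Hypothetical snippet in the same shape as the Maoyan ranking page
html = '''
<p class="name"><a href="/films/1211269" title="误杀" data-act="boarditem-click" data-val="{movieId:1211269}">误杀</a></p>
<p class="releasetime">上映时间:2020-12-04</p>
'''

pattern = re.compile(
    r'<p class="name"><a href="/films/\d+" title="(.*?)" data-act="boarditem-click"'
    r'.*?<p class="releasetime">(.*?)</p>',
    re.S
)

# Convert each (name, release) tuple into a labeled dictionary
movies = [{'name': name, 'release': release}
          for name, release in re.findall(pattern, html)]
print(movies)  # [{'name': '误杀', 'release': '上映时间:2020-12-04'}]
```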
Before dealing with anti-crawling, we need to understand some common anti-crawler methods used by websites, such as access-frequency limits and IP blocking. To avoid triggering these mechanisms, we should simulate the normal behavior of a user. For example, when crawling the Maoyan ranking page, we can wait a random interval between requests to mimic a human browsing the web:
```python
import random
import time

interval = random.uniform(0, 3)
time.sleep(interval)
```
In this code snippet, random.uniform(0, 3) generates a random float between 0 and 3, and time.sleep(interval) pauses the program for that amount of time.
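The two lines above can be wrapped into a small helper so that every request in a crawl loop pauses for a random interval (the function name and the tiny cap are made up for this sketch):

```python
import random
import time

def polite_sleep(max_interval=3.0):
    """Pause for a random interval to mimic a human reading the page."""
    interval = random.uniform(0, max_interval)
    time.sleep(interval)
    return interval

# A tiny cap is used here only so the demo finishes quickly;
# in a real crawl, something like polite_sleep(3.0) would be typical
slept = polite_sleep(0.01)
print(0.0 <= slept <= 0.01)  # True
```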
Some websites use dynamic loading, meaning that JavaScript and other scripts generate page content at runtime. If we fetch such a page directly with the requests library, we only get the static HTML and miss the dynamically generated content. In that case we can use the Selenium library to drive a real browser so that the page loads fully. For example, if we want to get a Weibo comment page, we can use the following code:
```python
import time

from selenium import webdriver

url = 'https://weibo.com/xxxxxx'
browser = webdriver.Firefox()
browser.get(url)
time.sleep(10)  # wait for JavaScript to render the page
html = browser.page_source
```
Through the above code, we can get the complete page content, including the comment area generated by dynamic loading.
Summary
This article introduced how to use Python regular expressions for crawling and anti-crawling. The main content includes:
- basic regular expression syntax and the re module;
- fetching page source with the requests library and a forged User-Agent header;
- extracting data (movie names and release times) with re.findall;
- simulating human behavior with random delays to avoid frequency limits;
- using Selenium to load dynamically generated pages.
I hope these tips help you crawl more effectively, cope with anti-crawling measures, and obtain more of your target data.