PHP crawler practice: crawling Baidu search results
Search engines make it easy to find almost any information on the Internet. For developers, retrieving search engine data programmatically is a useful skill. In this article, we will use PHP to write a crawler that fetches Baidu search results.
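1. Working principle of a crawler
Before we begin, let's review the basic principle of how a crawler works: it sends an HTTP request to a target URL, receives the page's HTML in response, and parses that HTML to extract the data it needs.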
2. The process of crawling Baidu search results
First, we need to construct the request URL from the search keywords. Taking a search for "PHP crawler" as an example, the request URL is: https://www.baidu.com/s?ie=UTF-8&wd=PHP%20crawler
Here, ie=UTF-8 specifies UTF-8 encoding, and wd= is followed by the search keywords. Spaces and non-ASCII characters in the keywords must be percent-encoded, which is why the space appears as %20.
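Rather than encoding the keywords by hand, the query string can be assembled with PHP's built-in http_build_query(). The following is a minimal sketch; the helper name build_search_url() is an assumption made for illustration and does not come from the original code.

```php
<?php
// Hypothetical helper (not from the original article): builds the Baidu
// search URL for a keyword, percent-encoding the query parameters.
function build_search_url(string $keyword): string
{
    $params = [
        'ie' => 'UTF-8',   // encoding of the query string
        'wd' => $keyword,  // search keywords
    ];

    // PHP_QUERY_RFC3986 encodes spaces as %20 instead of +.
    return 'https://www.baidu.com/s?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo build_search_url('PHP crawler'), PHP_EOL;
// prints https://www.baidu.com/s?ie=UTF-8&wd=PHP%20crawler
```

The same helper also handles non-ASCII keywords: passing the Chinese keyword 'PHP 爬虫' yields wd=PHP%20%E7%88%AC%E8%99%AB, provided the source file is saved as UTF-8.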
In PHP, we can use the cURL library to send HTTP requests. The specific implementation code is as follows:
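<?php
// Send a GET request to $url and return the response body as a string.
function curl_request($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, 0);         // leave response headers out of the output
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

// wd is the percent-encoded keyword "PHP 爬虫" (PHP crawler).
$url = 'https://www.baidu.com/s?ie=UTF-8&wd=PHP%20%E7%88%AC%E8%99%AB';
$html = curl_request($url);
echo $html;
?>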
Here, we use the curl_request() function to send a request and obtain the page content.
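In practice a request can fail or time out, and curl_exec() then returns false rather than a page. The sketch below shows one way to make the request more defensive. The function name fetch_page(), the timeout values, and the User-Agent string are illustrative assumptions, and whether Baidu serves a normal results page to such a client is not guaranteed.

```php
<?php
// Sketch of a more defensive variant of curl_request(). The name
// fetch_page(), the timeouts, and the User-Agent string are assumptions
// made for illustration, not values from the original article.
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_HEADER         => false,           // leave response headers out of the output
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)',
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; curl_error() describes why.
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }

    return $body;
}
```

It is called the same way as curl_request(), for example $html = fetch_page($url);, with the call wrapped in try/catch where a failed request should not stop the script.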
Next, we use regular expressions to parse the page content and extract the data we need from the search results. The browser's developer tools let us view the page source, find the HTML elements that hold the required data, and write a regular expression that matches them.
For example, to get the title and link of each search result, we can look for markup like the following in the page source:
<h3 class="t"><a href="link URL" target="_blank">title</a></h3>
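We can match it with the following regular expression. Note that [\s\S] matches any character, including newlines, and that the / in each closing tag must be escaped because / is also the pattern delimiter:
$pattern = '/<h3 class="t"><a([\s\S]*?)href="(.*?)"[\s\S]*?>([\s\S]*?)<\/a><\/h3>/';
preg_match_all($pattern, $html, $matches);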
Here, we use the preg_match_all() function to implement regular expression matching and save the matching results in the $matches variable.
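To see how the results are laid out in $matches, the pattern can be run against a small sample string. This is only a sketch: the URLs and titles below are made-up placeholders, not real search results.

```php
<?php
// Sample markup in the shape described above; the URLs and titles are
// made-up placeholders, not real search results.
$html = '<h3 class="t"><a href="https://example.com/a" target="_blank">First title</a></h3>'
      . '<h3 class="t"><a data-click="1" href="https://example.com/b" target="_blank">Second title</a></h3>';

$pattern = '/<h3 class="t"><a([\s\S]*?)href="(.*?)"[\s\S]*?>([\s\S]*?)<\/a><\/h3>/';
$count = preg_match_all($pattern, $html, $matches);

// With the default PREG_PATTERN_ORDER flag:
//   $matches[0] holds the full matches,
//   $matches[1] the text between "<a" and "href" (group 1),
//   $matches[2] the link URLs (group 2),
//   $matches[3] the titles (group 3).
echo $count, PHP_EOL;   // prints 2
print_r($matches[2]);   // the two example.com URLs
print_r($matches[3]);   // "First title" and "Second title"
```

Because the arrays share the same indexes, $matches[2][$i] and $matches[3][$i] always belong to the same search result.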
Finally, we will output the extracted search results to get the data we want. The specific implementation code is as follows:
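<?php
// Walk the matched links (group 2) and print the title (group 3) of each one.
foreach ($matches[2] as $key => $url) {
    echo ($key + 1) . '、' . $matches[3][$key] . '<br>';
}
?>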
Here, we use a foreach loop to traverse the matched links and titles, and output the results to the page.
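Matched titles may still contain nested markup (for example, highlighting tags around the keyword) and HTML entities. The sketch below pairs each link with a cleaned-up title before output. The helper name collect_results() and the sample data are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original article): pairs each matched
// link with its title and strips markup from the title text.
function collect_results(array $matches): array
{
    $results = [];
    foreach ($matches[2] as $key => $url) {
        $title = strip_tags($matches[3][$key]);                   // drop nested tags such as <em>
        $title = html_entity_decode($title, ENT_QUOTES, 'UTF-8'); // e.g. &amp; becomes &
        $results[] = [
            'title' => trim($title),
            'url'   => html_entity_decode($url, ENT_QUOTES, 'UTF-8'),
        ];
    }
    return $results;
}

// Made-up data in the shape preg_match_all() fills in for groups 2 and 3.
$matches = [
    2 => ['https://example.com/a?x=1&amp;y=2'],
    3 => [' <em>PHP</em> crawler &amp; tips '],
];

foreach (collect_results($matches) as $i => $result) {
    echo ($i + 1) . '. ' . $result['title'] . ' (' . $result['url'] . ')' . PHP_EOL;
}
// prints: 1. PHP crawler & tips (https://example.com/a?x=1&y=2)
```

When the cleaned values are written back into an HTML page rather than to the console, passing them through htmlspecialchars() first avoids injecting markup.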
3. Summary
In this article, we covered the basic principle of a PHP crawler and how to use PHP to crawl Baidu search results. Keep in mind that crawling raises legal and ethical issues: comply with the relevant regulations and do not carry out crawling that is illegal or otherwise prohibited.