How to crawl Lianjia rental information with PHP
As demand for rental housing keeps growing, real estate listing sites such as Lianjia.com and 58.com have developed rapidly. For renters, getting rental information quickly matters a great deal, and writing a PHP crawler to collect Lianjia rental listings is an efficient and convenient way to do so.
This article introduces a simple PHP method for crawling Lianjia rental information, so that you can quickly collect and organize the listings you need and find a rental you are satisfied with.
1. Crawl the website source code
First of all, the most important thing for a crawler is to obtain the source code of the target web page. We therefore use PHP's cURL functions to fetch the source of the Lianjia rental homepage. The code is as follows:
$url = "https://sz.lianjia.com/zufang/"; // Lianjia rental homepage URL
$ch = curl_init(); // initialize cURL
curl_setopt($ch, CURLOPT_URL, $url); // set the URL to fetch
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page content instead of printing it directly
$data = curl_exec($ch); // execute the request
curl_close($ch);
echo $data; // output the page source
The code above calls curl_init() to initialize a cURL handle. The two curl_setopt() calls set the URL of the target page and tell cURL to return the page content rather than print it directly. curl_exec() then performs the request, and the returned page source is stored in the $data variable. Finally, curl_close() closes the handle.
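The snippet above assumes the request always succeeds. As a minimal sketch of a more defensive version, the fetch can be wrapped in a function that checks for transfer errors and the HTTP status. The function name fetchHtml(), the timeout values, and the user-agent string are illustrative assumptions, not part of the original code.

```php
<?php
declare(strict_types=1);

/**
 * Fetch a page and return its body, or throw if the transfer fails.
 * fetchHtml() and its option values are illustrative assumptions.
 */
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole transfer
        CURLOPT_ENCODING       => '',              // accept any compression cURL supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; rental-crawler-example)', // placeholder
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on network-level failures
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    if ($status !== 200) {
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }

    return $body;
}
```

With this in place, the first step becomes $data = fetchHtml("https://sz.lianjia.com/zufang/"); wrapped in a try/catch block, so a failed request is reported instead of passing false on to the parsing steps.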
2. Analyze the web page source code
After successfully obtaining the source code of the Lianjia rental homepage, we need to analyze it to find the required rental information. During analysis, regular expressions need to be used to match the required information.
In the source code of the Lianjia rental homepage, the rental listings are contained in divs with the class "content__list--item", and each listing is its own div, so we can use a regular expression to match these divs. The regular expression is as follows:
// "#" is used as the delimiter because the pattern itself contains "/" (in </em>, 元/月, </span>, and so on)
$preg = '#<div class="content__list--item".*?>.*?<div class="content__list--item--main">.*?<span class="content__list--item-price"><em>(.*?)</em>元/月</span>.*?<a.*?>(.*?)</a>.*?<span class="content__list--item--des">(.*?)</span>.*?<i>(.*?)</i>.*?</div>.*?</div>#si'; // match each listing div and capture its price, title, description, and region
The regular expression above matches each div that contains a rental listing and captures four values from inside it: price, title, description, and region. Of the two pattern modifiers, s lets . match newline characters, so the pattern can span multiple lines of markup, and i makes the match case-insensitive.
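To show how such a pattern is applied, here is a small sketch that runs it with preg_match_all() against a hand-written HTML fragment. The fragment is hypothetical and only mimics the structure the pattern expects; the live page may differ. Because the pattern contains "/" characters, the sketch uses "#" as the delimiter.

```php
<?php
// Hypothetical sample markup shaped like the structure the pattern expects.
$sample = <<<'HTML'
<div class="content__list--item">
  <div class="content__list--item--main">
    <span class="content__list--item-price"><em>5200</em>元/月</span>
    <a href="/zufang/example.html">Sample two-bedroom flat</a>
    <span class="content__list--item--des">2 rooms, 68 sqm</span>
    <i>Nanshan</i>
  </div>
</div>
HTML;

// Same pattern as in the article, split across lines for readability.
$preg = '#<div class="content__list--item".*?>.*?<div class="content__list--item--main">'
      . '.*?<span class="content__list--item-price"><em>(.*?)</em>元/月</span>'
      . '.*?<a.*?>(.*?)</a>'
      . '.*?<span class="content__list--item--des">(.*?)</span>'
      . '.*?<i>(.*?)</i>.*?</div>.*?</div>#si';

// PREG_SET_ORDER groups the captures per listing: $m[1] is price, $m[2] title, and so on.
$count = preg_match_all($preg, $sample, $matches, PREG_SET_ORDER);

$listings = [];
foreach ($matches as $m) {
    $listings[] = [
        'price'  => trim($m[1]),
        'title'  => trim($m[2]),
        'desc'   => trim($m[3]),
        'region' => trim($m[4]),
    ];
}

print_r($listings);
```

A pattern like this only matches when the elements appear in exactly this order, which is one reason the following steps switch to a DOM parser.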
3. Parse the web page source code
Once the divs that hold the rental listings have been located, we need to parse each one to get the specific details it contains, such as rent and address. Here we can use PHP's DOMDocument class to work with the HTML tags.
The specific code for using the DOMDocument class to parse HTML tags is as follows:
$dom = new DOMDocument();
$dom->loadHTML($data);
$domxpath = new DOMXPath($dom);
$element = $domxpath->query('//div[@class="content__list--item"]');
foreach ($element as $el) {
    // do the specific parsing here
}
In the code above, we first load the fetched page source into a DOMDocument and create a DOMXPath object to run XPath queries against it. We then call the query() method to select every div that holds a rental listing, and use a foreach loop to iterate over those divs.
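Two practical details are worth sketching here. First, loadHTML() emits warnings on markup it considers invalid, which is common on real pages; libxml_use_internal_errors(true) collects them silently instead. Second, @class="..." compares the whole attribute value, so a div that carries an additional class is skipped. The fragment below is hypothetical (including the class name content__list--item--featured) and contrasts the exact match with a token-based match.

```php
<?php
// Hypothetical markup: the second listing carries an extra class.
$html = <<<'HTML'
<div class="content__list">
  <div class="content__list--item">
    <div class="content__list--item--main"><a href="#">Listing A</a></div>
  </div>
  <div class="content__list--item content__list--item--featured">
    <div class="content__list--item--main"><a href="#">Listing B</a></div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();            // discard the collected warnings

$xpath = new DOMXPath($dom);

// Exact comparison against the whole class attribute.
$exact = $xpath->query('//div[@class="content__list--item"]');

// Token comparison: pad the class list with spaces and look for the padded name.
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " content__list--item ")]'
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```

The token form does not match content__list--item--main, because the padded search string requires a space directly after "item".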
4. Extract the required information
While iterating over the div for each rental listing, we run further XPath queries relative to that div to extract the required information, such as price and address. The code is as follows:
// Inside the foreach loop; $el is the div of the current listing
// extract the price
$price = $domxpath->query('.//span[@class="content__list--item-price"]/em', $el)->item(0)->nodeValue;
// extract the title
$title = $domxpath->query('.//a', $el)->item(0)->nodeValue;
// extract the description
$desc = $domxpath->query('.//span[@class="content__list--item--des"]', $el)->item(0)->nodeValue;
// extract the region
$region = $domxpath->query('.//i', $el)->item(0)->nodeValue;
In the code above, we call the query() method with the current listing's div as the context node to find the HTML element that holds each piece of information, use the item() method to select the first element in the resulting node list, and then read the nodeValue property to get that element's text content.
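One caveat: item(0) returns null when a query matches nothing, so reading nodeValue directly fails for a listing that lacks one of the elements. A small helper can guard against that. This is a sketch; firstText() is a hypothetical name, and the sample markup is made up for illustration.

```php
<?php
/**
 * Return the trimmed text of the first node matching $expr under $context,
 * or null when nothing matches. firstText() is a hypothetical helper.
 */
function firstText(DOMXPath $xpath, string $expr, DOMNode $context): ?string
{
    $nodes = $xpath->query($expr, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Hypothetical listing that has a price and a title but no <i> element.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(
    '<div class="content__list--item">'
    . '<span class="content__list--item-price"><em> 5200 </em></span>'
    . '<a href="#">Sample flat</a>'
    . '</div>'
);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$el    = $xpath->query('//div[@class="content__list--item"]')->item(0);

$price  = firstText($xpath, './/span[@class="content__list--item-price"]/em', $el);
$region = firstText($xpath, './/i', $el);

var_dump($price);  // string(4) "5200"
var_dump($region); // NULL
```

Returning null rather than an empty string lets the caller tell a missing element apart from an empty one.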
5. Integrate the required information
Finally, we integrate all the required information into an associative array.
$info = ['price'=>$price, 'title'=>$title, 'desc'=>$desc, 'region'=>$region];
Next, we add the integrated information to an array, and output the entire array after traversing all the div elements where the rental information is located.
    $result[] = $info; // append this listing's array to $result
} // end of the foreach loop from step 3
print_r($result); // output the array of all rental listings
With these steps we obtain the price, title, description, and region of every listing on the Lianjia rental page, which makes searching for a rental much more convenient.
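Since the loop is spread across several steps above, it may help to see the parsing stage in one piece. The sketch below is one possible assembly, not the article's exact code: parseListings() is a hypothetical name, the sample HTML is made up, and it uses DOMXPath::evaluate() with the XPath string() function in place of item(0)->nodeValue, so a missing element yields an empty string.

```php
<?php
/**
 * Parse listing divs out of an HTML string and return one associative
 * array per listing. parseListings() is a hypothetical name.
 */
function parseListings(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $result = [];

    foreach ($xpath->query('//div[@class="content__list--item"]') as $el) {
        // string(...) gives the text of the first matching node, or '' if none.
        $text = function (string $expr) use ($xpath, $el): string {
            return trim($xpath->evaluate("string($expr)", $el));
        };

        $result[] = [
            'price'  => $text('.//span[@class="content__list--item-price"]/em'),
            'title'  => $text('.//a'),
            'desc'   => $text('.//span[@class="content__list--item--des"]'),
            'region' => $text('.//i'),
        ];
    }

    return $result;
}

// Hypothetical sample: the second listing has no <i> element.
$sample = <<<'HTML'
<div class="content__list--item">
  <span class="content__list--item-price"><em>5200</em></span>
  <a href="#">Sample flat A</a>
  <span class="content__list--item--des">2 rooms, 68 sqm</span>
  <i>Nanshan</i>
</div>
<div class="content__list--item">
  <span class="content__list--item-price"><em>3800</em></span>
  <a href="#">Sample flat B</a>
  <span class="content__list--item--des">1 room, 35 sqm</span>
</div>
HTML;

$listings = parseListings($sample);
print_r($listings);
```

Initializing $result to an empty array before the loop also means print_r() has something valid to print when the page contains no listings.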
Summary
This article has walked through a method for crawling Lianjia rental information with PHP. In short, we use the cURL functions to fetch the page source, locate the HTML elements that hold the required information, use the DOMDocument and DOMXPath classes to parse them, and finally collect the extracted values into associative arrays and output the combined array as the final rental information.