
Use PHP to simulate login and crawl websites that require login to access

WBOY
2023-06-13 12:21:17

As the Internet has developed, more and more websites require users to log in before their data can be accessed. This is a challenge for programmers and researchers who need that data. This article explains how to use PHP to simulate a login and crawl pages that are only available after logging in.

What is simulated login?

Simulated login means performing the login through code rather than manually in a browser, so that the program can retrieve data that is only available after logging in. This saves a lot of time and effort when logged-in access is needed frequently.

Steps to use PHP to simulate login

Before starting to use PHP to simulate login, we need to understand some basic concepts and steps.

  1. Get the login page

First, we need the URL of the login page. Using the browser's developer tools, we can inspect the action and method attributes of the login form, which tell us where the form is submitted and with which HTTP method. Alternatively, we can open the login page in the browser and read the same information from the page source.
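As a rough sketch of this step, the form's action and method can also be read programmatically with PHP's DOM extension. The function name find_form_target and the sample markup are hypothetical, made up for illustration. A real login page may contain several forms or build its request in JavaScript, in which case this approach does not apply directly.

```php
<?php
// Hypothetical helper: returns the action and method of the first <form> in $html,
// or null when the markup contains no form.
function find_form_target(string $html): ?array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return null;
    }
    return [
        'action' => $form->getAttribute('action'),
        // HTML forms default to GET when no method attribute is given.
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
    ];
}

// Made-up markup standing in for a downloaded login page.
$sample = '<html><body><form action="/login" method="post"><input name="username"></form></body></html>';
print_r(find_form_target($sample));
```

A relative action such as /login still has to be resolved against the login page's URL before it can be used as the request target.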

  2. Analyze the login form

Next, we need to analyze each field in the login form. By looking at the name attribute of the form element, we can determine what data needs to be submitted in the form. In order to log in successfully, we need to clarify the fields that need to be submitted and their corresponding values.
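The field analysis can be sketched in code as well. The example below lists every named input in a piece of HTML together with its current value, which makes hidden fields easy to spot. The helper name collect_input_fields and the sample markup are assumptions for illustration only.

```php
<?php
// Hypothetical helper: maps the name of each named <input> in $html to its value attribute.
function collect_input_fields(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        // Inputs without a name are not submitted with the form, so skip them.
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up form: one visible field and one hidden field with a preset value.
$sample = '<html><body><form><input type="text" name="username">'
        . '<input type="hidden" name="token" value="abc123"></form></body></html>';
print_r(collect_input_fields($sample));
```

Fields that come back with a preset value, typically hidden ones, usually have to be sent back unchanged along with the credentials.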

  3. Send login request

To submit the login form, we need to build an HTTP request. PHP's cURL functions let us send the request the way a browser would, passing the login form data to the server as POST parameters. Some request headers also deserve attention, such as User-Agent and Referer, because a server may check them.

  4. Verify login result

Finally, we need to verify whether the login succeeded. A first check is the HTTP status code: many sites answer a successful login with a 302 redirect to the page we want to access, and some answer a failed login with 401 (Unauthorized) or 403 (Forbidden). This is not universal, though. A site may also return 200 with an error message when the login fails, so it is safer to check the response body or the final URL as well.
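One way to combine these checks is sketched below. The function name login_looks_successful and the idea of a "marker" string (some text that only appears for logged-in users, such as a logout link) are assumptions for illustration. The right marker depends entirely on the target site.

```php
<?php
// Hypothetical helper: guesses whether a login attempt succeeded.
// $status - HTTP status code of the response
// $body   - response body
// $marker - text expected only on logged-in pages (site-specific assumption)
function login_looks_successful(int $status, string $body, string $marker): bool
{
    // 401 and 403 are explicit rejections.
    if ($status === 401 || $status === 403) {
        return false;
    }
    // A redirect that was not followed is treated as success here.
    if ($status === 302) {
        return true;
    }
    // Otherwise rely on the content: a 200 response may simply be the login form again.
    return $status === 200 && strpos($body, $marker) !== false;
}

var_dump(login_looks_successful(200, '<a href="/logout">Log out</a>', '/logout'));
var_dump(login_looks_successful(200, '<form>Wrong password</form>', '/logout'));
```

Treating every 302 as success is a simplification, since some sites also redirect back to the login page after a failed attempt.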

Worked example

With these basics in place, we can walk through a concrete example.

  1. Get the login page

Let’s take the Zhihu website as an example. First, we need to get the URL of the login page.

$url = 'https://www.zhihu.com/signin';

  2. Analyze the login form

Next, we need to analyze the login form of Zhihu. You can view the name attribute of the form element through the browser developer tools.

<input type="text" name="username" />
<input type="password" name="password" />
<input type="hidden" name="_xsrf" value="xxxxxx" />

From the code above, we can see that the login form submits a username and a password, plus a hidden field named _xsrf. Its value is a random token that the server generates to prevent CSRF attacks. It changes between visits, so it must be read from the login page each time rather than hard-coded.
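A minimal sketch of reading that token and placing it in the POST data follows. The helper name extract_hidden_value and the page fragment are hypothetical. In practice the HTML would come from a request to the login page, ideally made with the same cookie file as the later login request, on the assumption that the token is tied to the session.

```php
<?php
// Hypothetical helper: returns the value of the hidden input named $fieldName, or null if absent.
function extract_hidden_value(string $html, string $fieldName): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // $fieldName is interpolated into the query, so only pass trusted names without quotes.
    $nodes = $xpath->query('//input[@type="hidden"][@name="' . $fieldName . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('value');
}

// Made-up page fragment mirroring the form shown above.
$page = '<html><body><form><input type="hidden" name="_xsrf" value="xxxxxx" /></form></body></html>';
$data = [
    'username' => 'your_username',
    'password' => 'your_password',
    '_xsrf'    => extract_hidden_value($page, '_xsrf'),
];
print_r($data);
```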

  3. Send login request

With the above information, we can construct an HTTP request to simulate the login operation.

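// Form submission URL; note that it differs from the URL of the login page itself.
$url = 'https://www.zhihu.com/login/phone_num';
// Field names must match what the form actually submits;
// this example uses phone_num for the account field.
$data = array(
    'phone_num' => 'your_phone_number',
    'password' => 'your_password',
    '_xsrf' => 'xxxxxx' // placeholder: use the token read from the login page
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow the redirect issued after login
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);     // keep certificate verification enabled
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);        // check that the certificate matches the host
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');  // cookies are written here when the handle is closed
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt'); // cookies are read from here
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Referer: https://www.zhihu.com/signin'
));
$response = curl_exec($ch);
if ($response === false) {
    echo 'cURL error: ' . curl_error($ch);
}
// $ch is deliberately left open so that curl_getinfo() can still be called on it;
// call curl_close($ch) once the status has been checked.

echo $response;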

In the code above, we use the cURL functions to build a POST request that carries the form data, the request headers and the cookie settings. CURLOPT_COOKIEJAR names the file that cookies received from the server are saved to, and CURLOPT_COOKIEFILE names the file that cookies are read from when a request is sent. Pointing both at the same file lets later requests to login-protected pages reuse the session. The header values to imitate, such as User-Agent, can be copied from the browser's developer tools.
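Reusing the saved cookies for a later request might look like the sketch below. The helper names protected_page_options and fetch_protected_page, and the example.com URL in the final comment, are placeholders chosen for illustration. The sketch assumes a successful login has already written cookie.txt.

```php
<?php
// Hypothetical helper: builds cURL options for a GET request that reuses saved cookies.
function protected_page_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // send the cookies saved during login
        CURLOPT_COOKIEJAR      => $cookieFile, // keep any cookies the server refreshes
    ];
}

// Hypothetical helper: fetches $url with the saved session.
// Returns the response body, or null if the transfer failed.
function fetch_protected_page(string $url, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, protected_page_options($url, $cookieFile));
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? null : $body;
}

// Example call with a placeholder URL (not executed here):
// $html = fetch_protected_page('https://example.com/account', 'cookie.txt');
```

Keeping the options in a separate function is only a design choice that makes them easy to inspect. The same options could be set one by one with curl_setopt(), as in the login request.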

  4. Verify login result

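If the login is successful, the server should redirect us to the homepage or another page that requires login. We can check this through the HTTP status code. Because the request sets CURLOPT_FOLLOWLOCATION, cURL follows the redirect itself and reports the status of the final page, so a followed redirect shows up as 200 with a redirect count above zero rather than as 302. This check is only a heuristic, since a site may also redirect after a failed login.

// Run this while $ch is still open: before PHP 8.0, curl_getinfo() fails after curl_close($ch).
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$redirects = curl_getinfo($ch, CURLINFO_REDIRECT_COUNT);
// 302: the redirect was not followed; 200 after at least one redirect: it was followed.
if ($http_code == 302 || ($http_code == 200 && $redirects > 0)) {
    echo 'Login succeeded!';
} else {
    echo 'Login failed!';
}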

Summary

This article introduced how to use PHP to simulate login and crawl websites that require login to access. Note that simulated login carries some risks, such as leaking private data or having your IP address blocked. Before using it, make sure you understand the target website's crawler policy, comply with the relevant laws and regulations, and protect your own privacy and rights.
