
How to read the source code of the redirected web page in PHP

PHPz
2023-03-31 09:05:09

PHP is a widely used server-side scripting language for building dynamic web applications. Sometimes a PHP developer needs to read the source code of an external web page whose URL redirects to another address. In this article, we will learn how to use PHP to read the source code of the page behind a redirect link.

Note: In this article, we will assume that you are already familiar with the PHP language and have a basic understanding of HTML and the HTTP protocol.

Step 1: Open the link using cURL

cURL is a library for transferring data to and from URLs, and PHP exposes it through the cURL extension. To read the source code of a web page, we first use cURL to request its URL. The following is the basic code for requesting a web page with cURL in PHP:

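$url = 'http://www.example.com';
$ch = curl_init();                               // create a cURL handle
curl_setopt($ch, CURLOPT_URL, $url);             // the address to request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
$output = curl_exec($ch);                        // the response body, or false on failure
curl_close($ch);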

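In the above code, we first define the address of the web page to read, then create a cURL handle, set the URL option, send the request, and obtain the response. The result is saved in the $output variable. Because CURLOPT_RETURNTRANSFER is set, curl_exec() returns the response body as a string instead of printing it, and it returns false if the request fails.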
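Since curl_exec() returns false on failure, it can help to wrap the request in a small function that reports the error. The following is a minimal sketch, not part of the original example: the function name fetch_page() and the 10-second timeout are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper (name and default timeout are assumptions):
// fetch a URL and fail loudly instead of silently returning false.
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes the transport failure (DNS, timeout, ...).
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released automatically when $ch goes out of scope.
    return $body;
}
```

A caller would then write $output = fetch_page($url); inside a try/catch block and handle RuntimeException where the failure should be reported.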

Step 2: Handle redirect links

In some cases, the link we open is a redirect link: the server answers with a 3xx status code and a Location header that holds the address of the target page. To obtain the source code of the redirected page, we either let cURL follow the Location header for us, or read the target address from the response and open it with a second cURL request.
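To make the Location check concrete, here is a small sketch that extracts the header from a raw header block. The function name extract_location() and the sample response are assumptions for illustration; with cURL, the raw headers can be obtained by enabling CURLOPT_HEADER and cutting the output at CURLINFO_HEADER_SIZE.

```php
<?php
// Hypothetical helper: return the value of the Location header from a raw
// HTTP header block, or null when no Location header is present.
function extract_location(string $rawHeaders): ?string
{
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive, so match "location" in any case.
        if (preg_match('/^Location:\s*(.+)$/i', trim($line), $m)) {
            return trim($m[1]);
        }
    }
    return null;
}

// Sample header block (made up for this example).
$headers = "HTTP/1.1 302 Found\r\n"
         . "Location: https://www.example.com/new\r\n"
         . "Content-Length: 0\r\n\r\n";

echo extract_location($headers), "\n"; // prints https://www.example.com/new
```

Note that a Location value may also be a relative path, in which case it has to be resolved against the URL that was requested.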

The following is a code example:

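$url = 'http://www.example.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Ask cURL to follow Location headers itself.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

// Fallback: a 3xx status here means cURL did not follow the redirect,
// so request the target reported in redirect_url manually.
$redirectCodes = [301, 302, 303, 307, 308];
if (in_array($info['http_code'], $redirectCodes, true) && !empty($info['redirect_url'])) {
    $url = $info['redirect_url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);
}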

In the above code, we added one option: CURLOPT_FOLLOWLOCATION. This option tells cURL to follow redirects and request the new address automatically, so $output normally already holds the source code of the final page. We then call curl_getinfo() to check the final status code. If it is still a redirect code, cURL did not follow the redirect, so we create a new handle with curl_init(), open the address reported in redirect_url, and obtain the source code from there.
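A single manual retry only handles one redirect, while a real link may pass through several. The following sketch shows one way to follow a chain by hand with a hop limit. Everything in it is an assumption for illustration: follow_redirects() is a hypothetical name, and $fetch stands for any callable that returns an array with 'status', 'location' and 'body' keys (in practice it would wrap a cURL request with CURLOPT_FOLLOWLOCATION disabled). When cURL follows redirects itself, the built-in CURLOPT_MAXREDIRS option serves the same purpose.

```php
<?php
// Hypothetical sketch: follow redirects by hand, at most $maxHops times.
// $fetch is an assumed callable: fn(string $url): array with the keys
// 'status' (int), 'location' (?string) and 'body' (string).
function follow_redirects(callable $fetch, string $url, int $maxHops = 5): array
{
    $redirectCodes = [301, 302, 303, 307, 308];
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $response = $fetch($url);
        $isRedirect = in_array($response['status'], $redirectCodes, true)
            && !empty($response['location']);
        if (!$isRedirect) {
            return ['url' => $url, 'body' => $response['body']];
        }
        $url = $response['location']; // assumed to be an absolute URL
    }
    throw new RuntimeException("Too many redirects (limit: $maxHops)");
}

// Stand-in fetcher backed by an array, so the sketch runs without a network.
$pages = [
    'http://a.example/' => ['status' => 301, 'location' => 'http://b.example/', 'body' => ''],
    'http://b.example/' => ['status' => 200, 'location' => null, 'body' => '<html>final</html>'],
];
$result = follow_redirects(fn(string $u): array => $pages[$u], 'http://a.example/');
echo $result['url'], "\n"; // prints http://b.example/
```

Passing the fetcher in as a callable keeps the loop logic separate from the network code, which makes the hop limit easy to check in isolation.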

Step 3: Parse the source code

After obtaining the source code of the web page, we need to further parse it so that we can process the data. We can use PHP's built-in DOMDocument class to parse HTML documents.

The following is a code example:

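$url = 'http://www.example.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Ask cURL to follow Location headers itself.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

// Fallback: a 3xx status here means cURL did not follow the redirect,
// so request the target reported in redirect_url manually.
$redirectCodes = [301, 302, 303, 307, 308];
if (in_array($info['http_code'], $redirectCodes, true) && !empty($info['redirect_url'])) {
    $url = $info['redirect_url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);
}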

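$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of hiding them with @
$doc->loadHTML($output);
libxml_clear_errors();
$elements = $doc->getElementsByTagName('html');
// item(0) is null when the page has no <title> element
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? $titleNode->nodeValue : '';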

In the above code, we first create a DOMDocument object and then call its loadHTML() method, passing in the web page source code we obtained. Next, we use the getElementsByTagName() method to select elements by tag name and the nodeValue property to read an element's text content. In this example, we select the html element and read the text of the title element.
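The same DOMDocument calls can be used to collect more than the title. As an illustration, the sketch below gathers the title and every link target from an HTML string. The function name summarize_html() and the sample markup are assumptions made for this example.

```php
<?php
// Hypothetical helper: pull the <title> text and all link targets
// out of a non-empty HTML string.
function summarize_html(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parse warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }

    return [
        'title' => $titleNode !== null ? trim($titleNode->textContent) : null,
        'links' => $links,
    ];
}

// Sample markup (made up for this example).
$html = '<html><head><title>Example Domain</title></head>'
      . '<body><a href="/about">About</a><a>no href</a></body></html>';
print_r(summarize_html($html));
```

Returning null for a missing title lets the caller distinguish a page without a title element from a page whose title is empty.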

Step 4: Process the data

Finally, we can process the obtained data and store or display it as needed. The following is a simple example:

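$url = 'http://www.example.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Ask cURL to follow Location headers itself.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

// Fallback: a 3xx status here means cURL did not follow the redirect,
// so request the target reported in redirect_url manually.
$redirectCodes = [301, 302, 303, 307, 308];
if (in_array($info['http_code'], $redirectCodes, true) && !empty($info['redirect_url'])) {
    $url = $info['redirect_url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);
}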

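$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of hiding them with @
$doc->loadHTML($output);
libxml_clear_errors();
// item(0) is null when the page has no <title> element
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? $titleNode->nodeValue : '';
echo "The page title is: " . $title . "\n";
echo "The HTML source is: " . $output;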

In the above code, we first get the title of the web page and print it, and then print the HTML source code unchanged.
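If the result is shown in a browser rather than on the command line, printing the fetched source unchanged would let the remote page's markup and scripts run inside our own page. One option is to escape it first. The following is only a sketch under that assumption; render_source() is a hypothetical name and the sample values are made up.

```php
<?php
// Hypothetical helper: build an HTML fragment that shows the title and the
// fetched source as text, instead of letting the browser interpret them.
function render_source(string $title, string $html): string
{
    $flags = ENT_QUOTES | ENT_SUBSTITUTE;
    return '<h1>' . htmlspecialchars($title, $flags, 'UTF-8') . '</h1>'
         . '<pre>' . htmlspecialchars($html, $flags, 'UTF-8') . '</pre>';
}

echo render_source('A & B', '<p>Hello</p>'), "\n";
// prints <h1>A &amp; B</h1><pre>&lt;p&gt;Hello&lt;/p&gt;</pre>
```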

Conclusion

In this article, we learned how to use PHP to read the source code of a redirected web page. By using cURL to open the link, handling redirects, parsing the HTML document, and processing the data, we can read the source code of the page a redirect link finally points to. This is useful in scenarios such as web crawling, data analysis, and data mining.
