When collecting data, we often use curl together with regular expressions to extract the data we need. Drawing on my own work experience, I will share some common custom functions I have written, here on Blog Garden. If anything in my writing is wrong or unclear, please let me know.
This is a series and cannot be finished in a day or two, so I will publish it one part at a time.
Rough outline:
1. curl data collection series: single-page collection function get_html
2. curl data collection series: multi-page parallel collection function get_htmls
3. curl data collection series: regular expression processing function get_matches
4. curl data collection series: code separation
5. curl data collection series: parallel logic control function web_spider
...
Single-page collection is the most commonly used operation in data collection. Sometimes, when the server restricts access, it is the only method available. It is slow but easy to control, so it is important to have a well-written, reusable curl function.
We are all familiar with Baidu and NetEase, so we will use the homepages of these two sites as examples.
The simplest way to write it:

```php
$url = 'http://www.baidu.com';
$ch = curl_init($url);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
```
Because this is used so often, we can wrap it in a function using curl_setopt_array:

```php
function get_html($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return false;
    }
    return $html;
}
```

```php
$url = 'http://www.baidu.com';
echo get_html($url);
```
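Note that get_html assigns CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT after it receives $options, so a caller cannot override the 5-second timeout. The following is a minimal sketch of one alternative, not the version used in the rest of this series. The names merge_curl_options and get_html_with_defaults are hypothetical. It relies on PHP's array union operator, which keeps the left operand's value when a key exists on both sides.

```php
<?php
// Hypothetical helper (not from the original post): merge caller options
// over the defaults. With the + operator, keys already present in the
// left operand are kept, so caller-supplied values take precedence.
function merge_curl_options(array $options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    );
    return $options + $defaults;
}

// Hypothetical variant of get_html that uses the merged options.
function get_html_with_defaults($url, array $options = array()) {
    $ch = curl_init($url);
    curl_setopt_array($ch, merge_curl_options($options));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // string on success, false on failure
}
```

With this variant, a call such as get_html_with_defaults($url, array(CURLOPT_TIMEOUT => 10)) would use a 10-second timeout, while a call without options keeps the 5-second default.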
Sometimes you need to pass specific options to get the correct page. For example, suppose we now want to fetch the NetEase homepage:

```php
$url = 'http://www.163.com';
echo get_html($url);
```
You will see a blank page with nothing on it. Let's write a function using curl_getinfo to see what is going on:

```php
function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    // The body is not used here; we only want the transfer details
    $html = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}

$url = 'http://www.163.com';
var_dump(get_info($url));
```
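The array returned by curl_getinfo has many fields, and for this kind of diagnosis the status code and the redirect target are usually the ones worth reading first. As a rough sketch, a small helper can summarize them. The name describe_response is hypothetical, and it assumes the 'http_code' and 'redirect_url' keys that curl_getinfo provides in its result array.

```php
<?php
// Hypothetical helper (not from the original post): summarize the array
// returned by curl_getinfo() into a short human-readable string.
function describe_response(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;

    // Status codes that point the client at another URL
    if (in_array($code, array(301, 302, 303, 307, 308), true)) {
        $target = !empty($info['redirect_url']) ? $info['redirect_url'] : '(unknown)';
        return "redirect ($code) to $target";
    }
    if ($code >= 200 && $code < 300) {
        return "ok ($code)";
    }
    return "unexpected status ($code)";
}
```

It could be used as echo describe_response(get_info($url)); in place of the var_dump call.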
You can see that http_code is 302, which means the request was redirected. In this case we need to pass an extra option so that curl follows the redirect:

```php
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);
```
You will notice that this page is different from the one we see when visiting from a computer. Why?
It seems the options are still not enough for the server to tell what kind of device our client is, so it returns a generic version.
It looks like we also have to send a User-Agent:
```php
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url, $options);
```
OK, now the page comes out as expected. By passing options in this way, the get_html function can cover most extensions of this kind.
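If the same redirect and User-Agent options are needed for many pages, they can be bundled into one place. This is only a sketch: the function name browser_like_options is hypothetical, and the CURLOPT_MAXREDIRS cap of 5 is an assumption added here to avoid following redirect loops forever, not something the examples above set.

```php
<?php
// Hypothetical helper (not from the original post): build an options array
// that makes a request look like it comes from a desktop browser.
function browser_like_options(
    $userAgent = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0'
) {
    return array(
        CURLOPT_FOLLOWLOCATION => true,       // follow 3xx redirects
        CURLOPT_MAXREDIRS      => 5,          // assumption: cap the redirect chain
        CURLOPT_USERAGENT      => $userAgent, // identify as a desktop browser
    );
}
```

The call then becomes echo get_html('http://www.163.com', browser_like_options());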
Of course, there are other ways to achieve this. When you know the exact URL of the NetEase page, you can simply fetch it directly:

```php
$url = 'http://www.163.com/index.html';
echo get_html($url);
```
This also fetches the page normally.
That's all for today. Bye-bye!
