search
HomeBackend DevelopmentPHP Tutorial3 ways to implement data collection in PHP

3 ways to implement data collection in PHP

Mar 27, 2018 am 11:56 AM
phpaccomplishdata collection

What is collection? It is to use PHP programs to capture information from other websites into our own database and website. This article mainly shares with you 3 methods of data collection in PHP, hoping to help everyone.

PHP production and collection technology:

From the bottom socket to the high-level file operation function, there are three methods to achieve collection.

1. Use socket technology to collect:

Socket collection is the lowest level. It just establishes a long connection, and then we have to construct the http protocol string ourselves to send the request.

For example, if you want to get the content of this page, http://tv.youku.com/?spm=a2hww.20023042.topNav.5~1~3!2~A, use socket to write as follows:

<?php
//连接,$error错误编号,$errstr错误的字符串,30s是连接超时时间
$fp=fsockopen("www.youku.com",80,$errno,$errstr,30);
if(!$fp) die("连接失败".$errstr);
 
//构造http协议字符串,因为socket编程是最底层的,它还没有使用http协议
$http="GET /?spm=a2hww.20023042.topNav.5~1~3!2~A HTTP/1.1\r\n";   //  \r\n表示前面的是一个命令
$http.="Host:www.youku.com\r\n";  //请求的主机
$http.="Connection:close\r\n\r\n";   // 连接关闭,最后一行要两个\r\n
 
//发送这个字符串到服务器
fwrite($fp,$http,strlen($http));
//接收服务器返回的数据
$data=&#39;&#39;;
while (!feof($fp)) {
$data.=fread($fp,4096);  //fread读取返回的数据,一次读取4096字节
}
//关闭连接
fclose($fp);
var_dump($data);
?>

The printed result is as follows, including the returned header information and the source code of the page:

2. Use curl_a set of functions

curl encapsulates the HTTP protocol into many functions, and you can directly pass the corresponding parameters, which reduces the difficulty of writing HTTP protocol strings .

Prerequisite: The curl extension must be enabled in php.ini.

//生成一个curl对象
$curl=curl_init();
//设置URL和相应的选项
curl_setopt($curl, CURLOPT_URL, "http://www.youku.com");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  //将curl_exec()获取的信息以字符串返回,而不是直接输出。
//执行curl操作
$data=curl_exec($curl);
var_dump($data);

The printed result is as follows, including only the source code of the page:

3. Use it directly file_get_contents (top-level)

Prerequisite: Set the url address that allows opening a network in php.ini.

//使用file_get_contents()
$data=file_get_contents("http://www.youku.com");
var_dump($data);


Selection of 3 methods

The above three methods are mainly used for communication between networks. The latter two are more commonly used: If you want to collect a large amount of data in batches, use the second [CURL], which has good performance and stability.

Use the third method when sending a few requests occasionally but not frequently and intensively.

Extension: How to break the anti-leeching of pictures?

For example, the pictures on the 7060 website are protected from hotlinking: the pictures can be seen on his website, but cannot be accessed outside the site.

Principle: There is a referer item in the HTTP protocol, which represents the source address of the request. The server will determine if If this request is not from this website, it will be filtered out:

Solution: Simulate it yourself when sending HTTP Just referer:

Extension: Some data must be collected when you need to log in first. You can use the simulated trial simulation to log in. Collection under status:

a. First log in using browser. After logging in, there will be SESSIONID

in the COOKIE of the browser.

b. 发PHP发HTTP协议时,把浏览器中的SESSIONID放到PHP的HTTP协议请求里,这样就在以登录的状态发请求。

总结:所有客户端发过来的数据都可以被模拟,所以服务器上的程序必须要必要的地方过滤客户端的数据。

什么时候用以上东西?接口开发时、采集时。

二、数据采集

例如我要采集这个url里的所有美国电影的信息,

http://list.youku.com/category/show/c_96_a_%E7%BE%8E%E5%9B%BD_s_1_d_1_p_3.html

则先要知道电影所在的节点的结构,我们使用firebug查看。

 

然后开始写代码:完整代码如下

/**
 * 发一个GET请求获取数据
 */
function get($url)
{
   global $curl;
   // 配置curl中的http协议->可配置的荐可以查PHP手册中的curl_
   curl_setopt($curl, CURLOPT_URL, $url);
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
   curl_setopt($curl, CURLOPT_HEADER, FALSE);
   // 执行这个请求
   return curl_exec($curl);
}
 
// 生成一个curl对象
$curl = curl_init();
$url=&#39;http://list.youku.com/category/show/c_96_a_%E7%BE%8E%E5%9B%BD_s_1_d_1_p_3.html&#39;;
$data=get($url);
// 匹配电影所在位置
$list_preg = &#39;/<li class="yk-col4 mr1">.+<\/li>/Us&#39;;
// 匹配img标签上的src和alt
$img_preg = &#39;/<img class="quic lazy"  src="/static/imghwm/default1.png"  data-src="(.*)"   _  alt="(.*)" \/>/U&#39;;
//匹配电影的url
$video_preg=&#39;/<a href="(.*)" title="(.*)" target="(.*)"><\/a>/U&#39;;
//把所有的li存到$list里,$list是个二维数组
preg_match_all($list_preg,$data,$list);
   //var_dump($list);
foreach ($list[0] as $k => $v) {   //这里$v就是每一个li标签
/* 获取图片及电影名称
    preg_match($img_preg,$v,$img);  //把匹配到的图片的信息存到$img里
    var_dump($img);
    */
    /*获取电影地址
    preg_match($video_preg,$v,$video);  //把匹配到的电影的信息存到$video里
    var_dump($video);
*/
    preg_match($img_preg,$v,$img);
    preg_match($video_preg,$v,$video);
    echo $img[0].&#39;<a href="&#39;.$video[1].&#39;">&#39;.$video[2].&#39;</a>&#39;;
}

测试:

打印$list;

 

打印$img

 

打印$video

 

 

最终效果:

 

如果需要把图片拷贝到硬盘上,则在foreach循环里加上以下代码:

 $imgData = get($img[1]);
    // 把图片文件写到硬盘上【下载】
    // 因为操作系统是GBK的,所以要把UTF8转成GBK
    is_dir(&#39;./youkuimg/&#39;) ? &#39;&#39;: mkdir(&#39;./youkuimg/&#39;);
	file_put_contents(&#39;./youkuimg/&#39;.mb_convert_encoding($img[3], &#39;gbk&#39;, &#39;utf-8&#39;).&#39;.jpg&#39;, $imgData);


 

效果如下:在当前目录下的youkuimg目录下就会有下载好的图片。



my github: https://github.com/lensh

 

The above is the detailed content of 3 ways to implement data collection in PHP. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
PHP: An Introduction to the Server-Side Scripting LanguagePHP: An Introduction to the Server-Side Scripting LanguageApr 16, 2025 am 12:18 AM

PHP is a server-side scripting language used for dynamic web development and server-side applications. 1.PHP is an interpreted language that does not require compilation and is suitable for rapid development. 2. PHP code is embedded in HTML, making it easy to develop web pages. 3. PHP processes server-side logic, generates HTML output, and supports user interaction and data processing. 4. PHP can interact with the database, process form submission, and execute server-side tasks.

PHP and the Web: Exploring its Long-Term ImpactPHP and the Web: Exploring its Long-Term ImpactApr 16, 2025 am 12:17 AM

PHP has shaped the network over the past few decades and will continue to play an important role in web development. 1) PHP originated in 1994 and has become the first choice for developers due to its ease of use and seamless integration with MySQL. 2) Its core functions include generating dynamic content and integrating with the database, allowing the website to be updated in real time and displayed in personalized manner. 3) The wide application and ecosystem of PHP have driven its long-term impact, but it also faces version updates and security challenges. 4) Performance improvements in recent years, such as the release of PHP7, enable it to compete with modern languages. 5) In the future, PHP needs to deal with new challenges such as containerization and microservices, but its flexibility and active community make it adaptable.

Why Use PHP? Advantages and Benefits ExplainedWhy Use PHP? Advantages and Benefits ExplainedApr 16, 2025 am 12:16 AM

The core benefits of PHP include ease of learning, strong web development support, rich libraries and frameworks, high performance and scalability, cross-platform compatibility, and cost-effectiveness. 1) Easy to learn and use, suitable for beginners; 2) Good integration with web servers and supports multiple databases; 3) Have powerful frameworks such as Laravel; 4) High performance can be achieved through optimization; 5) Support multiple operating systems; 6) Open source to reduce development costs.

Debunking the Myths: Is PHP Really a Dead Language?Debunking the Myths: Is PHP Really a Dead Language?Apr 16, 2025 am 12:15 AM

PHP is not dead. 1) The PHP community actively solves performance and security issues, and PHP7.x improves performance. 2) PHP is suitable for modern web development and is widely used in large websites. 3) PHP is easy to learn and the server performs well, but the type system is not as strict as static languages. 4) PHP is still important in the fields of content management and e-commerce, and the ecosystem continues to evolve. 5) Optimize performance through OPcache and APC, and use OOP and design patterns to improve code quality.

The PHP vs. Python Debate: Which is Better?The PHP vs. Python Debate: Which is Better?Apr 16, 2025 am 12:03 AM

PHP and Python have their own advantages and disadvantages, and the choice depends on the project requirements. 1) PHP is suitable for web development, easy to learn, rich community resources, but the syntax is not modern enough, and performance and security need to be paid attention to. 2) Python is suitable for data science and machine learning, with concise syntax and easy to learn, but there are bottlenecks in execution speed and memory management.

PHP's Purpose: Building Dynamic WebsitesPHP's Purpose: Building Dynamic WebsitesApr 15, 2025 am 12:18 AM

PHP is used to build dynamic websites, and its core functions include: 1. Generate dynamic content and generate web pages in real time by connecting with the database; 2. Process user interaction and form submissions, verify inputs and respond to operations; 3. Manage sessions and user authentication to provide a personalized experience; 4. Optimize performance and follow best practices to improve website efficiency and security.

PHP: Handling Databases and Server-Side LogicPHP: Handling Databases and Server-Side LogicApr 15, 2025 am 12:15 AM

PHP uses MySQLi and PDO extensions to interact in database operations and server-side logic processing, and processes server-side logic through functions such as session management. 1) Use MySQLi or PDO to connect to the database and execute SQL queries. 2) Handle HTTP requests and user status through session management and other functions. 3) Use transactions to ensure the atomicity of database operations. 4) Prevent SQL injection, use exception handling and closing connections for debugging. 5) Optimize performance through indexing and cache, write highly readable code and perform error handling.

How do you prevent SQL Injection in PHP? (Prepared statements, PDO)How do you prevent SQL Injection in PHP? (Prepared statements, PDO)Apr 15, 2025 am 12:15 AM

Using preprocessing statements and PDO in PHP can effectively prevent SQL injection attacks. 1) Use PDO to connect to the database and set the error mode. 2) Create preprocessing statements through the prepare method and pass data using placeholders and execute methods. 3) Process query results and ensure the security and performance of the code.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Chat Commands and How to Use Them
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.