Practical crawler combat: Use PHP to crawl JD.com product information-PHP Tutorial-php.cn

Home

Backend Development

PHP Tutorial

Practical crawler combat: Use PHP to crawl JD.com product information

PHPz

Jun 13, 2023 am 11:11 AM

phpJingdongreptile

In today’s e-commerce era, JD.com, as one of China’s largest comprehensive e-commerce companies, can even put tens of thousands of products on its shelves every day. For the majority of consumers, JD.com provides a wide range of product selections and advantageous price concessions. However, sometimes, we need to obtain JD product information in batches, quickly screen, compare, analyze, etc. At this time, we need to use crawler technology. In this article, we will introduce the implementation of using PHP language to write a crawler to help us quickly crawl JD.com product information.

Preparation

First, we need to install the curl extension required by PHP and set some commonly used variables. The specific steps are as follows:

First, open the terminal or powershell and enter the following command to install the curl extension package:

sudo apt-get install php7.0-curl //ubuntu系统安装

brew install curl-openssl php-curl //macOS系统安装

Next, we need to set some simple variables in the PHP code to facilitate us used in subsequent code. For example, we define a $jgname variable to represent the access address of JD.com, and another $skulist variable to represent the access address of each product. The code is as follows:

$jgname= "https://list.jd.com/list.html?cat=1318,1486,1490&ev=exbrand_13910&sort=sort_rank_asc&trans=1&JL=3_%E5%93%81%E7%89%8C_%E5%B0%8F%E7%B1%B3%EF%BC%88MI%EF%BC%89#J_crumbsBar";
$skulist="https://item.jd.com/1285310.html";

Get the product list

Now that we have prepared the environment and required variables, we can start writing our crawler. First, we need to obtain the product list of the target JD product page. We can use curl tools and regular expressions to obtain the target link based on the access address of the JD.com product page (i.e. $jgname). Get product information such as price, number of reviews, product name, product number, etc. respectively.

The specific code is as follows:

$ch = curl_init();//初始化curl

curl_setopt($ch, CURLOPT_URL,$jgname);//设置url属性
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);//设置是否将curl_exec()获取的信息以字符串返回，而不是直接输出
$result = curl_exec ($ch);//执行一个curl会话
curl_close ($ch);//关闭curl会话

preg_match_all("/<li .*?</li>/", $result, $matches);//正则表达式把需要的内容取出来，即匹配<li>标签

$goodsinfo=array();//创建一个商品列表

foreach ($matches[0] as $item) {
    //获取商品信息
    preg_match("/sku="(d+)"/",$item,$skuid);
    preg_match("/标题">s{0,}([dD]+?)s{0,}</a>/",$item,$titlename);
    preg_match("/<strong>￥</strong>[s
]{0,}<i>(d+.d+)</i>/",$item,$price);
    preg_match("/<divs{0,}class="p-commit">[s
]+<strong[^>]+>(d+)/",$item,$commentnum);
    preg_match("/<as{0,}href="([dD]+?)"/",$item,$link);

    //将商品信息存储到商品列表中
    $goods=array(
         "title"=>trim($titlename[1]),
         "price"=>trim($price[1]),
         "link"=>"https:".$link[1],
         "skuid"=>trim($skuid[1]),
         "commentnum"=>trim($commentnum[1])
    );
    array_push($goodsinfo,$goods);//将商品信息添加到商品列表

    //输出测试：打印商品信息
    echo $goods['title']." ".$goods['price']." ".$goods['commentnum']." ".$goods['link']."<br>";
}

In the above code, we store the link and number of each product obtained in $goods'skuid' and 'link', and Other useful information (price, number of reviews, etc.) is placed in the $goods array. Finally, it is added to the $goodsinfo array through the array_push() function. You can use loop statements to output product list information for easy viewing of crawling results.

Get product details

Now, we have obtained the product list information in the JD product table page, the next step is to obtain the detailed information of each product , and store it in the $goods array. We have obtained the number and link of each product in the $goods array in the previous step. Therefore, the next step is to open each link to obtain various useful product information. The specific code is as follows:

foreach ($goodsinfo as &$goods) {
    //更新每个商品的网页链接
    $link="https://item.jd.com/".$goods['skuid'].".html";
    $goods['link']=$link;

    $canBuy=true;//官网上可以买
    //判断是否能够购买
    preg_match('/无货/',file_get_contents($link)) && ($canBuy=false);

    //利用curl工具打开网页链接，获得网页代码
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$link);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $html = curl_exec ($ch);
    curl_close ($ch);
    //分析网页代码，使用正则表达式获取商品种类，价格，颜色，库存数量等数据，并保存
    preg_match_all('/<divs{0,}class="Ptable".*?>[s
]+<divs{0,}class="Ptable-item".*?>[s
]+([dD]*?)</div>/',$html,$items);
    preg_match_all('/<strong>商品名称</strong><em>(d.*)</em>/',$html,$item);
    $goods['title']=$item[1][0];
    echo $goods['title'];

    if($canBuy)
    {
        foreach ($items[1] as &$item) {
            //去掉html标记、空格、换行符
            $item=strip_tags($item);
            $item=str_replace(" ","",$item); 
            $item=str_replace("    ","",$item); 
            $item=str_replace("
","",$item);
            $item=str_replace("
","",$item); 

            //切割字符串，获取键值对
            preg_match_all('/([dD]*?)：([dD]*?)[
]/',$item,$item2);
            if(count($item2[1])>0){
                for($i=0;$i<count($item2[1]);$i++){
                    if($item2[1][$i]=="价格"){
                        $goods['price']=$item2[2][$i];
                    }elseif($item2[1][$i]=="颜色"){
                        $goods['color']=$item2[2][$i];
                    }elseif($item2[1][$i]=="产地"){
                        $goods['producePlace']=$item2[2][$i];
                    }elseif($item2[1][$i]=="商品编号"){
                        $goods['goodsn']=$item2[2][$i];
                    }elseif($item2[1][$i]=="型号"){
                        $goods['model']=$item2[2][$i];
                    }elseif($item2[1][$i]=="商品毛重"){
                        $goods['grossWeight']=$item2[2][$i];
                    }elseif($item2[1][$i]=="规格"){
                        $goods['specifications']=$item2[2][$i];
                    }
                }
            }
        }
        //获取商品评论数
        preg_match_all('/<as{0,}href="#comment"s{0,}target="_self">s{0,}[dD]+?<strongs{0,}class="curr-num">(d*)</',$html,$comment);
        $goods['commentnum']=$comment[1][0];
    }
}

In these codes, we use a technique similar to step 2, using the curl tool to obtain the detailed link of each product, and then using regular expressions to obtain some useful product information . We can output the obtained product details in the following way:

foreach ($goodsinfo as &$goods) {
    echo $goods['skuid']." ".$goods['title']." ".$goods['price']." ".$goods['commentnum']." ".$goods['link']."<br>";
}

That’s it for the whole process. In actual applications, we can make some adjustments and optimizations to the code based on actual needs, such as adding exception handling, setting request headers, adjusting crawling speed, etc. In short, on this basis, a stable and efficient crawler can be built to obtain JD product information and further assist e-commerce operations and analysis.

The above is the detailed content of Practical crawler combat: Use PHP to crawl JD.com product information. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

京东官方旗舰店和京东自营旗舰店的区别是什么Oct 17, 2022 pm 02:53 PM

区别：1、京东自营是京东公司自己经营的店面，从品牌厂家进货到京东仓库，然后在京东平台销售给消费者；而京东官方旗舰店是各大品牌商家借助京东平台销售自家产品。2、京东自营使用京东物流，发货快；而京东官方旗舰店则是由品牌发货。3、京东自营的产品都是储存在京东自己的仓储中心里面，而京东官方旗舰店的商品是储备于商家自己的仓库里面。4、京东自营的模式是B2B和B2C，而官方旗舰店是B2C。

淘宝和京东有什么区别Aug 18, 2022 pm 05:47 PM

区别：1、淘宝网是C2C网购平台，而京东是B2C平台。2、京东采用得是价值链整合模式，淘宝则采用的是开放平台模式。3、京东采用自买自卖的模式，赚取商品中间的差价，通过低收益来获取规模化的销量；淘宝则并不参与商品的实际销售和服务，商品的销售以及服务都是由淘宝卖家直接负责的。4、京东有自己的物流平台，采用的是分布式库存管理；淘宝依赖于第三方物流平台，采用的是集约式库存管理。

京东公司全称是什么Oct 11, 2022 pm 04:05 PM

京东公司全称是“北京京东世纪贸易有限公司”，是一家综合网络零售企业，公司旗下产业京东商城是中国电子商务领域最受消费者欢迎和最具有影响力的电子商务网站之一，拥有在线销售家电、数码通讯、电脑、家居百货、服装服饰、母婴、图书、食品、在线旅游等12大类数万个品牌商品。

京东自营和京东国际自营有什么区别Oct 17, 2022 pm 03:30 PM

区别：1、京东自营包括京东国际自营和京东国内商家。2、京东自营分为国内产品和国外产品，京东国际自营也是自营店，产品直接从海外采买。3、京东自营是京东集团很多子公司在京东商城平台上销售，而京东商城与京东国际自营也是一种合作关系；京东国际自营是京东集团子公司之一，是一家在境外注册的境外销售公司。

京东物流单号jdvc是什么业务Nov 01, 2022 am 11:17 AM

jdvc是京东快递业务。京东快递是京东物流的服务之一，其主要业务是为京东商城自营的产品进行运输配送，京东商场在全国各地都有建立保税仓，将买家的订单分配到就近的仓库，再由京东快递打包运输，一至三天即可送到。京东快递除了运输商务快递，还开通了个人运输，可以通过小程序、APP、和公众号预约下单，价格比其他快递公司更便宜，大约2～3天左右即可送货上门。

京东可以用支付宝支付吗Jul 07, 2022 am 11:37 AM

京东不可以用支付宝支付，在京东的支付界面“京东收银台”中没有“支付宝”的付款渠道，因为京东和支付宝并没有支付合作关系。京东支持的付款方式有：微信支付、云闪付、银行卡支付、货到付款、微信好友代付。

爬虫实战：用 PHP 爬取京东商品信息Jun 13, 2023 am 11:11 AM

在当今的电商时代，京东作为中国最大的综合电商之一，每日上架的商品数量甚至可以达到数万种。对于广大的消费者来说，京东提供了广泛的商品选择和优势的价格优惠。但是，有些时候，我们需要批量获取京东商品信息，快速筛选、比较、分析等等。这时候，我们就需要用到爬虫技术了。在本篇文章中，我们将会介绍利用PHP语言编写爬虫，帮助我们快速爬取京东商品信息的实现。准备工作首先，我

京东在线配镜流程Nov 08, 2023 pm 03:19 PM

京东在线配镜流程是：1、挑选镜架；2、挑选镜片；3、定制镜片；4、确认订单；5、支付订单；6、等待配送；7、验货与试戴；8、确认收货。在配镜前最好先去医院或专业的眼镜店进行验光，了解自己的近视度数和瞳距等信息，以便选择合适的镜片参数。京东的配镜服务可能会有所不同，具体流程和价格等信息可以在其官方网站上查询。

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

1 weeks agoByDDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Where to find the Crane Control Keycard in Atomfall

1 weeks agoByDDD

Hot Tools

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

Hot Topics

Where is the login entrance for gmail email?

7416

CakePHP Tutorial

1359

What is the format of the account name of steam

win11 activation key permanent