
Crawler in practice: Use PHP to crawl JD.com product information

PHPz
2023-06-13 11:11:19

JD.com is one of China's largest general e-commerce platforms and can list tens of thousands of products every day. It offers consumers a wide selection at competitive prices. Sometimes, however, we need to collect JD product information in bulk so that we can quickly filter, compare, and analyze it. That is a job for a crawler. This article shows how to write one in PHP to crawl JD.com product information.

  1. Preparation

First, we need to install the curl extension for PHP and define a few variables that the rest of the code relies on.

Open a terminal and run the command that matches your system:

sudo apt-get install php-curl   # Ubuntu; older releases use a versioned package such as php7.0-curl
brew install php                # macOS; Homebrew's php formula is built with curl support
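After installation, it can help to confirm that PHP actually sees the extension. The following is a minimal sketch, assuming it is run with the same PHP binary that will run the crawler; the helper name `curlStatus` is made up for this example.

```php
<?php
// Hypothetical helper (not part of the article's code):
// reports whether the curl extension is available to this PHP binary.
function curlStatus(): string
{
    if (!extension_loaded('curl')) {
        return "curl extension is NOT loaded";
    }
    // curl_version() returns an array that includes the libcurl version string.
    $info = curl_version();
    return "curl extension loaded, libcurl " . $info['version'];
}

echo curlStatus(), PHP_EOL;
```

Running `php -m` on the command line lists all loaded modules and is a quick alternative.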

Next, we define two variables for use in the subsequent code: $jgname holds the URL of a JD.com product list page, and $skulist holds the URL of a single product's detail page. The code is as follows:

$jgname= "https://list.jd.com/list.html?cat=1318,1486,1490&ev=exbrand_13910&sort=sort_rank_asc&trans=1&JL=3_%E5%93%81%E7%89%8C_%E5%B0%8F%E7%B1%B3%EF%BC%88MI%EF%BC%89#J_crumbsBar";
$skulist="https://item.jd.com/1285310.html";
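
The list URL carries its filters as percent-encoded query parameters. As a rough illustration, the sketch below decodes them with PHP's built-in parse_url() and parse_str(). What each parameter means to JD.com is not stated in the article, so the code only prints the decoded values.

```php
<?php
// The list-page URL from the article.
$jgname = "https://list.jd.com/list.html?cat=1318,1486,1490&ev=exbrand_13910&sort=sort_rank_asc&trans=1&JL=3_%E5%93%81%E7%89%8C_%E5%B0%8F%E7%B1%B3%EF%BC%88MI%EF%BC%89#J_crumbsBar";

// parse_url() isolates the query string (the part between "?" and "#").
$query = parse_url($jgname, PHP_URL_QUERY);

// parse_str() splits the query string into an array and URL-decodes each value.
parse_str($query, $params);

foreach ($params as $name => $value) {
    echo $name, " = ", $value, PHP_EOL;
}
```

The decoded JL value contains the words 品牌 (brand) and 小米 (Xiaomi), which suggests the URL points at a brand-filtered list.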

  2. Get the product list

With the environment and variables in place, we can start writing the crawler. The first step is to fetch the product list page at $jgname with curl, then use regular expressions to extract each product's name, price, number of reviews, SKU number, and link.

The specific code is as follows:

$ch = curl_init(); // initialize a curl session

curl_setopt($ch, CURLOPT_URL, $jgname); // set the URL to request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response from curl_exec() as a string instead of printing it
$result = curl_exec($ch); // execute the curl session
curl_close($ch); // close the curl session

// Match every <li> element; the "s" modifier lets "." span line breaks
preg_match_all('/<li .*?<\/li>/s', $result, $matches);

$goodsinfo = array(); // the product list

foreach ($matches[0] as $item) {
    // Extract the product fields ("标题" means "title")
    preg_match('/sku="(\d+)"/', $item, $skuid);
    preg_match('/标题">\s{0,}([\d\D]+?)\s{0,}<\/a>/', $item, $titlename);
    preg_match('/<strong>¥<\/strong>[\s\n]{0,}<i>(\d+\.\d+)<\/i>/', $item, $price);
    preg_match('/<div\s{0,}class="p-commit">[\s\n]+<strong[^>]+>(\d+)/', $item, $commentnum);
    preg_match('/<a\s{0,}href="([\d\D]+?)"/', $item, $link);

    // Store the fields; "?? ''" avoids a warning when a pattern did not match
    $goods = array(
        "title" => trim($titlename[1] ?? ''),
        "price" => trim($price[1] ?? ''),
        "link" => "https:" . ($link[1] ?? ''),
        "skuid" => trim($skuid[1] ?? ''),
        "commentnum" => trim($commentnum[1] ?? '')
    );
    array_push($goodsinfo, $goods); // append the product to the list

    // Test output: print the product information
    echo $goods['title'] . " " . $goods['price'] . " " . $goods['commentnum'] . " " . $goods['link'] . "<br>";
}

In the code above, each product's SKU number and link are stored in $goods['skuid'] and $goods['link'], and the remaining fields (title, price, number of reviews) go into the same $goods array. Each $goods array is then appended to $goodsinfo with array_push(). The echo statement inside the loop prints every product so that the crawl results are easy to check.
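
Because the live page markup can change, it may be easier to check the extraction logic offline first. The sketch below runs the same kind of patterns against a small HTML fragment. The fragment, the SKU numbers, and the function name `parseListItems` are all invented for illustration and are not taken from JD.com.

```php
<?php
// Hypothetical list-page fragment, shaped to fit the patterns used in this step.
$sampleHtml = <<<'HTML'
<ul>
<li class="gl-item" sku="100001">
  <a href="//item.jd.com/100001.html" title="标题">
    Example Phone A
  </a>
  <strong>¥</strong><i>1999.00</i>
  <div class="p-commit">
    <strong class="count">3500</strong>
  </div>
</li>
<li class="gl-item" sku="100002">
  <a href="//item.jd.com/100002.html" title="标题">
    Example Phone B
  </a>
  <strong>¥</strong><i>2499.50</i>
  <div class="p-commit">
    <strong class="count">120</strong>
  </div>
</li>
</ul>
HTML;

// Hypothetical helper: turns list-page HTML into an array of product records.
function parseListItems(string $html): array
{
    $products = array();
    // The "s" modifier lets "." match newlines, since each <li> spans several lines.
    preg_match_all('/<li .*?<\/li>/s', $html, $matches);
    foreach ($matches[0] as $item) {
        preg_match('/sku="(\d+)"/', $item, $sku);
        preg_match('/标题">\s{0,}([\d\D]+?)\s{0,}<\/a>/', $item, $title);
        preg_match('/<strong>¥<\/strong>[\s\n]{0,}<i>(\d+\.\d+)<\/i>/', $item, $price);
        preg_match('/<div\s{0,}class="p-commit">[\s\n]+<strong[^>]+>(\d+)/', $item, $comments);
        preg_match('/<a\s{0,}href="([\d\D]+?)"/', $item, $link);
        $products[] = array(
            'skuid'      => $sku[1] ?? '',
            'title'      => trim($title[1] ?? ''),
            'price'      => $price[1] ?? '',
            'commentnum' => $comments[1] ?? '',
            // The sample href is protocol-relative, so a scheme is prepended.
            'link'       => 'https:' . ($link[1] ?? ''),
        );
    }
    return $products;
}

foreach (parseListItems($sampleHtml) as $p) {
    echo $p['skuid'], " ", $p['title'], " ", $p['price'], " ", $p['commentnum'], " ", $p['link'], PHP_EOL;
}
```

Keeping the parsing in a function that takes a string makes it possible to test against saved pages without sending any requests.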

  3. Get product details

Now that we have the product list, the next step is to fetch the details of each product and store them in its $goods array. The previous step already gave us each product's SKU number and link, so we can open each product page and extract further fields from it. The specific code is as follows:

foreach ($goodsinfo as &$goods) {
    // Rebuild each product's detail-page link from its SKU number
    $link = "https://item.jd.com/" . $goods['skuid'] . ".html";
    $goods['link'] = $link;

    // Fetch the detail page with curl
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        continue; // request failed; keep the list-page data for this product
    }

    // The product counts as purchasable unless the page says "无货" (out of stock).
    // Checking $html here avoids downloading the page a second time.
    $canBuy = !preg_match('/无货/', $html);

    // Use regular expressions to pull the specification blocks and the product name out of the page
    preg_match_all('/<div\s{0,}class="Ptable".*?>[\s\n]+<div\s{0,}class="Ptable-item".*?>[\s\n]+([\d\D]*?)<\/div>/', $html, $items);
    // "商品名称" means "product name"; the capture is not restricted to names that start with a digit
    if (preg_match('/<strong>商品名称<\/strong><em>(.+?)<\/em>/', $html, $nameMatch)) {
        $goods['title'] = $nameMatch[1];
    }
    echo $goods['title'];

    if ($canBuy) {
        // Page label => key in $goods
        $fieldMap = array(
            "价格" => 'price',               // price
            "颜色" => 'color',               // color
            "产地" => 'producePlace',        // place of origin
            "商品编号" => 'goodsn',          // product number
            "型号" => 'model',               // model
            "商品毛重" => 'grossWeight',     // gross weight
            "规格" => 'specifications',      // specifications
        );
        foreach ($items[1] as $item) {
            // Strip HTML tags, spaces, tabs and carriage returns. Newlines are kept
            // because they separate the key:value pairs matched below.
            $item = strip_tags($item);
            $item = str_replace(array(" ", "\t", "\r"), "", $item);

            // Split each line into a key:value pair
            preg_match_all('/([^:\n]+):([^\n]*)/', $item, $item2);
            foreach ($item2[1] as $i => $key) {
                if (isset($fieldMap[$key])) {
                    $goods[$fieldMap[$key]] = $item2[2][$i];
                }
            }
        }
        // Get the number of comments
        if (preg_match('/<a\s{0,}href="#comment"\s{0,}target="_self">\s{0,}[\d\D]+?<strong\s{0,}class="curr-num">(\d*)</', $html, $comment)) {
            $goods['commentnum'] = $comment[1];
        }
    }
}
unset($goods); // break the reference left behind by the by-reference foreach

This code uses the same technique as step 2: curl fetches each product's detail page, and regular expressions extract the useful fields. We can print the collected product details as follows:

foreach ($goodsinfo as &$goods) {
    echo $goods['skuid']." ".$goods['title']." ".$goods['price']." ".$goods['commentnum']." ".$goods['link']."<br>";
}
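
The key:value splitting used in the detail step can also be tried in isolation. The sketch below swaps the regular expression for explode(), which is a different technique from the article's, and runs on invented sample text. It assumes that each pair sits on its own line and uses an ASCII colon. The function name `parseSpecs` is hypothetical.

```php
<?php
// Hypothetical specification text, as it might look after HTML tags are stripped.
$specText = "商品名称:Example Phone A\n商品编号:100001\n商品毛重:0.5kg\n颜色:Black\n";

// Page label => array key, following the field names used in the article.
$fieldMap = array(
    "颜色"     => 'color',
    "商品编号" => 'goodsn',
    "商品毛重" => 'grossWeight',
);

// Hypothetical helper: maps "label:value" lines to the keys listed in $fieldMap.
function parseSpecs(string $text, array $fieldMap): array
{
    $result = array();
    foreach (explode("\n", $text) as $line) {
        // Split on the first colon only, so a value may itself contain ":".
        $parts = explode(':', trim($line), 2);
        if (count($parts) !== 2) {
            continue; // blank line or a line without a colon
        }
        list($label, $value) = $parts;
        if (isset($fieldMap[$label])) {
            $result[$fieldMap[$label]] = trim($value);
        }
    }
    return $result;
}

print_r(parseSpecs($specText, $fieldMap));
```

Labels that are not in the map, such as 商品名称 here, are simply ignored, which mirrors the article's chain of label comparisons.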

That completes the whole process. In real applications, the code can be adjusted to actual needs, for example by adding error handling, setting request headers, and throttling the crawl rate. In short, this provides a basis for building a stable and efficient crawler that collects JD product information to support e-commerce operations and analysis.
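
As one possible shape for those adjustments, the sketch below wraps curl in a helper that sets a timeout and request headers, reports failures instead of passing bad data on, and pauses between requests. The function names, the User-Agent string, the header value, and the one-second delay are placeholders chosen for this example and are not values from the article.

```php
<?php
// Hypothetical helper: fetch one page. Returns the body as a string,
// or null when the request fails or the status code is not 200.
function fetchPage(string $url, int $timeoutSeconds = 10)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up after this many seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)', // placeholder
        CURLOPT_HTTPHEADER     => array('Accept-Language: zh-CN,zh;q=0.9'),       // placeholder
    ));
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}

// Hypothetical helper: fetch several pages, sleeping between requests
// to keep the crawl rate low.
function crawlAll(array $urls, int $delaySeconds = 1): array
{
    $pages = array();
    $first = true;
    foreach ($urls as $url) {
        if (!$first) {
            sleep($delaySeconds); // wait before every request after the first
        }
        $first = false;
        $pages[$url] = fetchPage($url);
    }
    return $pages;
}

// Usage sketch (not executed here):
// $pages = crawlAll(array('https://item.jd.com/1285310.html'));
```

Returning null on failure lets the calling loop skip a product rather than run the regular expressions on an error page.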

