Home  >  Article  >  Backend Development  >  Use PHP to crawl product data from Tmall and Taobao

Use PHP to crawl product data from Tmall and Taobao

不言
不言Original
2018-06-19 11:05:392638browse

This article mainly introduces in detail the method of crawling Tmall and Taobao product data with PHP. It has certain reference value. Interested friends can refer to it

1. Idea

I recently made a website that crawled product information from Tmall and Taobao from the URL. First I looked at the mobile webpage and found that react was used. I didn’t know much about it and couldn’t do it, so I Consider crawling the data from the PC portal, but when crawling the URL to obtain the data, the price, inventory, etc. information is not obtained. After careful study, I found that another interface is requested asynchronously, but the interface requires a refer to obtain the data, so I A simple crawler is written in the following way, which is used to crawl the product preview and the price, inventory, etc. of the first category of the product.

2. The code to implement

is as follows:

function crawlUrl($url){
import('PhpQuery.Curl');
 $curl=new \Curl();
 $result = $curl->read($url);
 $content = mb_convert_encoding( $result['content'], 'UTF-8', 'UTF-8,GBK,GB2312,BIG5' );
 $myres=array();
 if(strrpos($url,'taobao.com')!=false) {
  //匹配是否下架
  if(strpos($content,'此宝贝已下架')!==false){
   return false;
  }
  preg_match("|itemId   : '(.*)'|isU", $content, $match);
  $item_id=$match[1];
  preg_match("|sellerId   : '(.*)'|isU", $content, $match);
  $sellet_id=$match[1];
  preg_match("|<title>(.*)</title>|isU",$content,$match);
  $title=$match[1];
  //价格库存信息
  $ch = curl_init();
  curl_setopt ($ch, CURLOPT_URL, &#39;https://detailskip.taobao.com/service/getData/1/p1/item/detail/sib.htm?itemId=&#39;.$item_id.&#39;&sellerId=&#39;.$sellet_id.&#39;&modules=dynStock,qrcode,viewer,price,duty,xmpPromotion,delivery,upp,activity,fqg,zjys,amountRestriction,couponActivity,soldQuantity,originalPrice,tradeContract&callback=onSibRequestSuccess&#39;);
  $opt[CURLOPT_HEADER]=false;
  $opt[CURLOPT_CONNECTTIMEOUT]=15;
  $opt[CURLOPT_TIMEOUT]=300;
  $opt[CURLOPT_AUTOREFERER]=true;
  $opt[CURLOPT_USERAGENT]=&#39;Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11&#39;;
  curl_setopt_array($ch,$opt);
  curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt ($ch,CURLOPT_REFERER,$url);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  $out_put=curl_exec ($ch);
  curl_close ($ch);
  $res=str_replace(&#39;onSibRequestSuccess(&#39;,"",$out_put);
  $res=rtrim($res,&#39;);1&#39;);
  $result=json_decode($res,true);
  //查询出图片信息
  preg_match(&#39;|<ul id="J_UlThumb" class="tb-thumb tb-clearfix">(.*)</ul>|isU&#39;, $content, $match);
  preg_match_all(&#39;/<img src="(.*?)" \//&#39;, $match[1], $images);

  $myres[&#39;title&#39;]=str_replace(&#39;-淘宝网&#39;,&#39;&#39;,$title);

  $myres[&#39;price&#39;]=current($result[&#39;data&#39;][&#39;originalPrice&#39;]);

  $myres[&#39;act_price&#39;]=current($result[&#39;data&#39;][&#39;promotion&#39;][&#39;promoData&#39;]);

  $myres[&#39;stock&#39;]=$result[&#39;data&#39;][&#39;dynStock&#39;][&#39;stock&#39;];

  $myres[&#39;banners&#39;]=$images[1];
 }else{
  //匹配是否下架
  if(strpos($content,&#39;此宝贝已下架&#39;)!==false){
   return false;
  }
  $start=strpos($url,&#39;&id=&#39;);
  $item_id=substr($url,$start+4,12);
  if(!is_numeric($item_id)){
   $start=strpos($url,&#39;?id=&#39;);
   $end=strpos($url,&#39;&spm&#39;);
   $item_id=substr($url,$start+4,$end-$start-4);
  }
  preg_match("|<title>(.*)</title>|isU",$content,$match);
  $title=$match[1];
  $myurl=&#39;https://mdskip.taobao.com/core/initItemDetail.htm?cachedTimestamp=1500562177777&queryMemberRight=true&cartEnable=true&offlineShop=false&addressLevel=2&itemId=&#39;.$item_id.&#39;&tryBeforeBuy=false&isAreaSell=false&tmallBuySupport=true&isPurchaseMallPage=false&household=false&isForbidBuyItem=false&service3C=false&isRegionLevel=false&showShopProm=false&isSecKill=false&sellerPreview=false&isUseInventoryCenter=false&isApparel=true&callback=setMdskip&timestamp=1500562172109&isg=AiUlDZFWmP/sMgVurQSILU3Ytet/Zdis&isg2=Ajk51JIhRFqKzxmiNPP6dkYxSKXT7iySkzSTeVtu9WDf4ll0o5Y9yKdyEtHu&#39;;
  //价格库存信息
  $ch = curl_init();
  curl_setopt ($ch, CURLOPT_URL, $myurl);
  $opt[CURLOPT_HEADER]=false;
  $opt[CURLOPT_CONNECTTIMEOUT]=15;
  $opt[CURLOPT_TIMEOUT]=300;
  $opt[CURLOPT_AUTOREFERER]=true;
  $opt[CURLOPT_USERAGENT]=&#39;Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11&#39;;
  curl_setopt_array($ch,$opt);
  curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt ($ch,CURLOPT_REFERER,$url);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  $out_put=curl_exec ($ch);
  curl_close ($ch);
  $res = mb_convert_encoding( $out_put, &#39;UTF-8&#39;, &#39;UTF-8,GBK,GB2312,BIG5&#39; );
  $res=str_replace(&#39;setMdskip&#39;,"",$res);
  $res=str_replace(&#39;(&#39;,"",$res);
  $res=str_replace(&#39;)&#39;,"",$res);
  $result=json_decode($res,true);
  $nowk="";
  $nowstore="";
  foreach($result[&#39;defaultModel&#39;][&#39;inventoryDO&#39;][&#39;skuQuantity&#39;] as $k=>$val){
   $nowk=$k;
   $nowstore=$val;
   break;
  }

  $myres[&#39;title&#39;]=str_replace(&#39;-tmall.com天猫&#39;,&#39;&#39;,$title);

  $myres[&#39;price&#39;]=$result[&#39;defaultModel&#39;][&#39;itemPriceResultDO&#39;][&#39;priceInfo&#39;][$nowk][&#39;price&#39;];

  $myres[&#39;act_price&#39;]=isset($result[&#39;defaultModel&#39;][&#39;itemPriceResultDO&#39;][&#39;priceInfo&#39;][$nowk][&#39;suggestivePromotionList&#39;])?$result[&#39;defaultModel&#39;][&#39;itemPriceResultDO&#39;][&#39;priceInfo&#39;][$nowk][&#39;suggestivePromotionList&#39;]:$result[&#39;defaultModel&#39;][&#39;itemPriceResultDO&#39;][&#39;priceInfo&#39;][$nowk];

  $myres[&#39;stock&#39;]=$result[&#39;defaultModel&#39;][&#39;inventoryDO&#39;][&#39;totalQuantity&#39;]?$result[&#39;defaultModel&#39;][&#39;inventoryDO&#39;][&#39;totalQuantity&#39;]:$nowstore[&#39;quantity&#39;];
  //查询出图片信息
  preg_match(&#39;|<ul id="J_UlThumb" class="tb-thumb tm-clear">(.*)</ul>|isU&#39;,$content, $match);
  preg_match_all(&#39;/<img src="(.*?)" \//&#39;,$match[1],$images);
  $myres[&#39;banners&#39;]=$images[1];
 }
 return $myres;
}

The above code uses the phpquery library , but it is actually useless. Just use Curl. The specific crawled data can be viewed through parameters. The method does not distinguish between Taobao and Tmall links, but the premise is that it must be a PC-side link. In addition, the regular expression is not standardized, so it can Rewrite the regex yourself to match the data.

The above is the entire content of this article. I hope it will be helpful to everyone's study. For more related content, please pay attention to the PHP Chinese website!

Related recommendations:

About php Encryption issues during development

How to use PHP to achieve page registration and audit

##

The above is the detailed content of Use PHP to crawl product data from Tmall and Taobao. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn