
Example of QueryList parsing WeChat articles

小云云 | Original | 2018-03-27 10:46:56

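At work I recently needed to build a feature that fetches a WeChat article from its URL. The HTML parsing is done with QueryList. This post records the logic of the implementation, in the hope that it helps others.

The implementation is as follows: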

use QL\QueryList;  // assumes QueryList v3, whose class lives in the QL namespace

/**
 * Fetch a WeChat article and parse it with QueryList.
 *
 * @param string $url WeChat article URL
 * @return array|false parsed article fields, or false on failure
 */
function spideWx($url){
    if(empty($url)) return false;
    $_host = parse_url($url, PHP_URL_HOST);  // extract the host name
    if($_host !== 'mp.weixin.qq.com') return false;  // not a WeChat article
    $html = file_get_contents($url);
    if(empty($html)) return false;
    // Strip WeChat's decoy markup; if it is left in, the parsed text comes out garbled
    $html = str_replace("<!--headTrap<body></body><head></head><html></html>-->", "", $html);
    // The cover image URL is assigned to a JS variable in the page source
    $coverImgUrl = '';
    if(preg_match('/var msg_cdn_url = "([^"]*)"/', $html, $matches)){
        $coverImgUrl = $matches[1];
    }
    $rules = array(   // QueryList parsing rules
        'content' => array('#js_content', 'html'),  // article body
        'title' => array('#activity-name', 'text'),  // article title
        'author'=> array('.rich_media_meta_text:eq(1)','text'),  // author
        'account_name' => array('#js_profile_qrcode .profile_nickname','text'),  // official account name
        'account_en_name' => array('#js_profile_qrcode .profile_meta:eq(0) .profile_meta_value','text'),  // official account ID
    );
    // Route image URLs through this prefix to get around WeChat's image hotlink protection
    $_link = 'http://read.html5.qq.com/image?src=forum&q=5&r=0&imgflag=7&imageUrl=';
    $data = QueryList::Query($html,$rules)->getData();   // run the parse
    if(empty($data[0])) return false;  // parsing failed
    $_res = $data[0];  // first (and only) result row
    $_res['thumb'] = $_link.$coverImgUrl;   // cover image
    $_res['title_crc'] = sprintf("%u", crc32($_res['title']));   // CRC32 of the title
    $_res['url_crc'] = sprintf("%u", crc32($url));   // CRC32 of the URL
    // Prefix the image links inside the article body with the same proxy URL
    $pattern = '/<img([^>]*)src\s*=\s*([\' "])([\s\S]*?)([^>]*)/';
    $_res['content'] = preg_replace($pattern, '<img$1src=$2'.$_link.'$3$4', $_res['content']);
    return $_res;
}
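
The string-handling steps in the function can be tried on their own, without a network request or QueryList. The following is a minimal sketch, not the original code: the helper names `extractCoverUrl`, `proxyImages`, and `unsignedCrc` are hypothetical, and the sample HTML is made up. Unlike the pattern above, `proxyImages` only rewrites an attribute named exactly `src` and leaves the rest of the tag untouched.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from QueryList
// or from the function above.

// Image proxy prefix, copied from the article's code.
const IMG_PROXY = 'http://read.html5.qq.com/image?src=forum&q=5&r=0&imgflag=7&imageUrl=';

/**
 * Pull the cover image URL out of the `var msg_cdn_url = "..."` assignment.
 * Returns an empty string when the variable is not present.
 */
function extractCoverUrl(string $html): string
{
    if (preg_match('/var msg_cdn_url = "([^"]*)"/', $html, $m)) {
        return $m[1];
    }
    return '';
}

/**
 * Prefix the quoted value of every `src` attribute inside an <img> tag.
 * The leading \s means attributes such as `data-src` are not matched.
 */
function proxyImages(string $content, string $proxy = IMG_PROXY): string
{
    return preg_replace_callback(
        '/(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2/i',
        function (array $m) use ($proxy) {
            // $m[1] = tag up to "src=", $m[2] = quote character, $m[3] = original URL
            return $m[1] . $m[2] . $proxy . $m[3] . $m[2];
        },
        $content
    );
}

/**
 * CRC32 as an unsigned decimal string. On 32-bit PHP builds crc32() can
 * return a negative integer, which is why the value is formatted with %u.
 */
function unsignedCrc(string $value): string
{
    return sprintf('%u', crc32($value));
}

// Made-up sample input.
$page = '<script>var msg_cdn_url = "https://example.com/cover.jpg";</script>';
$body = '<p><img class="pic" src="https://example.com/a.jpg"></p>';

echo extractCoverUrl($page), PHP_EOL;
echo proxyImages($body), PHP_EOL;
echo unsignedCrc('https://mp.weixin.qq.com/s/example'), PHP_EOL;
```

Capturing the URL with a group (`[^"]*`) avoids the `explode`/`substr` step and stops at the first closing quote, so a second quoted string on the same line does not end up in the result.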
