
Blog crawling system

WBOY (Original)
2016-08-08 09:30:25

Introduction

With nothing to do over the weekend, I was bored and put together a blog crawling system in PHP. I visit cnblogs often, so of course I started with Cnblogs (see, I still like it). The crawling itself is simple: fetch the page content, use regular expressions to match what you want, and save it to the database. Naturally you run into some problems along the way. I had also thought about extensibility before starting, so that adding CSDN, 51CTO, Sina Blog and other sources later would be easy.

What can be crawled?

First, a caveat: this is simple crawling, and not everything you see on a page can be captured. Some things cannot be crawled, such as the following.

The read counts, comment counts, recommendations, objections, comments and so on are loaded dynamically by JavaScript through Ajax calls, so they cannot be obtained this way. In one sentence: open a page, right-click and view the source; if you cannot see something directly in the source, this simple crawler cannot fetch it. To crawl Ajax-filled content you need other approaches. I once read an article where the author loaded the page in a browser first and then filtered the whole DOM (the article also noted this is very inefficient). You could also reconstruct the Ajax requests yourself, but that is probably more trouble. The small sketch below shows what "fetching the source" means here.
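As a minimal illustration (the function name fetchRawHtml is mine, not part of the project), the snippet below just fetches a page's raw HTML with cURL, which is exactly what right-click "view source" shows; anything filled in later by JavaScript will not be in that string.

<?php
// Fetch the raw HTML of a page with cURL; this is all a "simple" crawler sees.
// Content injected later via Ajax/JavaScript never appears in this string.
function fetchRawHtml($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // don't hang on slow pages
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (simple-crawler)');
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

echo strlen(fetchRawHtml('http://www.cnblogs.com/')) . " bytes of static HTML\n";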

The idea of crawling

First, let's talk about crawl depth.

For example, start crawling from link a. With a depth of 1, you only fetch the content of that link. With a depth of 2, you also extract links from the content of a according to the specified rules, and then process each of those links as a depth-1 crawl, and so on. Depth is simply the level of the link; it is what lets the crawler actually "crawl". A small sketch of this idea follows below.
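Here is a minimal sketch of that idea, assuming the hypothetical fetchRawHtml() helper above and a made-up link-matching rule; it is not the project's actual implementation:

<?php
// Crawl a URL to the given depth: depth 1 fetches only this page,
// depth 2 also follows the links matched on it, and so on.
function crawl($url, $depth)
{
    if ($depth < 1) {
        return;
    }
    $html = fetchRawHtml($url);                  // assumed helper from the sketch above
    echo "fetched {$url} at depth {$depth}\n";   // real code would parse and store the page here

    if ($depth > 1) {
        // Example rule: collect absolute links and recurse one level shallower.
        preg_match_all('#<a\s[^>]*href="(http[^"]+)"#i', $html, $matches);
        foreach ($matches[1] as $link) {
            crawl($link, $depth - 1);
        }
    }
}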

Of course, if you crawl specific content starting from a single link, what you can reach is very limited, or the crawl may die out early (later levels match nothing), so several starting links can be set when crawling. You are also likely to hit many duplicate links, so crawled links must be marked to prevent fetching the same content repeatedly and creating redundancy. Several variables cache this information, in the following formats.

The first is a hash array: the key is the md5 of the URL and the value is a status flag (0 means not yet fetched). It keeps the set of URLs unique, and looks like this:

Array
(
    [bc790cda87745fa78a2ebeffd8b48145] => 0
    [9868e03f81179419d5b74b5ee709cdc2] => 0
    [4a9506d20915a511a561be80986544be] => 0
    [818bcdd76aaa0d41ca88491812559585] => 0
    [9433c3f38fca129e46372282f1569757] => 0
    [f005698a0706284d4308f7b9cf2a9d35] => 0
    [e463afcf13948f0a36bf68b30d2e9091] => 0
    [23ce4775bd2ce9c75379890e84fadd8e] => 0
    ......
)

The second is the array of URLs waiting to be fetched, grouped by level. This part could be optimized: right now I first collect all the links into the array and then loop over it again to fetch the content, which means every page except those at the deepest level is actually fetched twice. A better approach would be to fetch the content directly while collecting the next level's links, and then flip the status in the hash array above to 1 (already fetched), which would improve efficiency. First, look at what the array that stores the links contains:

Array
(
    [0] => Array
        (
            [0] => http://zzk.cnblogs.com/s?t=b&w=php&p=1
        )

    [1] => Array
        (
            [0] => http://www.cnblogs.com/baochuan/archive/2012/03/12/2391135.html
            [1] => http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html
            [2] => http://www.cnblogs.com/zuoxiaolong/p/java1.html
                ......
        )

    [2] => Array
        (
            [0] => http://www.cnblogs.com/ohmygirl/category/623392.html
            [1] => http://www.cnblogs.com/ohmygirl/category/619019.html
            [2] => http://www.cnblogs.com/ohmygirl/category/619020.html
                ......
        )

)

Finally, all the links are merged into one array and returned, and the program loops over it to fetch the content of each link. With a crawl depth of 2 as above, the level-0 pages have already been fetched once just to extract the level-1 links, and the level-1 pages have likewise already been fetched just to collect the level-2 links; when the content is actually retrieved, those pages are fetched all over again, and the status flag in the hash array above is never used... (to be optimized). A sketch of the dedup bookkeeping follows.
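As a rough sketch of that dedup bookkeeping (the names are mine, not the project's), links can be keyed by the md5 of their URL so duplicates are skipped, and the status flag can be flipped to 1 once a page has really been fetched:

<?php
// $seen maps md5(url) => status (0 = queued, 1 = already fetched).
// $queue holds the URLs still to fetch, grouped by crawl level.
$seen  = array();
$queue = array();

function enqueueUrl($url, $level, array &$seen, array &$queue)
{
    $key = md5($url);
    if (isset($seen[$key])) {
        return false;                // duplicate link, skip it
    }
    $seen[$key]      = 0;            // queued, not yet fetched
    $queue[$level][] = $url;
    return true;
}

enqueueUrl('http://zzk.cnblogs.com/s?t=b&w=php&p=1', 0, $seen, $queue);
enqueueUrl('http://zzk.cnblogs.com/s?t=b&w=php&p=1', 0, $seen, $queue); // skipped: already seen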

There are also the regular expressions for extracting articles. After analyzing article pages on Cnblogs, it turns out the title and body can basically be extracted in a very regular way.

The title HTML always follows the same format and can easily be matched with the following regular expression:

#<a\s*?id=\"cb_post_title_url\"[^>]*?>(.*?)<\/a>#is

As for the body, it could in theory be extracted neatly with the advanced regex feature of balancing groups, but after fiddling for a long time I found that PHP does not seem to support balancing groups well, so I gave that up. Looking at the HTML source, the article body can still be matched easily with the following regular expression, since every article basically contains the markers shown below:

#(<div\s*?id=\"cnblogs_post_body\"[^>]*?>.*)<div\s*id=\"blog_post_info_block\">#is

Start marker: <div id="cnblogs_post_body" ...> (screenshot omitted)

End marker: <div id="blog_post_info_block"> (screenshot omitted)
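A small sketch of how these two patterns could be applied (extractArticle() and the usage below are mine, not the project's actual classes, which appear in the directory listing further down):

<?php
// Extract the title and body of a Cnblogs article page using the two
// regular expressions above. Returns null if either pattern fails to match.
function extractArticle($html)
{
    $titleRe = '#<a\s*?id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is';
    $bodyRe  = '#(<div\s*?id="cnblogs_post_body"[^>]*?>.*)<div\s*id="blog_post_info_block">#is';

    if (!preg_match($titleRe, $html, $t) || !preg_match($bodyRe, $html, $b)) {
        return null;
    }
    return array(
        'title' => trim(strip_tags($t[1])),
        'body'  => $b[1],            // keep the body HTML as-is
    );
}

$article = extractArticle(fetchRawHtml('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html'));
if ($article !== null) {
    echo $article['title'] . "\n";
}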

The publication time of a post can also be extracted, although for some articles it may not be found; I won't list that here. With these pieces in place, the content can be crawled.

Start crawling

Time to start crawling. At first I set the crawl depth to 2 with the Cnblogs homepage as the starting page, and found I couldn't crawl much content. Then I noticed the homepage has a page-number navigation bar.

So I tried to build page URLs in the format http://www.cnblogs.com/#p2, looping 200 times and crawling each page as a starting page with depth 2. But I celebrated too soon: I ran several processes for a long time and grabbed several hundred thousand records, only to discover they were complete duplicates, all crawled from the first page. That is because when you click the navigation on the Cnblogs homepage (except page 1), the content is fetched by Ajax requests... It seems Cnblogs has thought about this. Since most people only open the homepage and rarely click through to later pages (I might occasionally click to the next page), they balance blocking naive crawlers against performance by serving the first page as a static page, with a cache valid for a few minutes (or refreshed based on the update rate, e.g. after a certain number of new posts, or some combination of the two). That is probably also why a newly published article sometimes only shows up a little while later (just my guess ^_^).

So is there no way to grab a lot of content at once? Later I found a place that uses static pages throughout.

The content obtained from the "Find" search (zzk.cnblogs.com) is all static, including every page in the navigation links at the bottom, and the search also offers filter conditions on the right that can further improve the quality of what is crawled. With this entry point, you can get a lot of high-quality articles. Below is the code that loops over 100 pages of results:

for ($i = 1; $i <= 100; $i++) {
    echo "PAGE{$i}*************************[begin]***************************\r";
    // Collect the article URLs from one page of the zzk.cnblogs.com search results
    $spidercnblogs = new C\Spidercnblogs("http://zzk.cnblogs.com/s?t=b&w=php&p={$i}");
    $urls = $spidercnblogs->spiderUrls();
    foreach ($urls as $key => $value) {
        // $cnblogs is presumably the grabber/model instance created earlier (not shown here)
        $cnblogs->grap($value);   // fetch the article content
        $cnblogs->save();         // store it in the database
    }
}

At this point you can go and grab whatever you like. Crawling is not very fast: on an ordinary PC I ran 10 processes for several hours and only collected 400,000-odd records. Now let's look at how the crawled content displays after a little polishing; I added Cnblogs' base CSS, and you can compare the result with the original.

The crawled content, slightly modified: (screenshot omitted)

The original content: (screenshot omitted)

Now look at the file directory structure, generated with the homemade directory-tree tool from the previous post:

 +myBlogs-master
    +controller
        |_Blog.php
        |_Blogcnblogs.php
        |_Spider.php
        |_Spidercnblogs.php
    +core
        |_Autoload.php
    +interface
        |_Blog.php
    +lib
        |_Mysql.php
    +model
        |_Blog.php
    |_App.php

The result is quite good. Let me also guess how a dedicated crawling site like Tuicool works: a resident process fetches content (say, the homepage) at intervals; if there is fresh content, it goes into the database, otherwise that round's content is discarded and the process waits for the next fetch. With a small enough interval, it can capture all the "fresh" content without missing a single post.
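A minimal sketch of that guessed polling approach (purely illustrative: it reuses the static zzk search page as the "front page" and keeps the seen-set in memory, whereas a real service would persist it to a database):

<?php
// Guessed polling loop for an aggregator-style crawler: fetch a list page at a
// fixed interval, keep only items not seen before, discard everything else.
$seen = array();                      // md5(url) => true for items already stored
$intervalSeconds = 60;                // a smaller interval means fewer posts are missed

while (true) {
    $html = file_get_contents('http://zzk.cnblogs.com/s?t=b&w=php&p=1');
    preg_match_all('#<a\s[^>]*href="(http://www\.cnblogs\.com/[^"]+)"#i', $html, $m);

    foreach ($m[1] as $url) {
        $key = md5($url);
        if (!isset($seen[$key])) {    // fresh item: store it (here we just print it)
            $seen[$key] = true;
            echo "new: $url\n";
        }
    }
    sleep($intervalSeconds);
}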

Here is the GitHub address:

github——myBlogs

The copyright of this article belongs to the author iforever (luluyrt@163.com). Reposting in any form without the author's consent is prohibited; any reposted article must clearly credit the author and link to the original on the article page, otherwise the right to pursue legal liability is reserved.

