

python - What is the approach for a crawler to fetch all the data?

For example, a website has a "next page" link. How can I crawl through all of the next pages? Should I use recursion? Isn't recursion depth limited? I'm a beginner and would appreciate some pointers.

ringa_lee · 2862 days ago · 665

All replies (6)

  • 大家讲道理 2017-04-18 10:21:45

    Recursion, or a message queue, plus storage of the pages you have already crawled (Redis, a database).
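
    A minimal sketch of that idea in Python, using an in-memory deque as the message queue and a set as the record of already-crawled pages (in a real setup that record could live in Redis or a database); the start URL and the link-extraction regex are placeholders:

    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests

    START_URL = "http://www.example.com/"   # placeholder start page
    MAX_PAGES = 100                         # simple stop condition

    queue = deque([START_URL])   # the "message queue" of URLs still to crawl
    seen = {START_URL}           # already-crawled URLs (could be Redis or a database)

    while queue and len(seen) <= MAX_PAGES:
        url = queue.popleft()
        html = requests.get(url, timeout=10).text
        # crude link extraction; a real crawler would use an HTML parser
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith(START_URL) and link not in seen:
                seen.add(link)
                queue.append(link)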

  • 巴扎黑 2017-04-18 10:21:45

    If by "all the data" you mean everything under one small domain, and you don't want to dig into the underlying principles, then just learn Scrapy.

    If by "all the data" you mean the entire web, and you want to understand things like whether crawling should be breadth-first or depth-first, then you first need 10,000+ servers.

  • 怪我咯 2017-04-18 10:21:45

    If it's all on the same website, just crawl it recursively. Why wouldn't you be able to crawl a single site to the end?
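
    The question's worry about depth is real, though: Python's default recursion limit is around 1000 frames, so a long chain of "next page" links can raise RecursionError. A hedged sketch with an explicit depth guard (the URL and the next-page markup are assumptions); the same traversal can be rewritten as a plain while loop over the next link, which removes the limit entirely:

    import re

    import requests

    def crawl(url, seen, depth_left=500):
        # Follow "next page" links recursively, stopping well before the recursion limit.
        if depth_left == 0 or url in seen:
            return
        seen.add(url)
        html = requests.get(url, timeout=10).text
        # ... process the page content here ...
        nxt = re.search(r'rel="next"[^>]*href="([^"]+)"', html)  # assumed next-page markup
        if nxt:
            crawl(nxt.group(1), seen, depth_left - 1)

    crawl("http://www.example.com/page/1", set())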

  • 巴扎黑 2017-04-18 10:21:45

    If the structure of the website is simple and repetitive, you can first analyze the pattern of page number URLs, then get the total number of pages directly from the first page, and then manually construct the URLs of other pages.
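
    A sketch of that approach in Python (the page-number URL pattern and the text that carries the total page count are assumptions about the target site):

    import re

    import requests

    BASE = "http://www.example.com/list?page={}"   # assumed page-number URL pattern

    first = requests.get(BASE.format(1), timeout=10).text
    # assume the first page shows something like "Page 1 of 42"
    total = int(re.search(r"of\s+(\d+)", first).group(1))

    for n in range(1, total + 1):          # construct the remaining URLs directly
        html = requests.get(BASE.format(n), timeout=10).text
        # ... parse the listing page here ...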

  • PHP中文网 2017-04-18 10:21:45

    First, a brief outline of how crawling works. If the page links are very simple, like www.xxx.com/post/1.html, you can crawl them all with a loop or with recursion.
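
    With a pattern that simple, a loop is enough; a sketch that walks the numbered posts until one is missing (the stop condition is an assumption about how the site answers for non-existent posts):

    import requests

    n = 1
    while True:
        resp = requests.get(f"http://www.xxx.com/post/{n}.html", timeout=10)
        if resp.status_code != 200:   # assume a missing post means we have reached the end
            break
        # ... process resp.text here ...
        n += 1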

    If the page links are not predictable, parse the <a> tags out of each page you crawl and continue from there. In the process you need to store the links you have already crawled, check every newly found link against that store, and only then crawl it recursively.

    Crawling idea: crawl a URL -> parse new URLs out of the crawled content -> crawl those URLs -> ... -> break out of the recursion once you have crawled a certain number of pages or no new links have appeared for a while.

    Finally, the Python world has a very powerful crawler framework, Scrapy. It encapsulates basically all the common crawling routines, and a little study is enough to start using it.
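
    For reference, a minimal Scrapy spider in that spirit, assuming the pages expose a rel="next" link (the start URL and CSS selectors are placeholders):

    import scrapy

    class PostSpider(scrapy.Spider):
        name = "posts"
        start_urls = ["http://www.xxx.com/post/1.html"]   # placeholder start page

        def parse(self, response):
            # yield whatever data the page carries
            yield {"url": response.url, "title": response.css("title::text").get()}
            # follow the next-page link; Scrapy deduplicates visited URLs by default
            next_page = response.css('a[rel="next"]::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    Saving it as spider.py and running scrapy runspider spider.py -o posts.json is enough to try it out.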

  • 阿神 2017-04-18 10:21:45

    
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import org.apache.commons.io.FileUtils;

    public class SpiderDemo {
        public static void main(String[] args) throws IOException {
            // First run: download the homepage to F://a.txt, then work from that saved copy.
    //        URL url = new URL("http://www.zhongguoxinyongheimingdan.com");
    //        URLConnection connection = url.openConnection();
    //        InputStream in = connection.getInputStream();
    //        File file = new File("F://a.txt");
    //        FileUtils.copyInputStreamToFile(in, file);

            // Read the saved homepage and split it on href= to get the link fragments.
            File srcDir = new File("F://a.txt");
            String str = FileUtils.readFileToString(srcDir, "UTF-8");
            String[] str1 = str.split("href=");
            for (int i = 3; i < str1.length - 1; i++) {
                // The fixed substring offsets rely on the site's uniform link length.
                URL url = new URL("http://www.zhongguoxinyongheimingdan.com" + str1[i].substring(1, 27));
                File f = new File("F://abc//" + str1[i].substring(2, 22));
                if (!f.exists()) {
                    f.mkdir();
                    // Download the detail page into the newly created folder.
                    File desc1 = new File(f, str1[i].substring(1, 22) + ".txt");
                    URLConnection connection = url.openConnection();
                    InputStream in = connection.getInputStream();
                    FileUtils.copyInputStreamToFile(in, desc1);
                    // Split the detail page on " src=" and download each image it references.
                    String str2 = FileUtils.readFileToString(desc1, "UTF-8");
                    String[] str3 = str2.split("\" src=\"");
                    for (int j = 1; j < str3.length - 2; j++) {
                        URL url1 = new URL(str3[j].substring(0, 81));
                        URLConnection connection1 = url1.openConnection();
                        connection1.setDoInput(true);
                        InputStream in1 = connection1.getInputStream();
                        File desc2 = new File(f, str3[j].substring(44, 76) + ".jpg");
                        FileUtils.copyInputStreamToFile(in1, desc2);
                    }
                }
            }
        }
    }
    

    Simple code that saves all the photos from the China credit blacklist website to local disk. The site itself is simple! But the site went down on the spot, and I was left speechless!
