

python - What is the approach for a crawler to fetch all the data?

For example, a website has a "next page" link. How can I crawl through all of the next pages? Should I use recursion? Isn't recursion depth limited? I'm a beginner and would appreciate some pointers.

ringa_lee · 2862 days ago · 665

All replies (6)

  • 大家讲道理 2017-04-18 10:21:45

    Recursion, or a message queue, plus storage of the pages you have already crawled (Redis, a database).
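
    A minimal sketch of that idea in Python, using an in-memory deque as the message queue and a set as the record of already-crawled pages (in a real setup that record could live in Redis or a database); the start URL and the link-extraction regex are placeholders:

    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests

    START_URL = "http://www.example.com/"   # placeholder start page
    MAX_PAGES = 100                         # simple stop condition

    queue = deque([START_URL])   # the "message queue" of URLs still to crawl
    seen = {START_URL}           # already-crawled URLs (could be Redis or a database)

    while queue and len(seen) <= MAX_PAGES:
        url = queue.popleft()
        html = requests.get(url, timeout=10).text
        # crude link extraction; a real crawler would use an HTML parser
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith(START_URL) and link not in seen:
                seen.add(link)
                queue.append(link)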

  • 巴扎黑 2017-04-18 10:21:45

    If by "all the data" you mean everything under one small domain, and you don't want to dig into the underlying principles, then just learn Scrapy.

    If by "all the data" you mean the entire web, and you want to understand things like whether crawling should be breadth-first or depth-first, then you first need 10,000+ servers.

  • 怪我咯 2017-04-18 10:21:45

    If it's all on the same website, just crawl it recursively. Why wouldn't you be able to crawl a single site to the end?
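
    The question's worry about depth is real, though: Python's default recursion limit is around 1000 frames, so a long chain of "next page" links can raise RecursionError. A hedged sketch with an explicit depth guard (the URL and the next-page markup are assumptions); the same traversal can be rewritten as a plain while loop over the next link, which removes the limit entirely:

    import re

    import requests

    def crawl(url, seen, depth_left=500):
        # Follow "next page" links recursively, stopping well before the recursion limit.
        if depth_left == 0 or url in seen:
            return
        seen.add(url)
        html = requests.get(url, timeout=10).text
        # ... process the page content here ...
        nxt = re.search(r'rel="next"[^>]*href="([^"]+)"', html)  # assumed next-page markup
        if nxt:
            crawl(nxt.group(1), seen, depth_left - 1)

    crawl("http://www.example.com/page/1", set())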

  • 巴扎黑 2017-04-18 10:21:45

    If the structure of the website is simple and repetitive, you can first analyze the pattern of page number URLs, then get the total number of pages directly from the first page, and then manually construct the URLs of other pages.
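
    A sketch of that approach in Python (the page-number URL pattern and the text that carries the total page count are assumptions about the target site):

    import re

    import requests

    BASE = "http://www.example.com/list?page={}"   # assumed page-number URL pattern

    first = requests.get(BASE.format(1), timeout=10).text
    # assume the first page shows something like "Page 1 of 42"
    total = int(re.search(r"of\s+(\d+)", first).group(1))

    for n in range(1, total + 1):          # construct the remaining URLs directly
        html = requests.get(BASE.format(n), timeout=10).text
        # ... parse the listing page here ...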

  • PHP中文网 2017-04-18 10:21:45

    First, a brief outline of how crawling works. If the page links are very simple, like www.xxx.com/post/1.html, you can crawl them all with a loop or with recursion.
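
    With a pattern that simple, a loop is enough; a sketch that walks the numbered posts until one is missing (the stop condition is an assumption about how the site answers for non-existent posts):

    import requests

    n = 1
    while True:
        resp = requests.get(f"http://www.xxx.com/post/{n}.html", timeout=10)
        if resp.status_code != 200:   # assume a missing post means we have reached the end
            break
        # ... process resp.text here ...
        n += 1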

    If the page links are not predictable, parse the <a> tags out of each page you crawl and continue from there. In the process you need to store the links you have already crawled, check every newly found link against that store, and only then crawl it recursively.

    Crawling idea: crawl a URL -> parse new URLs out of the crawled content -> crawl those URLs -> ... -> break out of the recursion once you have crawled a certain number of pages or no new links have appeared for a while.

    Finally, the Python world has a very powerful crawler framework, Scrapy. It encapsulates basically all the common crawling routines, and a little study is enough to start using it.
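
    For reference, a minimal Scrapy spider in that spirit, assuming the pages expose a rel="next" link (the start URL and CSS selectors are placeholders):

    import scrapy

    class PostSpider(scrapy.Spider):
        name = "posts"
        start_urls = ["http://www.xxx.com/post/1.html"]   # placeholder start page

        def parse(self, response):
            # yield whatever data the page carries
            yield {"url": response.url, "title": response.css("title::text").get()}
            # follow the next-page link; Scrapy deduplicates visited URLs by default
            next_page = response.css('a[rel="next"]::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

    Saving it as spider.py and running scrapy runspider spider.py -o posts.json is enough to try it out.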

  • 阿神 2017-04-18 10:21:45

    
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import org.apache.commons.io.FileUtils;

    public class SpiderDemo {
        public static void main(String[] args) throws IOException {
            // First run: download the homepage to F://a.txt, then work from that saved copy.
    //        URL url = new URL("http://www.zhongguoxinyongheimingdan.com");
    //        URLConnection connection = url.openConnection();
    //        InputStream in = connection.getInputStream();
    //        File file = new File("F://a.txt");
    //        FileUtils.copyInputStreamToFile(in, file);

            // Read the saved homepage and split it on href= to get the link fragments.
            File srcDir = new File("F://a.txt");
            String str = FileUtils.readFileToString(srcDir, "UTF-8");
            String[] str1 = str.split("href=");
            for (int i = 3; i < str1.length - 1; i++) {
                // The fixed substring offsets rely on the site's uniform link length.
                URL url = new URL("http://www.zhongguoxinyongheimingdan.com" + str1[i].substring(1, 27));
                File f = new File("F://abc//" + str1[i].substring(2, 22));
                if (!f.exists()) {
                    f.mkdir();
                    // Download the detail page into the newly created folder.
                    File desc1 = new File(f, str1[i].substring(1, 22) + ".txt");
                    URLConnection connection = url.openConnection();
                    InputStream in = connection.getInputStream();
                    FileUtils.copyInputStreamToFile(in, desc1);
                    // Split the detail page on " src=" and download each image it references.
                    String str2 = FileUtils.readFileToString(desc1, "UTF-8");
                    String[] str3 = str2.split("\" src=\"");
                    for (int j = 1; j < str3.length - 2; j++) {
                        URL url1 = new URL(str3[j].substring(0, 81));
                        URLConnection connection1 = url1.openConnection();
                        connection1.setDoInput(true);
                        InputStream in1 = connection1.getInputStream();
                        File desc2 = new File(f, str3[j].substring(44, 76) + ".jpg");
                        FileUtils.copyInputStreamToFile(in1, desc2);
                    }
                }
            }
        }
    }
    

    Simple code that saves all the photos from the China credit blacklist website to local disk. The site itself is simple! But the site went down on the spot, and I was left speechless!
