How to write a web crawler in java

May 28, 2019, 01:29 PM
java

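Web crawler

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.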

How a focused crawler works and an overview of its key techniques

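A web crawler is a program that automatically retrieves web pages. It downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on those pages. As it fetches pages, it keeps extracting new URLs from the current page and adding them to a queue until a stop condition of the system is met.

A focused crawler has a more complex workflow. It uses a page-analysis algorithm to filter out links unrelated to the topic, keeps the useful ones, and puts them in the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy and repeats the process until a stop condition is reached. In addition, every fetched page is stored by the system, then analyzed, filtered, and indexed for later query and retrieval. For a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.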
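The traditional crawl loop described above can be sketched as a breadth-first traversal with a URL queue. This is only a minimal sketch under stated assumptions: the class name CrawlLoopSketch, the crawl method, and the maxPages budget are hypothetical, and an in-memory map stands in for downloading and parsing real pages. It assumes Java 9 or later for List.of and Map.of.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class CrawlLoopSketch {
    /**
     * Breadth-first crawl over an in-memory link graph.
     * "links" maps a page URL to the URLs found on that page; it is a
     * hypothetical stand-in for fetching a page and extracting its links.
     */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links, int maxPages) {
        Queue<String> queue = new ArrayDeque<>(seeds); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>(seeds);        // URLs already queued, so none is fetched twice
        List<String> fetched = new ArrayList<>();       // pages "downloaded", in fetch order

        // Stop condition: the queue is empty or the page budget is used up.
        while (!queue.isEmpty() && fetched.size() < maxPages) {
            String url = queue.poll();
            fetched.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) { // add() returns false for URLs seen before
                    queue.add(next);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"),
                "d", List.of());
        System.out.println(crawl(List.of("a"), links, 10)); // prints [a, b, c, d]
    }
}
```

The seen set is what keeps the loop from revisiting pages that link back to each other, and the page budget is one simple example of a stop condition.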

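Compared with a general-purpose web crawler, a focused crawler must also solve three main problems:

(1) describing or defining the crawl target;

(2) analyzing and filtering pages or data;

(3) choosing a search strategy for URLs.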
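As a rough illustration of how the three problems fit together, the sketch below describes the target as a keyword set, scores each link by its anchor text, and uses a priority queue as a best-first search strategy. All of this is an assumption made for illustration: the class FocusedCrawlSketch, the TOPIC keywords, and the scoring rule are hypothetical, and real focused crawlers use far richer relevance models. It assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public class FocusedCrawlSketch {
    // (1) Target description: here simply a set of topic keywords (hypothetical).
    static final Set<String> TOPIC = Set.of("java", "crawler");

    // (2) Analysis and filtering: count how many topic keywords the anchor text contains.
    static int score(String anchorText) {
        String lower = anchorText.toLowerCase(Locale.ROOT);
        int s = 0;
        for (String keyword : TOPIC) {
            if (lower.contains(keyword)) {
                s++;
            }
        }
        return s;
    }

    // (3) Search strategy: best-first, highest score first; ties are broken by URL
    // so that the order is deterministic.
    static List<String> plan(Map<String, String> anchorTextByUrl) {
        PriorityQueue<Map.Entry<String, Integer>> frontier = new PriorityQueue<>((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        for (Map.Entry<String, String> e : anchorTextByUrl.entrySet()) {
            int s = score(e.getValue());
            if (s > 0) { // links with no topic keyword are dropped
                frontier.add(Map.entry(e.getKey(), s));
            }
        }
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            order.add(frontier.poll().getKey());
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, String> found = Map.of(
                "/java-crawler", "Java crawler tutorial",
                "/news", "Sports news",
                "/java", "Learn Java",
                "/crawler", "Crawler basics");
        System.out.println(plan(found)); // prints [/java-crawler, /crawler, /java]
    }
}
```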

How a web crawler is implemented

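Based on this principle, we can write a simple web crawler. The program fetches the data returned by a website and extracts the URLs it contains, and the extracted URLs are stored in a file. Besides URLs, we can extract any other information we want simply by changing the expression used to filter the data.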
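To make the last point concrete, the sketch below keeps the matching loop fixed and only swaps the filter expression. The class FilterExpressionSketch, the extract helper, the sample HTML, and the three patterns are all hypothetical examples. Regular expressions are a fragile way to read HTML, so treat this as an illustration rather than a robust parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterExpressionSketch {
    /** Returns capture group "group" of every match of "regex" in "text". */
    static List<String> extract(String regex, int group, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical page content used only for this demonstration.
        String html = "<title>Demo page</title>"
                + "<a href=\"mailto:admin@example.com\">mail</a>"
                + "<img src=\"/img/logo.png\">";

        // Same loop, three different filter expressions.
        System.out.println(extract("<title>(.*?)</title>", 1, html));       // prints [Demo page]
        System.out.println(extract("[\\w.]+@[\\w.]+\\.[a-z]+", 0, html));   // prints [admin@example.com]
        System.out.println(extract("<img[^>]*src=\"([^\"]*)\"", 1, html));  // prints [/img/logo.png]
    }
}
```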

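The following Java program extracts the links on a page (here https://www.rndsystems.com/cn) and stores them in a file.

The source code is as follows: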

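package com.cellstrain.icell.util;

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * A simple web crawler implemented in Java.
 */
public class Robot {
    public static void main(String[] args) {
        // URL matching rule. For plain-HTTP links, use "http://[\\w+\\.?/?]+\\.[A-Za-z]+" instead.
        // Note: inside [...] the characters + and ? are literals, not quantifiers.
        String regex = "https://[\\w+\\.?/?]+\\.[A-Za-z]+";
        Pattern p = Pattern.compile(regex);
        try {
            // The page to crawl; this example uses a biology website.
            URL url = new URL("https://www.rndsystems.com/cn");
            URLConnection urlconn = url.openConnection();
            // try-with-resources closes both streams even if an exception is thrown,
            // which avoids the NullPointerException the manual close() calls could raise.
            // Assumption: the page is served as UTF-8.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                         urlconn.getInputStream(), StandardCharsets.UTF_8));
                 // The extracted links are written to SiteURL.txt on drive D.
                 PrintWriter pw = new PrintWriter(new FileWriter("D:/SiteURL.txt"), true)) {
                String buf;
                while ((buf = br.readLine()) != null) {
                    Matcher bufM = p.matcher(buf);
                    while (bufM.find()) {
                        pw.println(bufM.group()); // one matched URL per line
                    }
                }
            }
            System.out.println("Crawl succeeded ^_^");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}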
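To see what the URL matching rule in the program actually captures, the sketch below runs the same pattern against a hypothetical line of HTML instead of a live page. The class UrlPatternDemo, the findUrls helper, and the sample links are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlPatternDemo {
    // The same URL matching rule used by the Robot program.
    static final Pattern URL_PATTERN = Pattern.compile("https://[\\w+\\.?/?]+\\.[A-Za-z]+");

    /** Returns every substring of "line" that the pattern matches. */
    static List<String> findUrls(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(line);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from any real page.
        String line = "<a href=\"https://www.example.com/docs/index.html\">a</a>"
                + "<a href=\"https://example.com/about/\">b</a>"
                + "<a href=\"https://my-site.example.org/\">c</a>"
                + "<a href=\"/contact\">d</a>";
        // prints [https://www.example.com/docs/index.html, https://example.com]
        System.out.println(findUrls(line));
    }
}
```

The sample suggests three limits of this rule: a match must end in a dot followed by letters, so the path /about/ is cut off; a hyphen is not in the character class, so the my-site link is missed entirely; and relative links such as /contact are never matched because the pattern requires the https:// prefix.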
