Analyzing the key technologies of Java crawlers: HTTP requests and responses revealed

Author: 王林 (original)
Published: 2023-12-26 09:16:22

Exploring the core technology of Java crawlers: HTTP requests and responses

Introduction:
A large amount of information is published on the web, and in some scenarios we need to extract or collect data from web pages automatically, which is the job of a crawler. Java is widely used for writing crawlers. To build an efficient and stable Java crawler, we first need to understand how HTTP requests and responses work. This article introduces the basics of HTTP requests and responses and provides concrete code examples.

1. HTTP request
1.1. HTTP protocol
HTTP (HyperText Transfer Protocol) is an application layer protocol used to transmit hypermedia documents (such as HTML). It is based on the client/server model and communicates via request/response.
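
To make the request/response exchange more concrete, here is a minimal sketch of the text a client sends for a simple GET request in HTTP/1.1. The class name RawHttpMessageExample and the helper buildGetRequest are hypothetical names chosen for illustration; real crawlers normally let a library such as HttpURLConnection produce this text.

```java
public class RawHttpMessageExample {
    // Builds the text of a minimal HTTP/1.1 GET request for the given host and path.
    // (Hypothetical helper for illustration; not part of any library.)
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"   // request line: method, target, version
                + "Host: " + host + "\r\n"        // Host header, mandatory in HTTP/1.1
                + "Connection: close\r\n"         // ask the server to close after responding
                + "\r\n";                         // an empty line ends the header section
    }

    public static void main(String[] args) {
        System.out.print(buildGetRequest("www.example.com", "/index.html"));
    }
}
```

The server answers with a message of the same shape: a status line, header lines, an empty line, and then the body.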

1.2. URL and URI
A URL (Uniform Resource Locator) is a string that both identifies a resource on the Internet and describes how to locate it. Example URL: https://www.example.com/index.html.

A URI (Uniform Resource Identifier) is a string that identifies a resource. URIs include URLs and URNs (Uniform Resource Names) as subcategories, so every URL is also a URI.
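
The difference between a URL and a URN can be seen with the standard java.net.URI class. This is a small sketch: the class name UriPartsExample, the helper describe, the query string ?page=2, and the ISBN value are made up for illustration.

```java
import java.net.URI;

public class UriPartsExample {
    // Returns a one-line summary of the main components of a hierarchical URI.
    static String describe(String address) {
        URI uri = URI.create(address);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", path=" + uri.getPath()
                + ", query=" + uri.getQuery();
    }

    public static void main(String[] args) {
        // A URL says where the resource is and how to reach it.
        System.out.println(describe("https://www.example.com/index.html?page=2"));

        // A URN names a resource without locating it, so it has no host or path.
        URI urn = URI.create("urn:isbn:0451450523");
        System.out.println(urn.getScheme() + " / " + urn.getSchemeSpecificPart());
    }
}
```

A crawler typically uses these components to resolve relative links and to decide whether a link stays on the same host.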

1.3. HTTP request method
The HTTP request method specifies the operation that the client asks the server to perform on the requested resource. Common request methods include GET (retrieve a resource), POST (submit data), PUT (create or replace a resource), and DELETE (remove a resource).

The following sample code uses Java's HttpURLConnection to send a GET request:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpRequestExample {
    public static void main(String[] args) throws Exception {
        // URL to request
        String url = "https://www.example.com/index.html";

        // Create the URL object
        URL obj = new URL(url);

        // Open the connection (no network traffic happens yet)
        HttpURLConnection con = (HttpURLConnection) obj.openConnection();

        // Set the request method to GET
        con.setRequestMethod("GET");

        // Send the request and get the response status code
        int responseCode = con.getResponseCode();
        System.out.println("Response status code: " + responseCode);

        // Read the response body, decoding it as UTF-8 (an assumption about the page)
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // readLine() strips the line terminator, so add it back
                response.append(inputLine).append('\n');
            }
        }

        // Print the response body
        System.out.println("Response body: " + response);
    }
}
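
For comparison, a POST request also carries a request body. The following is a minimal sketch, assuming Java 11 or later. To stay runnable without any external site, it posts to a throwaway local echo server started with the JDK's built-in com.sun.net.httpserver.HttpServer. The class name HttpPostExample, the helpers encodeForm and post, the /echo path, and the form field q are all hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

public class HttpPostExample {
    // Encodes key/value pairs as an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    // Sends the form body with POST and returns the response body as text.
    static String post(String address, String formBody) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        try {
            con.setRequestMethod("POST");
            con.setDoOutput(true); // must be enabled before writing a request body
            con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = con.getOutputStream()) {
                out.write(formBody.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream in = con.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Local server on a free port that echoes the request body back.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/echo", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/echo";
            System.out.println(post(url, encodeForm(Map.of("q", "java crawler"))));
        } finally {
            server.stop(0);
        }
    }
}
```

Against a real site, only the URL and the form fields would change; the local server is there purely so the sketch can be run in isolation.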

2. HTTP response
2.1. Response status code
An HTTP response begins with a status line, which contains a 3-digit status code indicating the result of processing the request. Common status codes include 200 (success), 404 (not found), and 500 (internal server error).
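
The first digit of the status code determines its class, which is usually what a crawler branches on (retry, follow a redirect, or give up). A minimal sketch, with the hypothetical class StatusCodeExample and hypothetical helpers parseStatusCode and classify:

```java
public class StatusCodeExample {
    // Extracts the numeric status code from a status line such as "HTTP/1.1 200 OK".
    static int parseStatusCode(String statusLine) {
        // Format: version SP status-code SP reason-phrase
        String[] parts = statusLine.trim().split(" ", 3);
        return Integer.parseInt(parts[1]);
    }

    // Maps a status code to its class, which is determined by the first digit.
    static String classify(int code) {
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "HTTP/1.1 200 OK",
            "HTTP/1.1 404 Not Found",
            "HTTP/1.1 500 Internal Server Error"
        };
        for (String line : lines) {
            int code = parseStatusCode(line);
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

With HttpURLConnection there is no need to parse the status line by hand, because getResponseCode() already returns the number; the parsing step is shown only to illustrate the message format.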

2.2. Response header and response body
After the status line, an HTTP response contains response headers and, usually, a response body. The response headers carry metadata about the response, such as Content-Type (the media type of the body) and Content-Length (the size of the body in bytes). The response body contains the actual content, for example the HTML of the page.

The following is a sample code that uses Java's HttpURLConnection to receive an HTTP response:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class HttpResponseExample {
    public static void main(String[] args) throws Exception {
        // URL to request
        String url = "https://www.example.com/index.html";

        // Create the URL object
        URL obj = new URL(url);

        // Open the connection
        HttpURLConnection con = (HttpURLConnection) obj.openConnection();

        // Set the request method to GET
        con.setRequestMethod("GET");

        // Send the request and get the response status code
        int responseCode = con.getResponseCode();
        System.out.println("Response status code: " + responseCode);

        // Collect the response headers; the status line is stored under a null key
        StringBuilder responseHeader = new StringBuilder();
        for (Map.Entry<String, List<String>> header : con.getHeaderFields().entrySet()) {
            if (header.getKey() == null) {
                continue; // skip the status line
            }
            responseHeader.append(header.getKey()).append(": ")
                    .append(String.join(", ", header.getValue())).append("\n");
        }
        System.out.println("Response headers:\n" + responseHeader);

        // Read the response body, decoding it as UTF-8 (an assumption about the page)
        StringBuilder responseBody = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                responseBody.append(inputLine).append('\n');
            }
        }

        // Print the response body
        System.out.println("Response body: " + responseBody);
    }
}
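
The Content-Type header often names the character encoding of the body, which matters for crawlers that fetch pages in encodings other than UTF-8. The following sketch extracts that charset from a header value such as the one returned by con.getContentType(). The class name ContentTypeCharsetExample and the helper charsetOf are hypothetical, and falling back to UTF-8 is an assumption made here, not a rule from the article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharsetExample {
    // Returns the charset named in a Content-Type value, or UTF-8 when none is usable.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            // Parameters follow the media type, separated by semicolons.
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: fall through to the default.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumed default for this sketch
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The returned Charset can be passed to InputStreamReader in place of a hard-coded encoding when reading the response body.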

Conclusion:
This article introduced the core technology behind Java crawlers: HTTP requests and responses. With a basic understanding of URLs, URIs, and HTTP request methods, we can send different types of HTTP requests as needed. With an understanding of the response status code, response headers, and response body, we can read the response returned by the server and extract the required data from it. These building blocks are the foundation of an efficient and stable Java crawler.
