Jsoup crawls page data and understands HTTP message headers
A book recommendation: Hacker Attack and Defense Technology Collection: Web Practical Chapter.
By the way, here is a question to think about: is it possible to bring a website, or a small name server, down by hitting it with a large number of requests through jsoup? In fact, anyone familiar with jsoup could use its URL parsing to do something quite shameless (the source code stays confidential). Anyway, here is a brief introduction to jsoup.
jsoup is a Java-based HTML parser that can directly parse a URL, an HTML string, or an HTML file. It provides a very low-effort API for retrieving and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Download the core library from the official website, http://jsoup.org/download, and import it into your project.
1: Parse HTML text string
I have also collected some information about HTTP message headers. A request message sent by a browser looks like this:
GET /simple.htm HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Host: localhost:8080
Connection: Keep-Alive
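To make the structure of the request above concrete, here is a minimal sketch in plain Java that splits a raw request head into its request line and header fields. The class `RequestHeadParser` is a hypothetical helper written for this illustration; it is not part of jsoup or the JDK, and it assumes each header name appears only once.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only (not a jsoup or JDK class).
class RequestHeadParser {
    final String method;   // e.g. GET
    final String target;   // e.g. /simple.htm
    final String version;  // e.g. HTTP/1.1
    // Header names are stored in lower case because they are case-insensitive.
    final Map<String, String> headers = new LinkedHashMap<>();

    RequestHeadParser(String rawHead) {
        // Lines end with CRLF on the wire; a bare LF is tolerated here too.
        String[] lines = rawHead.split("\r?\n");
        // Request line: METHOD SP request-target SP HTTP-version
        String[] requestLine = lines[0].split(" ", 3);
        if (requestLine.length != 3) {
            throw new IllegalArgumentException("Malformed request line: " + lines[0]);
        }
        method = requestLine[0];
        target = requestLine[1];
        version = requestLine[2];

        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                break; // a blank line ends the header section
            }
            // Split on the FIRST colon only, so "Host: localhost:8080" keeps its port.
            int colon = lines[i].indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed header: " + lines[i]);
            }
            String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = lines[i].substring(colon + 1).trim();
            headers.put(name, value); // assumption: a repeated header simply overwrites
        }
    }

    String header(String name) {
        return headers.get(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String raw = "GET /simple.htm HTTP/1.1\r\n"
                + "Accept-Language: zh-cn\r\n"
                + "Accept-Encoding: gzip, deflate\r\n"
                + "Host: localhost:8080\r\n"
                + "Connection: Keep-Alive\r\n"
                + "\r\n";
        RequestHeadParser request = new RequestHeadParser(raw);
        System.out.println(request.method + " " + request.target + " " + request.version);
        System.out.println("Host = " + request.header("Host"));
    }
}
```

Splitting on the first colon matters for values such as `localhost:8080`, which contain a colon themselves.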
The complete HTTP message sent back by the server is as follows:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
X-Powered-By: ASP.NET
Date: Fri, 03 Mar 2006 06:34:03 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Fri, 03 Mar 2006 06:33:18 GMT
ETag: "5ca4f75b8c3ec61:9ee"
Content-Length: 37

<html><body>hello world</body></html>
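The Content-Length value in the response counts the bytes of the message body that follows the blank line. As a rough sketch, the check below assumes the body of this response is the 37-byte document `<html><body>hello world</body></html>`, which is consistent with the header value of 37. The class name `ContentLengthCheck` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class ContentLengthCheck {
    // Returns true when the declared Content-Length equals the body's byte count.
    static boolean contentLengthMatches(String rawResponse) {
        // The first empty line (CRLF CRLF) separates the head from the body.
        int split = rawResponse.indexOf("\r\n\r\n");
        if (split < 0) {
            return false;
        }
        String head = rawResponse.substring(0, split);
        String body = rawResponse.substring(split + 4);

        for (String line : head.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0
                    && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                int declared = Integer.parseInt(line.substring(colon + 1).trim());
                // Content-Length counts bytes, not characters.
                // Assumption: the body is sent as UTF-8 (identical to ASCII here).
                return declared == body.getBytes(StandardCharsets.UTF_8).length;
            }
        }
        return false; // no Content-Length header found
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 200 OK\r\n"
                + "Server: Microsoft-IIS/5.1\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 37\r\n"
                + "\r\n"
                + "<html><body>hello world</body></html>";
        System.out.println(contentLengthMatches(response));
    }
}
```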
Note:
Overview of HTTP request headers
An HTTP client program (for example, a browser) must specify the request type (usually GET or POST) when sending a request to the server. If necessary, the client can also send other request headers. Most request headers are optional. The exceptions are Host, which HTTP/1.1 requires on every request, and Content-Length, which must appear on a POST request that carries a body (unless chunked transfer encoding is used).
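As an illustration of the Content-Length requirement for POST, the following sketch assembles a raw POST request by hand and derives Content-Length from the byte length of the body. The class `PostRequestBuilder`, the path `/login`, and the form fields are all made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class PostRequestBuilder {
    // Builds a raw HTTP/1.1 POST request with a form-encoded body.
    static String buildPost(String host, String path, String formBody) {
        // Content-Length is the number of BYTES in the body, so encode first.
        int length = formBody.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"            // blank line ends the header section
                + formBody;
    }

    public static void main(String[] args) {
        // "/login" and the form fields are placeholder values.
        System.out.print(buildPost("localhost:8080", "/login", "user=alice&lang=zh-cn"));
    }
}
```

In practice an HTTP client library usually fills in Content-Length itself; building the message by hand is only meant to show where the number comes from.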
The following are some of the most common request headers.
Accept: The MIME types accepted by the browser.
Accept-Charset: The character sets acceptable to the browser.
Accept-Encoding: The content encodings the browser can decode, such as gzip. A Servlet can return gzip-encoded HTML pages to browsers that support gzip; in many cases this can reduce download time by a factor of 5 to 10.
Accept-Language: The languages the browser prefers, used when the server can provide more than one language version.
Authorization: Authorization credentials, usually sent in response to a WWW-Authenticate header from the server.
Connection: Indicates whether a persistent connection is required. If the Servlet sees that the value here is "Keep-Alive", or the request uses HTTP 1.1 (which uses persistent connections by default), it can take advantage of persistent connections to significantly reduce the download time when the page contains multiple elements (such as applets or images). To achieve this, the Servlet needs to send a Content-Length header in the response. The simplest way to do so is to first write the content to a ByteArrayOutputStream, and then calculate its size before actually writing the content out.
Content-Length: Indicates the length of the request message body.
Cookie: One of the most important request headers; it carries cookies previously set by the server back to it.
From: The email address of the request sender. It is used by some special web client programs, but not by browsers.
Host: The host and port from the original URL.
If-Modified-Since: Return the requested content only if it has been modified after the specified date, otherwise return a 304 "Not Modified" response.
Pragma: Specifying a "no-cache" value means that the server must return a refreshed document, even if it is a proxy server and already has a local copy of the page.
Referer: The URL of the page from which the user reached the currently requested page.
User-Agent: The browser type. This value is very useful when the content returned by the Servlet depends on the browser type.
UA-Pixels, UA-Color, UA-OS, UA-CPU: Non-standard request headers sent by certain versions of IE browsers, indicating screen size, color depth, operating system and CPU type.
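The Connection and Accept-Encoding entries above describe two server-side techniques: buffering the body in a ByteArrayOutputStream so that Content-Length is known before anything is sent, and gzip-compressing the body when the browser accepts it. A minimal sketch of both, in plain Java without the Servlet API, might look like this. The class `BufferedResponseWriter` is hypothetical, and the gzip check is deliberately naive (it ignores quality values such as `gzip;q=0`).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only (no Servlet API involved).
class BufferedResponseWriter {
    // Builds a complete HTTP/1.1 response for the given HTML page.
    // acceptEncoding is the request's Accept-Encoding value, or null if absent.
    static byte[] buildResponse(String html, String acceptEncoding) {
        // Naive negotiation: any mention of "gzip" counts as support.
        boolean gzip = acceptEncoding != null
                && acceptEncoding.toLowerCase(Locale.ROOT).contains("gzip");
        try {
            // 1. Write the body into a buffer first, so its final size is known.
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] page = html.getBytes(StandardCharsets.UTF_8);
            if (gzip) {
                // Closing the GZIP stream flushes the trailer into the buffer.
                try (GZIPOutputStream out = new GZIPOutputStream(body)) {
                    out.write(page);
                }
            } else {
                body.write(page);
            }

            // 2. Now the header section can state the exact body length.
            StringBuilder head = new StringBuilder();
            head.append("HTTP/1.1 200 OK\r\n");
            head.append("Content-Type: text/html; charset=UTF-8\r\n");
            if (gzip) {
                head.append("Content-Encoding: gzip\r\n");
            }
            head.append("Content-Length: ").append(body.size()).append("\r\n");
            head.append("Connection: keep-alive\r\n");
            head.append("\r\n");

            // 3. Send the head, then the buffered body.
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            response.write(head.toString().getBytes(StandardCharsets.US_ASCII));
            body.writeTo(response);
            return response.toByteArray();
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] response = buildResponse(
                "<html><body>hello world</body></html>", "gzip, deflate");
        System.out.println("Total response size: " + response.length + " bytes");
    }
}
```

Note that when the body is compressed, Content-Length is the size of the compressed bytes, which is why the header is computed only after the buffer has been filled.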