Crawling page data with Jsoup and understanding HTTP message headers

WBOY (original)
2016-06-24 11:55:50

A book recommendation: "Hacker Attack and Defense Technology Collection: Web Practical Chapter".

Incidentally, here is a question to think about: could a large volume of requests sent through jsoup overwhelm a website or a small name server and bring it down? In fact, anyone familiar with jsoup can use its URL parsing to do something rather shameless (the source code stays confidential). Joking aside, here is a brief introduction to jsoup.

jsoup is a Java-based HTML parser that can directly parse a URL address, HTML text string, and HTML file. It provides a very low-effort API to retrieve and manipulate data through DOM, CSS, and jQuery-like manipulation methods.

Official download page: http://jsoup.org/download. Download the core library and import it into your project.

1: Parse HTML text string

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /**
     * Parse an HTML document supplied as a String.
     */
    public static void parseStringHtml(String html) {
        Document doc = Jsoup.parse(html);          // Convert the String into a Document
        Elements e = doc.body().getAllElements();  // All elements under body (body itself included)
        Elements e1 = doc.select("head");          // The head element(s)
        Element e2 = doc.getElementById("p");      // The element with id="p", or null if there is none
        System.out.println(e1);
    }

2: Parse a URL

This part is the key point. Some sites reject plain requests; sites on the CSDN domain are one example. In that case you must set the request headers (the User-Agent in particular) so that the request looks like it comes from a browser. Otherwise jsoup reports an error such as "HTTP error fetching URL. Status=403" or another HTTP status exception. For the specific HTTP status codes, please refer to the last part of this article or to the recommended book.

    import java.io.IOException;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    /**
     * Fetch and parse the HTML at the requested address.
     */
    public static void parseRequestUrl(String url) throws IOException {
        Connection con = Jsoup.connect(url); // Create the request connection
        // Uncomment the headers below if the site rejects the default request (for example with Status=403).
        // MIME types acceptable to the browser:
        // con.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        // con.header("Accept-Encoding", "gzip, deflate");
        // con.header("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
        // con.header("Connection", "keep-alive");
        // Host takes a host name, not the full URL:
        // con.header("Host", new java.net.URL(url).getHost());
        // con.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0");
        Document doc = con.get();
        Elements hrefs = doc.select("a[href=/kff517]"); // The attribute selector after the element name is optional
        Elements test = doc.select("html body div#container div#body div#main div.main div#article_details.details div.article_manage span.link_view");
        System.out.println(hrefs);
        System.out.println(test.text()); // text() returns the text content, much like jQuery's text(); html() would return the inner HTML
    }

3: Parse a local HTML file

This works the same way as the two cases above; only the way the Document is obtained changes.
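
The original gives no code for this case, so the following is only a minimal sketch, assuming the file is UTF-8 encoded. The class name, the temporary sample file and its markup are hypothetical; `Jsoup.parse(File, String)` is the jsoup overload that reads a file using the given charset name.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseLocalHtml {

    /** Parse a local HTML file; only the way the Document is obtained differs. */
    public static Document parseLocalFile(File file) throws IOException {
        // The second argument names the charset used to decode the file (assumed UTF-8 here)
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, created only so the sketch can run on its own
        File file = File.createTempFile("sample", ".html");
        file.deleteOnExit();
        Files.write(file.toPath(),
                "<html><head><title>demo</title></head><body><p id=\"p\">hello</p></body></html>"
                        .getBytes(StandardCharsets.UTF_8));

        Document doc = parseLocalFile(file);
        System.out.println(doc.title());
        System.out.println(doc.getElementById("p").text());
    }
}
```

Once the Document exists, the same `select` and `getElementById` calls used in the earlier sections apply unchanged.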


Here is some information I collected about HTTP message headers:

GET /simple.htm HTTP/1.1 --- Request method, requested resource, and HTTP protocol version
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */* --- The content types the browser can accept
Accept-Language: zh-cn --- Accepted language
Accept-Encoding: gzip, deflate --- Accepted encodings
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727) --- Information about the client machine, including browser type and operating system. Many websites can display your browser and operating system version because they read it from this header.
Host: localhost:8080 --- Host and port; on the Internet this is generally the domain name
Connection: Keep-Alive --- Whether a persistent connection is requested
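
To make the layout of such a request concrete, here is a rough sketch that assembles one as text: a request line, one line per header, then an empty line, each terminated by CRLF. The class and method names are hypothetical, and a real client such as jsoup builds this message for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RawRequestSketch {

    /** Assemble a request message: request line, header lines, then an empty line. */
    public static String build(String method, String path, Map<String, String> headers) {
        StringBuilder sb = new StringBuilder();
        sb.append(method).append(' ').append(path).append(" HTTP/1.1\r\n");
        for (Map.Entry<String, String> h : headers.entrySet()) {
            sb.append(h.getKey()).append(": ").append(h.getValue()).append("\r\n");
        }
        sb.append("\r\n"); // the empty line marks the end of the headers
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the headers in insertion order
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "zh-cn");
        headers.put("Accept-Encoding", "gzip, deflate");
        headers.put("Host", "localhost:8080");
        headers.put("Connection", "Keep-Alive");
        System.out.print(build("GET", "/simple.htm", headers));
    }
}
```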


The complete HTTP message sent back by the server is as follows:
HTTP/1.1 200 OK --- HTTP/1.1 is the protocol used; 200 OK is the status code returned by the server, meaning the request was handled normally
Server: Microsoft-IIS/5.1
X-Powered-By: ASP.NET
Date: Fri, 03 Mar 2006 06:34:03 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Fri, 03 Mar 2006 06:33:18 GMT<CR>
ETag: "5ca4f75b8c3ec61:9ee"
Content-Length: 37

hello world

Note: <CR> was added by me to mark a line break. It is not part of the message and can be deleted.
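
The structure of the response (status line, headers, empty line, body) can be illustrated with a small parser. This is only a sketch under the assumption of a well-formed message with CRLF line endings; the class and field names are hypothetical, and jsoup performs this parsing internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RawResponseSketch {
    public final String protocol;
    public final int statusCode;
    public final String reason;
    public final Map<String, String> headers = new LinkedHashMap<>();
    public final String body;

    public RawResponseSketch(String raw) {
        // Headers and body are separated by the first empty line
        int split = raw.indexOf("\r\n\r\n");
        String head = split >= 0 ? raw.substring(0, split) : raw;
        body = split >= 0 ? raw.substring(split + 4) : "";

        String[] lines = head.split("\r\n");
        // Status line: protocol, status code, reason phrase
        String[] status = lines[0].split(" ", 3);
        protocol = status[0];
        statusCode = Integer.parseInt(status[1]);
        reason = status.length > 2 ? status[2] : "";

        // Remaining lines are "Name: value" pairs; split on the first colon only
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
            }
        }
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                + "Server: Microsoft-IIS/5.1\r\n"
                + "Content-Type: text/html\r\n"
                + "\r\n"
                + "hello world";
        RawResponseSketch r = new RawResponseSketch(raw);
        System.out.println(r.statusCode + " " + r.reason);
        System.out.println(r.headers.get("Content-Type"));
        System.out.println(r.body);
    }
}
```

Splitting each header on the first colon only matters for values such as the Date header, which contain further colons.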

Overview of HTTP request headers
An HTTP client program (for example, a browser) must specify the request type (usually GET or POST) when sending a request to the server. If necessary, the client can also send other request headers. Most request headers are optional, with the exception of Content-Length: a POST request that carries a body must include it. (HTTP/1.1 additionally requires the Host header.)
The following are some of the most common request headers.

Accept: The MIME types accepted by the browser.
Accept-Charset: The character sets acceptable to the browser.
Accept-Encoding: The data encodings the browser can decode, such as gzip. A Servlet can return gzip-encoded HTML pages to browsers that support gzip, which in many cases reduces download time by a factor of 5 to 10.
Accept-Language: The language the browser prefers, used when the server can provide more than one language version.
Authorization: Authorization information, usually sent in answer to a WWW-Authenticate header from the server.
Connection: Indicates whether a persistent connection is required. If the Servlet sees the value "Keep-Alive" here, or the request uses HTTP 1.1 (which uses persistent connections by default), it can take advantage of persistent connections to significantly reduce download time when the page contains multiple elements (such as Applets and images). To achieve this, the Servlet needs to send a Content-Length header in the response. The simplest way is to write the content to a ByteArrayOutputStream first and calculate its size before actually writing the content out.
Content-Length: The length of the request message body, in bytes.
Cookie: One of the most important request headers.
From: The email address of the request sender. It is used by some special web client programs and is not sent by browsers.
Host: The host and port from the original URL.
If-Modified-Since: Return the requested content only if it has been modified after the specified date; otherwise return a 304 "Not Modified" response.
Pragma: A value of "no-cache" means the server must return a fresh document, even if it is a proxy server that already has a local copy of the page.
Referer: The URL of the page from which the user reached the currently requested page.
User-Agent: The browser type. This value is very useful when the content returned by the Servlet depends on the browser type.
UA-Pixels, UA-Color, UA-OS, UA-CPU: Non-standard request headers sent by certain versions of the IE browser, indicating screen size, color depth, operating system and CPU type.
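
The Connection entry above describes buffering the content in a ByteArrayOutputStream so that Content-Length is known before anything is sent. A minimal sketch of that idea follows. It writes to a plain OutputStream rather than a real Servlet response, and the class and method names are hypothetical; in a Servlet the size would typically be passed to `response.setContentLength(...)` instead of being written by hand.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class ContentLengthSketch {

    /** Buffer the body first so its size is known before any header is written. */
    public static void writeResponse(OutputStream out, String html) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(html.getBytes(StandardCharsets.UTF_8));

        // Content-Length counts bytes, not characters, so it is taken from the buffer
        String head = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=UTF-8\r\n"
                + "Content-Length: " + buffer.size() + "\r\n"
                + "Connection: Keep-Alive\r\n"
                + "\r\n";
        out.write(head.getBytes(StandardCharsets.US_ASCII));
        buffer.writeTo(out); // only now is the body actually written out
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        writeResponse(System.out, "<html><body>hello world</body></html>");
    }
}
```

With an exact Content-Length the client knows where the response ends, so the connection can stay open for the next request.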
