


Crawling page data with jsoup and understanding HTTP message headers
A book recommendation: Hacker Attack and Defense Technology Collection: Web Practical Chapter.
By the way, here is a question to think about: could a website be brought down by hitting it with a large volume of requests through jsoup or a small name server? In fact, anyone familiar with jsoup can use its URL parsing to do something rather shameless (the source code is kept confidential). Haha. Now for a brief introduction to jsoup.
jsoup is a Java-based HTML parser that can directly parse a URL, an HTML text string, or an HTML file. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
Download the core library from the official website, http://jsoup.org/download, and import it into your project.
1: Parse an HTML text string
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /**
     * Parse an HTML document supplied as a String.
     */
    public static void parseStringHtml(String html) {
        Document doc = Jsoup.parse(html); // Convert the String into a Document
        Elements e = doc.body().getAllElements(); // Get all elements under body
        Elements e1 = doc.select("head"); // Get the head element set
        Element e2 = doc.getElementById("p"); // Get the element with id="p" in the HTML
        System.out.println(e1);
    }
2: Parse HTML fetched from a request URL
    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    /**
     * Get HTML through the request address.
     */
    public static void parseRequestUrl(String url) throws IOException {
        Connection con = Jsoup.connect(url); // Get the request connection
        // // MIME types acceptable to the browser.
        // con.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        // con.header("Accept-Encoding", "gzip, deflate");
        // con.header("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
        // con.header("Connection", "keep-alive");
        // con.header("Host", url); // Note: Host should be the host name, not the full URL
        // con.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0");
        Document doc = con.get();
        Elements hrefs = doc.select("a[href=/kff517]"); // The attribute filter after the tag name is optional
        Elements test = doc.select("html body div#container div#body div#main div.main div#article_details.details div.article_manage span.link_view");
        System.out.println(hrefs);
        System.out.println(test.text()); // text() returns the text inside the matched nodes (html() returns the inner HTML), similar to the jQuery methods
    }
3: Parse a local HTML file. This works the same way; only the way the Document is obtained changes, for example via Jsoup.parse(File in, String charsetName).
Collected some information about HTTP message headers:
GET /simple.htm HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Host: localhost:8080
Connection: Keep-Alive
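As a rough illustration of how the request above breaks down, the following sketch splits a raw request into its request line and a map of headers, using only the Java standard library. The class name RawRequestParser and its method names are hypothetical, chosen for this example; a real server also has to deal with repeated headers, message bodies and malformed input.

```java
import java.util.Map;
import java.util.TreeMap;

public class RawRequestParser {

    /** Returns the three parts of the request line: method, target, version. */
    public static String[] parseRequestLine(String rawRequest) {
        String firstLine = rawRequest.split("\r?\n", 2)[0];
        return firstLine.trim().split(" ", 3);
    }

    /** Collects "Name: value" lines into a map with case-insensitive keys. */
    public static Map<String, String> parseHeaders(String rawRequest) {
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        String[] lines = rawRequest.split("\r?\n");
        for (int i = 1; i < lines.length; i++) { // line 0 is the request line
            String line = lines[i];
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "GET /simple.htm HTTP/1.1\r\n"
                + "Accept-Language: zh-cn\r\n"
                + "Accept-Encoding: gzip, deflate\r\n"
                + "Host: localhost:8080\r\n"
                + "Connection: Keep-Alive\r\n"
                + "\r\n";
        String[] requestLine = parseRequestLine(raw);
        System.out.println("method=" + requestLine[0] + " target=" + requestLine[1]);
        // Header names are case-insensitive, so "host" finds "Host".
        System.out.println("host=" + parseHeaders(raw).get("host"));
    }
}
```

The map uses String.CASE_INSENSITIVE_ORDER because HTTP header names are case-insensitive, so Host, host and HOST all refer to the same header.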
The complete HTTP message sent back by the server is as follows:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
X-Powered-By: ASP.NET
Date: Fri, 03 Mar 2006 06:34:03 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Fri, 03 Mar 2006 06:33:18 GMT
ETag: "5ca4f75b8c3ec61:9ee"
Content-Length: 37

<html><body>hello world</body></html>
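One thing the response above illustrates is that Content-Length counts the bytes of the body only, not of the whole message. The sketch below checks this with the standard library; the class ResponseLengthCheck is hypothetical, and the body is assumed to be `<html><body>hello world</body></html>`, which is exactly 37 bytes long.

```java
import java.nio.charset.StandardCharsets;

public class ResponseLengthCheck {

    private static final String LENGTH_HEADER = "Content-Length:";

    /** Returns the status code from a status line such as "HTTP/1.1 200 OK". */
    public static int statusCode(String rawResponse) {
        String statusLine = rawResponse.split("\r\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ", 3)[1]);
    }

    /** Returns everything after the first blank line, i.e. the message body. */
    public static String body(String rawResponse) {
        int separator = rawResponse.indexOf("\r\n\r\n");
        return separator < 0 ? "" : rawResponse.substring(separator + 4);
    }

    /** Returns the declared Content-Length, or -1 if the header is absent. */
    public static int declaredLength(String rawResponse) {
        int end = rawResponse.indexOf("\r\n\r\n");
        String head = end < 0 ? rawResponse : rawResponse.substring(0, end);
        for (String line : head.split("\r\n")) {
            // Compare the header name case-insensitively.
            if (line.regionMatches(true, 0, LENGTH_HEADER, 0, LENGTH_HEADER.length())) {
                return Integer.parseInt(line.substring(LENGTH_HEADER.length()).trim());
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 37\r\n"
                + "\r\n"
                + "<html><body>hello world</body></html>"; // assumed body
        int actual = body(raw).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("status=" + statusCode(raw)
                + " declared=" + declaredLength(raw)
                + " actual=" + actual);
    }
}
```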
Note:
Overview of HTTP request headers
An HTTP client program (for example, a browser) must specify the request type (usually GET or POST) when sending a request to the server. If necessary, the client can also choose to send other request headers. Most request headers are optional, with the exception of Content-Length, which must be present for POST requests. (HTTP/1.1 additionally requires the Host header.)
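To make the Content-Length requirement concrete, here is a minimal sketch that assembles the text of a POST request and computes Content-Length from the encoded body. The class PostRequestBuilder, the path /login and the form fields are made up for illustration; only standard Java classes are used.

```java
import java.nio.charset.StandardCharsets;

public class PostRequestBuilder {

    /** Content-Length is the number of bytes in the body, not the number of characters. */
    public static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Builds the raw text of a form POST; host and path are supplied by the caller. */
    public static String build(String host, String path, String body) {
        return "POST " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: " + contentLength(body) + "\r\n"
                + "\r\n"
                + body;
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and form fields, for illustration only.
        System.out.println(build("localhost:8080", "/login", "user=alice&lang=zh-cn"));
        // A multi-byte character shows why bytes are counted rather than characters.
        System.out.println(contentLength("h\u00e9llo"));
    }
}
```

Counting bytes after encoding matters as soon as the body contains non-ASCII text: a five-character string can occupy more than five bytes in UTF-8.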
The following are some of the most common request headers.
Accept: The MIME type accepted by the browser.
Accept-Charset: The character sets acceptable to the browser.
Accept-Encoding: The content encodings that the browser can decode, such as gzip. A Servlet can return gzip-encoded HTML pages to browsers that support gzip. In many cases this can reduce download time by a factor of 5 to 10.
Accept-Language: The language type desired by the browser, used when the server can provide more than one language version.
Authorization: Authorization credentials, usually sent in reply to a WWW-Authenticate header from the server.
Connection: Indicates whether a persistent connection is required. If the Servlet sees the value "Keep-Alive" here, or the request uses HTTP 1.1 (which uses persistent connections by default), it can take advantage of persistent connections to significantly reduce download time when the page contains multiple elements (such as applets and images). To achieve this, the Servlet needs to send a Content-Length header in the response. The simplest way is to first write the content to a ByteArrayOutputStream and then determine its size before actually writing the content out.
Content-Length: Indicates the length of the request message body.
Cookie: One of the most important request headers; it returns to the server the cookies that the server set earlier.
From: The email address of the request sender. It is used by some special web client programs but not by browsers.
Host: The host and port in the initial URL.
If-Modified-Since: Return the requested content only if it has been modified after the specified date, otherwise return a 304 "Not Modified" response.
Pragma: Specifying a "no-cache" value means that the server must return a refreshed document, even if it is a proxy server and already has a local copy of the page.
Referer: The URL of the page from which the user reached the currently requested page.
User-Agent: The browser type. This value is very useful if the content returned by the Servlet depends on the browser type.
UA-Pixels, UA-Color, UA-OS, UA-CPU: Non-standard request headers sent by certain versions of IE browsers, indicating screen size, color depth, operating system and CPU type.
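The If-Modified-Since entry in the list above describes a simple decision that a server makes. The sketch below shows that decision in isolation, assuming both dates arrive as HTTP date strings; the class ConditionalGet and the method statusFor are hypothetical names, and a real server would also handle unparseable dates and ETag-based validation.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ConditionalGet {

    /**
     * Decides the status code for a conditional GET.
     * Both arguments are HTTP dates such as "Fri, 03 Mar 2006 06:33:18 GMT";
     * ifModifiedSince may be null when the client did not send the header.
     */
    public static int statusFor(String ifModifiedSince, String lastModified) {
        if (ifModifiedSince == null) {
            return 200; // unconditional request: send the content
        }
        ZonedDateTime since =
                ZonedDateTime.parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME);
        ZonedDateTime modified =
                ZonedDateTime.parse(lastModified, DateTimeFormatter.RFC_1123_DATE_TIME);
        // Modified after the client's copy was fetched: send it again, otherwise 304.
        return modified.isAfter(since) ? 200 : 304;
    }

    public static void main(String[] args) {
        String lastModified = "Fri, 03 Mar 2006 06:33:18 GMT";
        System.out.println(statusFor("Fri, 03 Mar 2006 06:33:18 GMT", lastModified));
        System.out.println(statusFor("Thu, 02 Mar 2006 00:00:00 GMT", lastModified));
        System.out.println(statusFor(null, lastModified));
    }
}
```

When the client's date is equal to or later than the resource's Last-Modified value, the server can answer 304 "Not Modified" with no body, which saves resending content the client already has.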
