First of all, after reading this article, it does not guarantee that you will become a master, but it can make you understand what a crawler is, how to use a crawler, and how to use the http protocol to invade other people's systems. Of course, it is just some simple tutorials. Get Some simple data;
Let’s start with the code and explain it step by step:
This is a tool class. You don’t need to look at it in detail. You can find tool classes for sending http requests everywhere on the Internet. There are few packages. Guide it yourself
package com.df.util; import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.io.OutputStreamWriter; import java.io.PrintWriter; import java.net.HttpURLConnection; import java.net.URL; import java.net.URLConnection; import java.util.List; import java.util.Map; import org.apache.log4j.Logger; import org.jsoup.Connection; import org.jsoup.Connection.Method; import org.jsoup.Connection.Response; import org.jsoup.Jsoup; import com.df.controller.DFContorller; public class HttpPosts { private final static Logger logger = Logger.getLogger(DFContorller.class); public static String sendPost(String url, String param) { PrintWriter out = null; BufferedReader in = null; String result = ""; try { URL realUrl = new URL(url); // 打开和URL之间的连接 URLConnection conn = realUrl.openConnection(); // 设置通用的请求属性 conn.setRequestProperty("accept", "*/*"); conn.setRequestProperty("connection", "Keep-Alive"); conn.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)"); // 发送POST请求必须设置如下两行 conn.setDoOutput(true); conn.setDoInput(true); // 获取URLConnection对象对应的输出流 out = new PrintWriter(conn.getOutputStream()); // 发送请求参数 out.print(param); // flush输出流的缓冲 out.flush(); // 定义BufferedReader输入流来读取URL的响应 in = new BufferedReader( new InputStreamReader(conn.getInputStream(),"utf-8")); String line; while ((line = in.readLine()) != null) { result += line; } } catch (Exception e) { logger.info("发送 POST 请求出现异常!"+e); e.printStackTrace(); } //使用finally块来关闭输出流、输入流 finally{ try{ if(out!=null){ out.close(); } if(in!=null){ in.close(); } } catch(IOException ex){ ex.printStackTrace(); } } return result; } public static String sendGet(String url, String param) { String result = ""; BufferedReader in = null; try { String urlNameString = url + "?" + param; URL realUrl = new URL(urlNameString); // 打开和URL之间的连接 URLConnection connection = realUrl.openConnection(); // 设置通用的请求属性 connection.setRequestProperty("accept", "*/*"); connection.setRequestProperty("connection", "Keep-Alive"); connection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)"); connection.setRequestProperty("Cookie","PHPSESSID=27roc4d0ccd2cg4jbht80k8km2"); // 建立实际的连接 connection.connect(); // 获取所有响应头字段 Map<String, List<String>> map = connection.getHeaderFields(); // 遍历所有的响应头字段 for (String key : map.keySet()) { System.out.println(key + "--->" + map.get(key)); } // 定义 BufferedReader输入流来读取URL的响应 in = new BufferedReader(new InputStreamReader( connection.getInputStream(),"utf-8")); String line; while ((line = in.readLine()) != null) { result += line; } } catch (Exception e) { System.out.println("发送GET请求出现异常!" + e); e.printStackTrace(); } // 使用finally块来关闭输入流 finally { try { if (in != null) { in.close(); } } catch (Exception e2) { e2.printStackTrace(); } } return result; } }
------------------------------------------Separating line
Let’s enter the topic: First of all, you have to enter first. You have to crawl the login page of the website, view the page source code, and find the method name for sending the login request; generally, small websites It will be written directly in the from surface action. It is easy to find. Medium-sized websites will not write it so directly. It takes some effort to find it. It may be in js or not on this page. It is recommended to use a packet capture tool to log in. Once, I looked at the captured request information. It was a large website. I crawled the JD.com backend. I used the f12 that comes with the browser. I couldn't get the login information. It disappeared in a flash. I finally tried a lot of tricks to get it. Jingdong’s login interface; implement crawling; after getting the login interface address; upload the code
String data=HttpPosts.sendGet (login address (without parameters; String type address), parameters (such as: user_id=6853&export =112)); (The returned login status is usually in json format. It will count whether you logged in successfully, some are true, some are 1, depending on the situation) Choose get or post to imitate the request of the login page
Then make another request to get the cookie
Connection conn = Jsoup.connect("登录后页面的地址"); conn.method(Method.GET); conn.followRedirects(false); Response response = conn.execute(); System.out.println(response.cookies());
Let’s talk about the cookie dynamically passed into the get or post method and replace it with the hard-coded cookie; because it is a test, the cookie is hard-coded and can be written dynamically;
Then you want to access the page after login, the home page, or the data page. It must contain cookies and basic parameter information of the http request, otherwise it will definitely be intercepted.
String data=HttpPosts.sendGet(login address (without parameters; String type address), parameters (such as: user_id=6853&export=112)); The access method is the same as above; what is returned to you this time is theirs Page, if you find a data interface on the opposite side, you can directly access it, and the data will be returned directly. Otherwise, you have to parse its page, which is very troublesome. Jsoup is generally used to parse pages.
In fact, this is a different kind of intrusion. You don’t need to know the other party’s interface document. You can use the http protocol to directly access the other party’s server using the program.
The above is the detailed content of Example analysis of Java crawler. For more information, please follow other related articles on the PHP Chinese website!

JVM works by converting Java code into machine code and managing resources. 1) Class loading: Load the .class file into memory. 2) Runtime data area: manage memory area. 3) Execution engine: interpret or compile execution bytecode. 4) Local method interface: interact with the operating system through JNI.

JVM enables Java to run across platforms. 1) JVM loads, validates and executes bytecode. 2) JVM's work includes class loading, bytecode verification, interpretation execution and memory management. 3) JVM supports advanced features such as dynamic class loading and reflection.

Java applications can run on different operating systems through the following steps: 1) Use File or Paths class to process file paths; 2) Set and obtain environment variables through System.getenv(); 3) Use Maven or Gradle to manage dependencies and test. Java's cross-platform capabilities rely on the JVM's abstraction layer, but still require manual handling of certain operating system-specific features.

JVMmanagesgarbagecollectionacrossplatformseffectivelybyusingagenerationalapproachandadaptingtoOSandhardwaredifferences.ItemploysvariouscollectorslikeSerial,Parallel,CMS,andG1,eachsuitedfordifferentscenarios.Performancecanbetunedwithflagslike-XX:NewRa

Java code can run on different operating systems without modification, because Java's "write once, run everywhere" philosophy is implemented by Java virtual machine (JVM). As the intermediary between the compiled Java bytecode and the operating system, the JVM translates the bytecode into specific machine instructions to ensure that the program can run independently on any platform with JVM installed.

The compilation and execution of Java programs achieve platform independence through bytecode and JVM. 1) Write Java source code and compile it into bytecode. 2) Use JVM to execute bytecode on any platform to ensure the code runs across platforms.

Java performance is closely related to hardware architecture, and understanding this relationship can significantly improve programming capabilities. 1) The JVM converts Java bytecode into machine instructions through JIT compilation, which is affected by the CPU architecture. 2) Memory management and garbage collection are affected by RAM and memory bus speed. 3) Cache and branch prediction optimize Java code execution. 4) Multi-threading and parallel processing improve performance on multi-core systems.

Using native libraries will destroy Java's platform independence, because these libraries need to be compiled separately for each operating system. 1) The native library interacts with Java through JNI, providing functions that cannot be directly implemented by Java. 2) Using native libraries increases project complexity and requires managing library files for different platforms. 3) Although native libraries can improve performance, they should be used with caution and conducted cross-platform testing.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Atom editor mac version download
The most popular open source editor

Notepad++7.3.1
Easy-to-use and free code editor

Dreamweaver Mac version
Visual web development tools

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

SecLists
SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.
