First of all, reading this article will not make you a master, but it will help you understand what a crawler is, how to use one, and how to use the HTTP protocol to get into other people's systems. Of course, these are just some simple tutorials for grabbing some simple data.
Let’s start with the code and explain it step by step:
This is a utility class; you don't need to read it in detail. Utility classes for sending HTTP requests can be found everywhere on the Internet. It needs only a few dependencies; add the imports yourself.
package com.df.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

import org.apache.log4j.Logger;

import com.df.controller.DFContorller;

public class HttpPosts {
    private final static Logger logger = Logger.getLogger(DFContorller.class);

    public static String sendPost(String url, String param) {
        PrintWriter out = null;
        BufferedReader in = null;
        String result = "";
        try {
            URL realUrl = new URL(url);
            // Open a connection to the URL
            URLConnection conn = realUrl.openConnection();
            // Set the common request properties
            conn.setRequestProperty("accept", "*/*");
            conn.setRequestProperty("connection", "Keep-Alive");
            conn.setRequestProperty("user-agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // The following two lines are required for sending a POST request
            conn.setDoOutput(true);
            conn.setDoInput(true);
            // Get the output stream of the URLConnection object (UTF-8 encoded)
            out = new PrintWriter(new OutputStreamWriter(conn.getOutputStream(), "utf-8"));
            // Send the request parameters
            out.print(param);
            // Flush the output stream buffer
            out.flush();
            // Use a BufferedReader to read the response from the URL
            in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "utf-8"));
            String line;
            while ((line = in.readLine()) != null) {
                result += line;
            }
        } catch (Exception e) {
            logger.info("Exception while sending POST request! " + e);
            e.printStackTrace();
        }
        // Close the output and input streams in the finally block
        finally {
            try {
                if (out != null) {
                    out.close();
                }
                if (in != null) {
                    in.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
        return result;
    }

    public static String sendGet(String url, String param) {
        String result = "";
        BufferedReader in = null;
        try {
            String urlNameString = url + "?" + param;
            URL realUrl = new URL(urlNameString);
            // Open a connection to the URL
            URLConnection connection = realUrl.openConnection();
            // Set the common request properties
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            connection.setRequestProperty("Cookie", "PHPSESSID=27roc4d0ccd2cg4jbht80k8km2");
            // Establish the actual connection
            connection.connect();
            // Get all the response header fields
            Map<String, List<String>> map = connection.getHeaderFields();
            // Iterate over all the response header fields
            for (String key : map.keySet()) {
                System.out.println(key + "--->" + map.get(key));
            }
            // Use a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "utf-8"));
            String line;
            while ((line = in.readLine()) != null) {
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request! " + e);
            e.printStackTrace();
        }
        // Close the input stream in the finally block
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }
}
------------------------------------------
Now let's get to the topic. First you have to log in, so you need to crawl the website's login page, view the page source, and find the endpoint that the login request is sent to. Small websites usually write it directly in the form's action attribute, so it is easy to find. Medium-sized websites won't write it so directly; it takes some effort to find, and it may be in a js file or not on this page at all. In that case it is recommended to log in once with a packet-capture tool and look at the captured request information. For a large website — I crawled JD.com's backend — the browser's built-in F12 developer tools couldn't capture the login request; it disappeared in a flash. I finally tried a lot of tricks before I got JD.com's login interface. Once you have the login interface address, you can implement the crawl. Here is the code:
String data = HttpPosts.sendGet(login address (a String URL, without parameters), parameters (e.g. user_id=6853&export=112)); — choose get or post to imitate the request the login page makes. The returned login status is usually in JSON format and tells you whether you logged in successfully; some sites return true, some return 1, it depends on the site.
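For example, here is a minimal sketch of this login step. The URL, the parameters, and the "true" / "code":1 checks are all hypothetical placeholders; check the JSON your target site actually returns:

// Hypothetical login call; the address and parameters below are made up for illustration.
String loginUrl = "http://example.com/admin/login";   // replace with the real login interface
String loginParams = "user_id=6853&export=112";       // replace with the real parameters
String data = HttpPosts.sendGet(loginUrl, loginParams);
// Crude success check; the actual JSON field (true, 1, "code"...) depends on the site.
if (data.contains("true") || data.contains("\"code\":1")) {
    System.out.println("Login succeeded: " + data);
} else {
    System.out.println("Login failed: " + data);
}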
Then send another request to get the cookie:
Connection conn = Jsoup.connect("address of the page after login");
conn.method(Method.GET);
conn.followRedirects(false);
Response response = conn.execute();
System.out.println(response.cookies());
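response.cookies() returns a Map<String, String> of cookie names and values. As a small sketch, you can join them into a Cookie header value to set on later requests:

// Build a "Cookie" header value from the name/value pairs Jsoup captured.
Map<String, String> cookies = response.cookies();
StringBuilder cookieHeader = new StringBuilder();
for (Map.Entry<String, String> entry : cookies.entrySet()) {
    if (cookieHeader.length() > 0) {
        cookieHeader.append("; ");
    }
    cookieHeader.append(entry.getKey()).append("=").append(entry.getValue());
}
System.out.println(cookieHeader); // e.g. PHPSESSID=27roc4d0ccd2cg4jbht80k8km2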
Ideally this cookie would be passed dynamically into the get or post method, replacing the hard-coded one in the tool class above; because this is just a test, the cookie is hard-coded there, but it can easily be made dynamic.
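A minimal sketch of what that could look like — an overload of sendGet, added to the HttpPosts class above, that takes the cookie as a parameter instead of hard-coding it (this overload is my own addition, not part of the original tool class):

// Hypothetical overload: same as sendGet(url, param), but the cookie is passed in.
public static String sendGet(String url, String param, String cookie) {
    String result = "";
    BufferedReader in = null;
    try {
        URLConnection connection = new URL(url + "?" + param).openConnection();
        connection.setRequestProperty("accept", "*/*");
        connection.setRequestProperty("connection", "Keep-Alive");
        connection.setRequestProperty("user-agent",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
        connection.setRequestProperty("Cookie", cookie); // dynamic instead of hard-coded
        connection.connect();
        in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "utf-8"));
        String line;
        while ((line = in.readLine()) != null) {
            result += line;
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try { if (in != null) in.close(); } catch (IOException ex) { ex.printStackTrace(); }
    }
    return result;
}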
Then, when you access pages after login — the home page or a data page — the request must carry the cookie and the basic HTTP request headers, otherwise it will definitely be intercepted.
String data = HttpPosts.sendGet(address (a String URL, without parameters), parameters (e.g. user_id=6853&export=112)); — the access method is the same as above. This time what comes back is their page. If you can find a data interface on the other side, you can access it directly and the data will be returned directly; otherwise you have to parse its page, which is very troublesome. Jsoup is generally used to parse pages.
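For instance, here is a small sketch of parsing the returned HTML with Jsoup. The CSS selector "table.list tr" is a made-up example; use selectors that match the target page's actual structure:

// Needs: org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
Document doc = Jsoup.parse(data);
Elements rows = doc.select("table.list tr"); // hypothetical selector for a data table
for (Element row : rows) {
    System.out.println(row.text()); // text content of each row
}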
In fact, this is a different kind of intrusion: you don't need to know the other party's interface documentation; you can use the HTTP protocol to access the other party's server directly from a program.