Web crawler - How to use java to crawl information and make a ranking system?

Question

While learning Java web development, I came across an interesting project. Our school requires students to swipe a card to check in for morning runs. The Physical Education Department provides a query website but no API. I want to build a website/WeChat backend that captures the information from the school website and stores it in a database, so that users can then query it through my website/WeChat...

PHP中文网 · Answer

I'm just saying this offhand, since I can't think of a better method.

Use Jsoup to crawl the page data, haha.

代言 · Answer

A few points come to mind; briefly:
1. Data capture: you can write your own crawler program and define a schedule for when it crawls the data.
2. Data processing: extract the useful content from the captured pages with jsoup or another method, then design the data structure. The student ID should be unique. You can have a student table and a morning-run record table, related through the student ID.
3. Ranking: my personal view is to sort by the number of runs. Sorting by time is unreasonable because there is no way to determine the actual morning-run time, so I will only discuss sorting by count here. You can store the run count directly as a field in the student table, which avoids querying the record table and improves efficiency. The trade-off is that this field must be maintained whenever new data is processed.
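The count-based ranking described above could be sketched roughly as follows. This is only an in-memory stand-in for the student table and its run-count field; the class name `RunRanking`, its methods, and the sample student IDs are all made up for illustration, and a real version would keep the counter in the database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: keep a per-student run counter (the "number of runs" field)
// and rank students by it. All names here are hypothetical.
class RunRanking {
    // studentId -> number of recorded morning runs
    private final Map<String, Integer> runCounts = new HashMap<>();

    // Call once for each newly captured run record, so the counter
    // stays in step with the record table.
    void recordRun(String studentId) {
        runCounts.merge(studentId, 1, Integer::sum);
    }

    // Number of runs recorded for one student (0 if unknown).
    int countFor(String studentId) {
        return runCounts.getOrDefault(studentId, 0);
    }

    // Student IDs ordered by run count, highest first; ties are broken by ID.
    List<String> ranking() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(runCounts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ids.add(entry.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        RunRanking ranking = new RunRanking();
        ranking.recordRun("2021001");
        ranking.recordRun("2021002");
        ranking.recordRun("2021002");
        System.out.println(ranking.ranking());
    }
}
```

Because the counter is updated at capture time, the ranking query never has to scan the run records.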

三叔 · Answer

Generally speaking, you use a tool like httpclient to fetch the response and parse the message entity (here, the HTML page). The next step is to parse the DOM elements with XPath, regular expressions, or jQuery-like selectors (such as the jsoup package) to get the data you want. If that is still too much trouble, you can use the webmagic framework.
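A minimal sketch of the fetch-then-parse flow, assuming Java 11 or later. It uses the JDK's built-in `java.net.http.HttpClient` rather than a third-party httpclient library, and a regular expression rather than jsoup. The table row layout, the class name `RunPageScraper`, and the sample HTML are assumptions; the real page will look different.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunPageScraper {
    // Assumed row layout: <tr><td>2024-03-01</td><td>06:45</td></tr>
    private static final Pattern ROW = Pattern.compile(
        "<tr>\\s*<td>(\\d{4}-\\d{2}-\\d{2})</td>\\s*<td>(\\d{2}:\\d{2})</td>\\s*</tr>");

    // Download the page body as text (needs network access and a real URL).
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Extract "date time" strings from every table row matching the assumed layout.
    static List<String> parseRuns(String html) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = ROW.matcher(html);
        while (matcher.find()) {
            runs.add(matcher.group(1) + " " + matcher.group(2));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Offline demo on a hand-written sample instead of a live page.
        String sample = "<table><tr><td>2024-03-01</td><td>06:45</td></tr>"
            + "<tr><td>2024-03-02</td><td>06:50</td></tr></table>";
        System.out.println(parseRuns(sample));
    }
}
```

Regular expressions are brittle against markup changes, so for anything beyond a simple table a DOM parser such as jsoup is probably the safer choice.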

巴扎黑 · Answer

Simulate login: open the login page in a browser and find the URL that receives the student ID and password; POST the data to that URL to simulate the login; then parse the Set-Cookie field from the response headers.
Data capture: send a GET request to the sports data page (carrying the cookie obtained in the previous step), read the response, and parse it with regular expressions to obtain the data.
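The login and cookie steps above might look something like this with the JDK's `java.net.http` API (Java 11+). The form field names `studentId` and `password`, the example URLs, and the class name `LoginSketch` are placeholders; the real names have to be read from the school's login form.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LoginSketch {
    // Build the POST that submits the login form. Field names are assumptions.
    static HttpRequest loginRequest(String loginUrl, String studentId, String password) {
        String form = "studentId=" + URLEncoder.encode(studentId, StandardCharsets.UTF_8)
            + "&password=" + URLEncoder.encode(password, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(loginUrl))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(form))
            .build();
    }

    // Reduce the Set-Cookie header values to a single Cookie header value.
    static String cookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Keep only "name=value"; drop attributes such as Path or HttpOnly.
            pairs.add(value.split(";", 2)[0].trim());
        }
        return String.join("; ", pairs);
    }

    // Build the GET for the data page, carrying the session cookie.
    static HttpRequest dataRequest(String dataUrl, String cookie) {
        return HttpRequest.newBuilder(URI.create(dataUrl))
            .header("Cookie", cookie)
            .GET()
            .build();
    }

    public static void main(String[] args) {
        // Offline demo: the header values below are made up.
        String cookie = cookieHeader(List.of("JSESSIONID=abc123; Path=/; HttpOnly"));
        System.out.println(dataRequest("https://example.com/runs", cookie).headers().map());
    }
}
```

In a real run, the login request would be sent with `HttpClient.send`, and the list passed to `cookieHeader` would come from `response.headers().allValues("set-cookie")`.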

Recommendation: cache the data each user queries, for example for 2 hours; redis is recommended for this. When a query comes in, first try to get the data from redis; if it is not there, simulate the login to fetch fresh data. The database can store the queried data as well, although I personally feel the database layer is not strictly necessary. If you do have it, you can also use it for data analysis and so on.
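The cache-first lookup can be illustrated without a Redis server. The sketch below swaps Redis for an in-memory map with a time-to-live, purely to show the control flow; `TtlCache`, the loader function, and the injectable clock are all hypothetical names, and with real Redis the expiry would be handled by the server instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// In-memory stand-in for a Redis cache with a fixed expiry per entry.
class TtlCache {
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Return the cached value if it is still fresh; otherwise call the
    // loader (e.g. simulate login and scrape) and cache its result.
    String get(String key, Function<String, String> loader) {
        long now = clock.getAsLong();
        Entry cached = entries.get(key);
        if (cached != null && now < cached.expiresAtMillis) {
            return cached.value;
        }
        String fresh = loader.apply(key);
        entries.put(key, new Entry(fresh, now + ttlMillis));
        return fresh;
    }

    public static void main(String[] args) {
        long twoHours = 2 * 60 * 60 * 1000L;
        TtlCache cache = new TtlCache(twoHours, System::currentTimeMillis);
        // The loader here is a placeholder for the real login-and-scrape step.
        System.out.println(cache.get("2021001", id -> "runs for " + id));
    }
}
```

A repeated query for the same student ID within two hours is served from the cache and never touches the school website.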
