My handwritten crawler runs successfully, but it is far too slow: it takes more than ten seconds to crawl a single record. Please give me some advice on improving its efficiency. Thank you!!
import ..... (imports elided by the poster)

I used jsoup at first and it was even slower than this: I couldn't get past the step `Document doc = Jsoup.parse(method.getResponseBodyAsString());`. It was a real headache. Someone suggested I use SAX to parse instead, but can SAX even parse HTML?
Reply: use multi-threading to make full use of your bandwidth.
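The point of the reply above is that the crawler spends almost all of its time waiting on the network, one page at a time, so fetching several pages in parallel is the biggest win. Below is a minimal sketch of that idea using an `ExecutorService`. The fetcher is injected as a function so the sketch runs without a network; in the real crawler it would be `u -> Jsoup.connect(u).get().html()` (the pool size, helper names, and stub fetcher here are all illustrative, not from the original post).

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

public class ParallelFetch {
    // Fetch many URLs concurrently instead of one after another.
    // The fetcher is injected so this sketch runs without a network;
    // in the real crawler it would wrap the jsoup call.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, Future<String>> futures = new LinkedHashMap<>();
        for (String u : urls) {
            futures.put(u, pool.submit(() -> fetcher.apply(u)));
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
            results.put(e.getKey(), e.getValue().get()); // blocks until that page is done
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList("/page1", "/page2", "/page3");
        // Stub fetcher standing in for the real jsoup call.
        Map<String, String> pages = fetchAll(urls, u -> "<html>" + u + "</html>", 3);
        System.out.println(pages.get("/page2")); // prints "<html>/page2</html>"
    }
}
```

With N worker threads, N pages are in flight at once, so total time drops roughly by a factor of N until bandwidth or the server becomes the limit. Be polite: a small pool (4 to 8 threads) is usually enough and avoids hammering the target site.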
/**
 * Fetches data from ************** (site masked by the poster).
 * @author wf
 */
public class DoMain5 {
    public Document getDoc(String url) {
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            System.out.println("Failed to parse document!!");
            e.printStackTrace();
        }
        return doc;
    }

    public static void main(String[] args) {
        Dao d = new Dao();
        DoMain5 dm = new DoMain5();
        String title = "";
        String section = "";
        String content = "";
        String contentTitle = "";
        int count = 630;
        String url = "******************";
        if (d.createTable()) {
            System.out.println("Table created successfully!!!");
            try {
                Document doc = dm.getDoc(url);
                System.out.println(doc);
                Element titles = doc.getElementById("maincontent");
                Elements lis = titles.getElementsByTag("li");
                // ********************* titles ****************************
                // Note: the loop condition and the declaration of `a` were eaten by the
                // forum's HTML stripping; reconstructed from how they are used below.
                for (int i = 1; i < lis.size(); i++) {
                    Elements a = lis.get(i).getElementsByTag("a");
                    if (a.toString().equals("")) {
                        title = lis.get(i).text();
                        contentTitle = title;
                        String[] data = {contentTitle, title, section, content, url};
                        if (d.pinsertData(data)) {
                            System.out.println("Question " + (i + 1) + " inserted successfully!!!");
                            System.out.println("*****************" + count + "*****************");
                        } else {
                            System.out.println("Question " + (i + 1) + " failed to insert!!!");
                            System.out.println("*****************" + count + "*****************");
                            break;
                        }
                        count++;
                        continue;
                    } else {
                        title = a.get(0).text();
                        url = "http:***************" + a.get(0).attr("href");
                        Document doc2 = dm.getDoc(url);
                        Element sections = doc2.getElementById("maincontent");
                        Elements ls = sections.getElementsByTag("li");
                        // ********************** sections ************************
                        for (int j = 507; j < ls.size(); j++) {
                            Elements link = ls.get(j).getElementsByTag("a");
                            if (link.toString().equals("")) {
                                section = ls.get(j).text();
                                contentTitle = title + " " + section;
                            } else {
                                section = link.get(0).text();
                                url = "http:****************" + link.get(0).attr("href");
                                Document doc3 = dm.getDoc(url);
                                Element contents = doc3.getElementById("maincontent");
                                content = contents.text();
                                // Clean up the content string
                                content = content.substring(content.indexOf("?") + "?".length());
                                content = content.replace("'", "''");
                                contentTitle = title + " " + section;
                            }
                            System.out.println("****************" + count + "******************");
                            System.out.println("Reading question " + (i + 1) + ", section " + (j + 1));
                            String[] data = {contentTitle, title, section, content, url};
                            if (d.pinsertData(data)) {
                                System.out.println("Question " + (i + 1) + ", section " + (j + 1) + " inserted successfully!!!");
                                System.out.println("*****************" + count + "*****************");
                                count++;
                            } else {
                                System.out.println("Question " + (i + 1) + ", section " + (j + 1) + " failed to insert!!!");
                                System.out.println("*****************" + count + "*****************");
                                break;
                            }
                        } // end inner for
                    }
                    System.out.println("Question " + (i + 1) + " collected");
                    break; // note: exits after the first question, as in the original post
                } // end outer for
                System.out.println("Collection completed!!");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
Thanks to everyone's suggestions and modifications, the program's efficiency has improved significantly. But now it randomly throws the following two exceptions while running. Please give me some advice on how to solve this:
1. java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:429)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
	at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
	at com.wanfang.dousact.DoMain5.main(DoMain5.java:61)
2. java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:519)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java)
	at sun.net.www.http.HttpClient.New(HttpClient.java:306)
	at sun.net.www.http.HttpClient.New(HttpClient.java:323)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:852)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:793)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:718)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:425)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
	at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
	at com.wanfang.dousact.DoMain5.main(DoMain5.java:87)
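Both traces mean the same thing: the remote server was slow to accept the connection or to send data, and the socket gave up. These are transient network errors, not bugs in the parsing, so the usual remedy is twofold: raise jsoup's connect/read timeout (e.g. `Jsoup.connect(url).timeout(10_000).get()`; `timeout` is a real jsoup `Connection` method), and retry the fetch a few times with a short back-off before giving up. Below is a minimal retry sketch; the jsoup call is abstracted as a `Callable` so the example runs without a network, and the attempt counts, sleep durations, and stub fetcher are illustrative assumptions.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class RetryFetch {
    // Retries a fetch up to maxAttempts times on SocketTimeoutException,
    // sleeping a little longer before each retry. In the real crawler the
    // Callable would be: () -> Jsoup.connect(url).timeout(10_000).get()
    static <T> T withRetry(Callable<T> fetch, int maxAttempts) throws Exception {
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                last = e; // transient: back off briefly, then try again
                Thread.sleep(100L * attempt);
            }
        }
        throw last; // all attempts timed out
    }

    public static void main(String[] args) throws Exception {
        // Stub that fails twice, then succeeds, to demonstrate the retry path.
        final int[] calls = {0};
        String page = withRetry(() -> {
            if (++calls[0] < 3) throw new SocketTimeoutException("Read timed out");
            return "<html>ok</html>";
        }, 5);
        System.out.println(page + " after " + calls[0] + " attempts");
        // prints "<html>ok</html> after 3 attempts"
    }
}
```

Wrapping the `getDoc` body in `withRetry` makes the crawler survive the occasional slow response instead of crashing mid-run; only a URL that times out every attempt still fails.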