My handwritten crawler program runs successfully, but it is far too slow: it takes more than ten seconds to crawl a single record. Please give me some advice on improving its efficiency. Thank you!!

WBOY · Original · 2016-06-24 12:25:31 · 891 views

Tags: Parser, HTML parsing, crawler

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Get **** data
 */
public class DoMain3 {

    /**
     * Get the page content for a given URL
     */
    public String getHtmlString(String url) {
        String hs = "";
        try {
            URL u = new URL(url);
            HttpURLConnection conn = (HttpURLConnection) u.openConnection();
            conn.setRequestProperty("User-Agent", "MSIE 7.0");
            StringBuffer HtmlString = new StringBuffer();
            BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
            String line = "";
            while ((line = br.readLine()) != null) {
                HtmlString.append(line + "\n");
            }
            hs = HtmlString.toString();
            System.out.println(url);
        } catch (Exception e) {
            System.out.println("Error loading the URL!!");
            e.printStackTrace();
        }
        return hs;
    }

    public static void main(String[] args) {
        Dao d = new Dao();
        DoMain3 dm = new DoMain3();
        String title = "";
        String section = "";
        String content = "";
        String contentTitle = "";
        int count = 110;

        String url = "http://*************************";
        if (d.createTable()) {
            System.out.println("Table created successfully!!!");
            try {
                // Load the title page
                Document doc = Jsoup.parse(dm.getHtmlString(url));
                Element titles = doc.getElementById("maincontent");
                Elements lis = titles.getElementsByTag("li");
                // ********************* Titles ****************************
                for (int i = 1; i < lis.size(); i++) {
                    Elements a = lis.get(i).getElementsByTag("a");
                    if (a.toString().equals("")) {
                        title = lis.get(i).text();
                        contentTitle = title;
                        String data[] = {contentTitle, title, section, content, url};
                        if (d.pinsertData(data)) {
                            System.out.println("Title " + (i + 1) + " inserted successfully!!!");
                            System.out.println("*****************" + count + "*****************");
                        } else {
                            System.out.println("Title " + (i + 1) + " insert failed!!!");
                            System.out.println("*****************" + count + "*****************");
                            break;
                        }
                        count++;
                        continue;
                    } else {
                        title = a.get(0).text();
                        url = "http://****************" + a.get(0).attr("href");
                        // Load the section page
                        Document doc2 = Jsoup.parse(dm.getHtmlString(url));
                        Element sections = doc2.getElementById("maincontent");
                        Elements ls = sections.getElementsByTag("li");
                        // ********************** Sections ************************
                        for (int j = 0; j < ls.size(); j++) {
                            Elements link = ls.get(j).getElementsByTag("a");
                            if (link.toString().equals("")) {
                                section = ls.get(j).text();
                                contentTitle = title + " " + section;
                            } else {
                                section = link.get(0).text();
                                url = "http:*******************" + link.get(0).attr("href");
                                // Load the content page
                                Document doc3 = Jsoup.parse(dm.getHtmlString(url));
                                Element contents = doc3.getElementById("maincontent");
                                content = contents.text();
                                // Clean up the content string
                                content = content.substring(content.indexOf("?") + "?".length());
                                content = content.replace("'", "''");
                                contentTitle = title + " " + section;
                            }
                            System.out.println("****************" + count + "******************");
                            System.out.println("Reading title " + (i + 1) + ", section " + (j + 1));

                            // Insert the row into the database
                            String data[] = {contentTitle, title, section, content, url};
                            if (d.pinsertData(data)) {
                                System.out.println("Title " + (i + 1) + ", section " + (j + 1) + " inserted successfully!!!");
                                System.out.println("*****************" + count + "*****************");
                                count++;
                            } else {
                                System.out.println("Title " + (i + 1) + ", section " + (j + 1) + " insert failed!!!");
                                System.out.println("*****************" + count + "*****************");
                                break;
                            }
                        } // end for
                    }

                    System.out.println("Title " + (i + 1) + " collection completed");

                } // end for

                System.out.println("Collection completed!!");

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

When I debug, execution always stalls for a long time at these two statements:
1. BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));

2. while ((line = br.readLine()) != null) {
       HtmlString.append(line + "\n");
   }
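
If execution stalls at those two statements, the program is almost certainly just waiting on the network: getInputStream() blocks until the server answers, and readLine() blocks until bytes arrive. One thing worth trying is to bound those waits with explicit timeouts on the HttpURLConnection; a minimal sketch (the URL is a placeholder, and the 5 s / 10 s limits are assumptions to tune for the target site):

import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        URL u = new URL("http://example.com/"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.setConnectTimeout(5000);  // assumed: give up on connecting after 5 s
        conn.setReadTimeout(10000);    // assumed: fail a stalled read after 10 s
        conn.setRequestProperty("User-Agent", "MSIE 7.0");
        System.out.println("HTTP " + conn.getResponseCode()); // forces the request
    }
}

With timeouts set, a slow server raises a SocketTimeoutException instead of stalling the whole crawl.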

Use jsoup; it makes crawling very simple.
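
For reference, a minimal sketch of fetching and parsing a page in one step with jsoup (the URL is a placeholder, and the userAgent/timeout values are assumptions):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupFetchExample {
    public static void main(String[] args) throws Exception {
        // One call replaces the manual getHtmlString() + Jsoup.parse() round trip
        Document doc = Jsoup.connect("http://example.com/page.html") // placeholder URL
                .userAgent("Mozilla/5.0") // assumed user-agent string
                .timeout(10000)           // assumed timeout in milliseconds
                .get();
        Elements links = doc.select("#maincontent li a"); // same element the crawler walks
        System.out.println(links.size() + " links found");
    }
}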

I did use jsoup at first, and it was even slower than this. I kept getting stuck at the step Document doc = Jsoup.parse(method.getResponseBodyAsString()); and couldn't get past it; it was a real headache. Someone suggested I use SAX for the parsing instead, but can SAX actually parse HTML?



Use multiple threads to make better use of your bandwidth.
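
Since nearly all of the time goes into waiting for pages, fetching several URLs in parallel usually gives the biggest win. A minimal sketch with a fixed thread pool (the pool size, placeholder URLs, and timeout are assumptions, not the poster's actual code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; in the real crawler these come from the title/section lists
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        ExecutorService pool = Executors.newFixedThreadPool(8); // assumed pool size; tune it
        List<Future<Document>> results = new ArrayList<>();
        for (final String u : urls) {
            // Each task downloads one page, so the network waits overlap
            results.add(pool.submit(() -> Jsoup.connect(u).timeout(10000).get()));
        }
        for (Future<Document> f : results) {
            Document doc = f.get();          // blocks until that page has arrived
            System.out.println(doc.title()); // parsing and the DB insert would go here
        }
        pool.shutdown();
    }
}

Keep the pool small enough that the target site does not start refusing connections.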

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Get ************** data
 * @author wf
 */
public class DoMain5 {

    public Document getDoc(String url) {
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (Exception e) {
            System.out.println("Failed to parse the document!!");
            e.printStackTrace();
        }
        return doc;
    }

    public static void main(String[] args) {
        Dao d = new Dao();
        DoMain5 dm = new DoMain5();

        String title = "";
        String section = "";
        String content = "";
        String contentTitle = "";
        int count = 630;

        String url = "******************";

        if (d.createTable()) {
            System.out.println("Table created successfully!!!");

            try {
                Document doc = dm.getDoc(url);
                System.out.println(doc);
                Element titles = doc.getElementById("maincontent");
                Elements lis = titles.getElementsByTag("li");
                // ********************* Titles ****************************
                for (int i = 1; i < lis.size(); i++) {
                    Elements a = lis.get(i).getElementsByTag("a");
                    if (a.toString().equals("")) {
                        title = lis.get(i).text();
                        contentTitle = title;

                        String data[] = {contentTitle, title, section, content, url};
                        if (d.pinsertData(data)) {
                            System.out.println("Title " + (i + 1) + " inserted successfully!!!");
                            System.out.println("*****************" + count + "*****************");
                        } else {
                            System.out.println("Title " + (i + 1) + " insert failed!!!");
                            System.out.println("*****************" + count + "*****************");
                            break;
                        }
                        count++;
                        continue;
                    } else {
                        title = a.get(0).text();

                        url = "http:***************" + a.get(0).attr("href");
                        Document doc2 = dm.getDoc(url);
                        Element sections = doc2.getElementById("maincontent");
                        Elements ls = sections.getElementsByTag("li");
                        // ********************** Sections ************************
                        for (int j = 507; j < ls.size(); j++) {
                            Elements link = ls.get(j).getElementsByTag("a");
                            if (link.toString().equals("")) {
                                section = ls.get(j).text();
                                contentTitle = title + " " + section;
                            } else {
                                section = link.get(0).text();
                                url = "http:****************" + link.get(0).attr("href");
                                Document doc3 = dm.getDoc(url);
                                Element contents = doc3.getElementById("maincontent");
                                content = contents.text();
                                // Clean up the content string
                                content = content.substring(content.indexOf("?") + "?".length());
                                content = content.replace("'", "''");
                                contentTitle = title + " " + section;
                            }

                            System.out.println("****************" + count + "******************");
                            System.out.println("Reading title " + (i + 1) + ", section " + (j + 1));

                            String data[] = {contentTitle, title, section, content, url};

                            if (d.pinsertData(data)) {
                                System.out.println("Title " + (i + 1) + ", section " + (j + 1) + " inserted successfully!!!");
                                System.out.println("*****************" + count + "*****************");
                                count++;
                            } else {
                                System.out.println("Title " + (i + 1) + ", section " + (j + 1) + " insert failed!!!");
                                System.out.println("*****************" + count + "*****************");
                                break;
                            }
                        } // end for

                        System.out.println("Title " + (i + 1) + " collection completed");
                        break;
                    }
                } // end for

                System.out.println("Collection completed!!");

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
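
A separate cost in both versions is the row-at-a-time d.pinsertData(data) call. The Dao class is not shown, but assuming it wraps plain JDBC, switching to a PreparedStatement with batching cuts the per-row overhead and also makes the manual replace("'", "''") escaping unnecessary; a hypothetical sketch (the connection string, table, and column names are invented for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and schema, for illustration only
        try (Connection c = DriverManager.getConnection("jdbc:mysql://localhost/crawler", "user", "pass");
             PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO pages (content_title, title, section, content, url) VALUES (?, ?, ?, ?, ?)")) {
            c.setAutoCommit(false); // commit once per batch, not once per row
            String[][] rows = {};   // the collected {contentTitle, title, section, content, url} rows
            for (String[] row : rows) {
                for (int k = 0; k < row.length; k++) {
                    ps.setString(k + 1, row[k]); // bound parameters handle quoting safely
                }
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip for the whole batch
            c.commit();
        }
    }
}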

After everyone's helpful suggestions and changes, the program is now noticeably faster, but while running it randomly throws the following two exceptions. Please give me some advice on how to fix them:

1. java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:429)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
at com.wanfang.dousact.DoMain5.main(DoMain5.java:61)

2. java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:519)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient.java:323)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:852)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:793)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:718)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:425)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
at com.wanfang.dousact.DoMain5.main(DoMain5.java:87)
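
Both exceptions are plain network timeouts: jsoup's default read timeout is short (3 seconds in the jsoup versions of that era), so a slow or throttling server will trip it occasionally. A common mitigation is to raise the timeout and retry a few times before giving up; a minimal sketch (the 15 s timeout, three attempts, and back-off delay are assumed values):

import java.net.SocketTimeoutException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryingFetch {
    // Fetch a URL, retrying on timeout; returns null once the last attempt fails
    public static Document getDoc(String url) {
        for (int attempt = 1; attempt <= 3; attempt++) { // assumed: three attempts
            try {
                return Jsoup.connect(url)
                        .timeout(15000) // assumed: 15 s instead of the short default
                        .get();
            } catch (SocketTimeoutException e) {
                System.out.println("Timeout on attempt " + attempt + " for " + url);
                try {
                    Thread.sleep(2000L * attempt); // assumed back-off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            } catch (Exception e) {
                e.printStackTrace();
                return null;
            }
        }
        return null;
    }
}

Sleeping briefly between requests also reduces the chance that the server is throttling the crawler in the first place.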
