
Web crawler - Java crawler has obtained the image link but cannot download the image

The crawler extracts the src of the image resource from the page's HTML.

But when my code turns that src into a full link and tries to download the image, the server returns a 400 error.

When I tested the link in Chrome, I found that the address the site's server actually accepts is different.

That is, the link built from the src in the web page is
http://www.neofactory.co.jp/i... 2.jpg
whereas the link that actually returns the image is
http://www.neofactory.co.jp/i...

Any guidance on what to try next would be appreciated. I have searched extensively online but have not found a solution.
P.S. Oddly, if I paste the first link above into Firefox, it also loads the image, which puzzles me.

Code:

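import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// JXDocument comes from the JsoupXpath library; its import line depends on the library version.

public class Image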
{

private String urlNeo="";
public String getUrlNeo() {
    return urlNeo;
}
public void setUrlNeo(String urlNeo) {
    this.urlNeo = urlNeo;
}
public String getHtml() throws Exception{
    ArrayList<String> list=new ArrayList<String>();    
    String line="";
    String Html="";
    URL url=new URL(urlNeo);
    URLConnection connection=url.openConnection();
    InputStream in=connection.getInputStream();
    InputStreamReader isr=new InputStreamReader(in);
    BufferedReader br=new BufferedReader(isr);
    while((line=br.readLine())!=null){
        Html+=line;
        list.add(line);
    }
    br.close();
    isr.close();
    in.close();
    return Html;
}
public String getImgSrc() throws Exception{
    String html=getHtml();
    String IMGURL_REG_xpath="//p[1]/p[2]/p[2]/p/node()";
    String imginfomation="";
    JXDocument jxDocument = new JXDocument(html);
    imginfomation=(jxDocument.sel(IMGURL_REG_xpath).toString()).substring(1,jxDocument.sel(IMGURL_REG_xpath).toString().length() - 1);
    return imginfomation;
}
public List<String> getImgXpath() throws Exception{
    String str="";
    String IMGSRC_REG = "img.product.\\w.*.jpg";
    List<String> list1=new ArrayList<String>();
    List<String> list2=new ArrayList<String>();
    String listimg = getImgSrc();
    Matcher matcher = Pattern.compile(IMGSRC_REG).matcher(listimg);
    while (matcher.find()) {
        list1.add(matcher.group());
    }
    for(int i=1;i<=(list1.size()/2);i++){
        int j=i*2;
        list2.add(list1.get(j-1));
    }
    return list2;
}
public void download(String admin_no) throws Exception{
    List<String> list=new ArrayList<String>();
    list=getImgXpath();
    for(String img:list){
        System.out.println(img);
        String url="http://www.neofactory.co.jp/"+img;
        URL uri=new URL(url);
        URLConnection con=uri.openConnection();
        con.setConnectTimeout(5000);
        InputStream in=con.getInputStream();
        
        byte[] buf=new byte[1024];
        int length=0;            
        File sf=new File("D:\\item_neo_photo\\"+admin_no);
        if(!sf.exists()){
            sf.mkdirs();
        }
        String[] a=img.split("/");
        OutputStream os=new FileOutputStream(sf.getPath()+"\\"+a[a.length-1]);
        
        while((length=in.read(buf))!=-1){
            os.write(buf, 0, length);
        }
        
        os.close();
        in.close();
    }
}

}

黄舟  2692 days ago

Replies (2)

  • 高洛峰  2017-05-17 10:03:58

    Can’t you just put the domain name + the obtained img src attribute together?
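One way to combine the site address with an img src is to resolve the src against the URL of the page it was found on, rather than concatenating strings by hand. The sketch below uses java.net.URI for that. The class name ImgSrcResolver, the page URL, and the file names are hypothetical and only serve as illustration. Note that URI.resolve(String) throws IllegalArgumentException when the src contains characters that are illegal in a URI, such as a raw space, so a src like that has to be encoded first.

```java
import java.net.URI;

// Sketch: resolve an <img> src against the URL of the page it came from.
final class ImgSrcResolver {

    // Both arguments are assumed to be syntactically valid URI strings
    // (no raw spaces or other illegal characters).
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL and image paths, for illustration only.
        String page = "http://www.neofactory.co.jp/shop/item.html";
        // A src starting with "/" is resolved from the site root.
        System.out.println(resolve(page, "/img/product/sample.jpg"));
        // A src without a leading "/" is resolved relative to the page's directory.
        System.out.println(resolve(page, "img/product/sample.jpg"));
    }
}
```

Resolving against the page URL handles root-relative, page-relative, and already absolute src values with the same call, which plain string concatenation does not.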

  • 过去多啦不再A梦  2017-05-17 10:03:58

    URL encoding
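As a minimal sketch of what URL encoding could look like here, assume the 400 error is caused by the unencoded space visible in the first link ("... 2.jpg"). Browsers typically percent-encode such characters before sending the request, which would explain why the same link works in Chrome and Firefox, while java.net.URL sends the string as given. The multi-argument java.net.URI constructor quotes illegal characters in the path, so a space becomes %20. The class name ImageUrlEncoder and the file name below are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: percent-encode an image path before building the download URL.
final class ImageUrlEncoder {

    // Builds "http://<host>/<path>" with illegal path characters quoted.
    // Caveat: this constructor also quotes '%', so a path that is already
    // percent-encoded would be encoded a second time.
    static String encode(String host, String rawPath) throws URISyntaxException {
        String path = rawPath.startsWith("/") ? rawPath : "/" + rawPath;
        return new URI("http", host, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical image path containing a space, for illustration only.
        String encoded = encode("www.neofactory.co.jp", "img/product/sample 2.jpg");
        System.out.println(encoded);
    }
}
```

In the question's download method, the encoded string would take the place of the hand-built url before calling new URL(url). java.net.URLEncoder is not a good fit for this: it targets form data and turns a space into "+" instead of "%20".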
