Writing a Zhihu Crawler in Java from Scratch: Storing the Scraped Content Locally (2)

黄舟
2016-12-24 11:50:45

We wrap these two functions in a FileReaderWriter.java file for later use.
Now back to the Zhihu crawler.
We need to add a method to the Zhihu wrapper class that formats its content for writing to a local file.

The code is as follows:

public String writeString() {
    String result = "";
    result += "问题:" + question + "\r\n";
    result += "描述:" + questionDescription + "\r\n";
    result += "链接:" + zhihuUrl + "\r\n";
    for (int i = 0; i < answers.size(); i++) {
        result += "回答" + i + ":" + answers.get(i) + "\r\n";
    }
    result += "\r\n\r\n";
    return result;
}
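
Concatenating with += inside a loop creates a new String on every pass. As a rough sketch of the same formatting built with a StringBuilder, the example below uses a hypothetical, trimmed-down ZhihuSketch class. Only the field names (question, questionDescription, zhihuUrl, answers) come from the code above; the class name, the field types, and the sample values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, trimmed-down stand-in for the Zhihu class (illustration only).
class ZhihuSketch {
    String question = "";
    String questionDescription = "";
    String zhihuUrl = "";
    List<String> answers = new ArrayList<>();

    // Same layout as writeString() above, assembled with a StringBuilder.
    String writeString() {
        StringBuilder result = new StringBuilder();
        result.append("问题:").append(question).append("\r\n");
        result.append("描述:").append(questionDescription).append("\r\n");
        result.append("链接:").append(zhihuUrl).append("\r\n");
        for (int i = 0; i < answers.size(); i++) {
            result.append("回答").append(i).append(":")
                  .append(answers.get(i)).append("\r\n");
        }
        result.append("\r\n\r\n");
        return result.toString();
    }

    public static void main(String[] args) {
        ZhihuSketch zhihu = new ZhihuSketch();
        zhihu.question = "sample question";        // made-up sample data
        zhihu.questionDescription = "sample description";
        zhihu.zhihuUrl = "http://www.zhihu.com/question/0";
        zhihu.answers.add("first answer");
        zhihu.answers.add("second answer");
        System.out.print(zhihu.writeString());
    }
}
```

For a handful of answers the difference is negligible, so the += version is perfectly usable; the StringBuilder form mainly matters when many long answers are joined.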

That is almost everything. Next, replace the System.out.println call in the main method with a loop that writes each entry to a file.

The code is as follows:

// Write to a local file
for (Zhihu zhihu : myZhihu) {
    FileReaderWriter.writeIntoFile(zhihu.writeString(),
            "D:/知乎_编辑推荐.txt", true);
}
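
The FileReaderWriter class itself is not listed in this part. As a minimal sketch of what a method like writeIntoFile could look like, the example below assumes the parameter order implied by the call above (content, file path, append flag). The class name FileWriterSketch, the boolean return value, and the choice of UTF-8 are assumptions, not the original implementation.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for FileReaderWriter; not the original class.
class FileWriterSketch {

    // Writes content to filePath as UTF-8 (encoding is an assumption).
    // append == true adds to the end of the file, false overwrites it.
    // Returns true on success, false if an I/O error occurred.
    static boolean writeIntoFile(String content, String filePath, boolean append) {
        Path path = Paths.get(filePath);
        StandardOpenOption mode = append
                ? StandardOpenOption.APPEND
                : StandardOpenOption.TRUNCATE_EXISTING;
        try (BufferedWriter writer = Files.newBufferedWriter(
                path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, mode)) {
            writer.write(content);
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary file so the demo does not depend on a D: drive.
        Path tmp = Files.createTempFile("zhihu", ".txt");
        writeIntoFile("first entry\r\n", tmp.toString(), false);
        writeIntoFile("second entry\r\n", tmp.toString(), true);
        System.out.print(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

Passing true as the last argument, as the loop above does, is what lets each Zhihu entry be added to the same file instead of replacing the previous one.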

Run it, and the content that was previously printed to the console is now written to a local txt file.

At first glance everything looks fine, but a closer look reveals a problem: the text contains too many HTML tags, such as <br>.
We can process these tags at output time.
First replace each <br> with \r\n, the line break used when writing to the file, and then delete all remaining HTML tags, so the result reads much more clearly.

The code is as follows:

public String writeString() {
    // Build the string that will be written to the local file
    String result = "";
    result += "问题:" + question + "\r\n";
    result += "描述:" + questionDescription + "\r\n";
    result += "链接:" + zhihuUrl + "\r\n";
    for (int i = 0; i < answers.size(); i++) {
        result += "回答" + i + ":" + answers.get(i) + "\r\n\r\n";
    }
    result += "\r\n\r\n\r\n\r\n";
    // Filter out the HTML tags: turn <br> into line breaks, then drop the rest
    result = result.replaceAll("<br>", "\r\n");
    result = result.replaceAll("<.*?>", "");
    return result;
}

String.replaceAll treats its first argument as a regular expression, so the pattern <.*?> matches every remaining tag and the second call deletes them all.
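
To see the two replacements in isolation, here is a small sketch. The class name HtmlTagStripper and the sample input are made up for illustration. It also widens the line-break pattern to cover <br/>, <br /> and upper-case variants, which is an addition to the code above rather than part of it.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original crawler.
class HtmlTagStripper {
    // Matches <br>, <br/> and <br /> in any letter case.
    private static final Pattern LINE_BREAK = Pattern.compile("(?i)<br\\s*/?>");
    // Reluctant quantifier: matches from '<' to the nearest '>' only.
    // A greedy <.*> would swallow everything between the first '<'
    // and the last '>' on a line, including the text between tags.
    private static final Pattern ANY_TAG = Pattern.compile("<.*?>");

    static String strip(String html) {
        // Step 1: turn line-break tags into real line breaks.
        String withBreaks = LINE_BREAK.matcher(html).replaceAll("\r\n");
        // Step 2: delete every remaining tag.
        return ANY_TAG.matcher(withBreaks).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "<b>Hello</b><br>world<BR/>!";
        // Prints "Hello", "world" and "!" on three separate lines.
        System.out.println(strip(sample));
    }
}
```

Because . does not match line terminators by default, a tag that itself spans several lines would survive this pattern. A regex-based filter like this is a quick cleanup for display text, not a full HTML parser.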
