
How to use Python crawler to get those valuable blog posts

不言 · Original · 2018-04-04 15:06:56 · 1950 views

This article shows how to use a Python crawler to collect valuable blog posts and save them locally. Readers who need this can use it as a reference.



Author: CDA Data Analyst


There are many excellent technical blog articles on CSDN. We can crawl them and save them to the local disk, which makes them convenient to read and study later. Here we write a crawler in Python to do exactly that.


What we want to do: automatically read blog articles, record their titles, and save the articles we like to the computer's hard drive for future study and reference.


The process is roughly divided into the following steps:


  • 1. Find the target URL to crawl;

  • 2. Analyze the web page and find the information you want to save. Here we mainly save the content of blog articles;

  • 3. Clean and organize the crawled information and save it to the local disk.
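The three steps above can be sketched as one small driver function. This is only an outline under stated assumptions: it is written for Python 3, and `crawl` together with the four callables it receives (`fetch`, `find_articles`, `find_body`, `save`) are hypothetical names that do not appear in the attached code.

```python
def crawl(list_url, fetch, find_articles, find_body, save):
    """Run the three steps for one list page and return the saved titles.

    All four callables are hypothetical and supplied by the caller:
      fetch(url) -> str              download a page
      find_articles(html) -> list    (url, title) pairs found on the list page
      find_body(html) -> str         article content of one article page
      save(body, title) -> None      store the article
    """
    saved_titles = []
    # Steps 1 and 2: download the list page and locate the articles on it.
    for article_url, title in find_articles(fetch(list_url)):
        # Step 2 again, this time on the article page itself.
        body = find_body(fetch(article_url))
        # Step 3: hand the cleaned content to the storage step.
        save(body, title)
        saved_titles.append(title)
    return saved_titles


if __name__ == "__main__":
    # In-memory stand-ins, so the flow can be tried without network or disk.
    pages = {"list": "a1|First;a2|Second", "a1": "body one", "a2": "body two"}
    store = {}
    titles = crawl(
        "list",
        fetch=pages.__getitem__,
        find_articles=lambda html: [tuple(item.split("|")) for item in html.split(";")],
        find_body=lambda html: html.upper(),
        save=lambda body, title: store.__setitem__(title, body),
    )
    print(titles, store)
```

Passing the steps in as functions keeps downloading, extraction, and saving independent of each other, which is the same separation the article's three classes aim for.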


Open the CSDN website. As an example, we pick an arbitrary blog list page:

http://blog.csdn.net/u013088062/article/list/1
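A minimal sketch of how such a page could be requested, assuming Python 3 and its standard `urllib.request` module (the attached code targets Python 2.7 and `urllib2` instead). The helper names `list_page_url`, `build_request`, and `fetch` are hypothetical, and the idea that the trailing number in the URL selects a list page is an assumption read off the URL itself.

```python
from urllib.request import Request, urlopen

# Assumption: the trailing number selects the page of the article list.
LIST_URL = "http://blog.csdn.net/u013088062/article/list/{}"


def list_page_url(page_number):
    """Return the URL of one list page of the example blog."""
    return LIST_URL.format(page_number)


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a request that sends a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and decode it as UTF-8 (needs network access)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    request = build_request(list_page_url(1))
    print(request.full_url)
    print(request.get_header("User-agent"))
```

Only `fetch` touches the network; the other two helpers can be checked offline.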


As you can see, the blogger's articles on "C++ Convolutional Neural Network" and other computer science topics are well written.


The crawler code is organized into three classes, each introduced by a "#" comment in the attached code: GetHtmlPage reads the web page content, RePage extracts the desired content with regular expressions, and SaveText saves the text. The full code is attached at the end of the article so that everyone can run it.


Classes belong to Python's object-oriented programming, which is sometimes more convenient than the procedural programming we usually use and is common in large projects. Object-oriented programming is not easy for beginners to master, but with study and practice they gradually move from procedural to object-oriented programming.
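As a small illustration of the difference, the sketch below records article titles first with a plain function and then with a class. It assumes Python 3, and both `record_title` and `TitleRecorder` are hypothetical names used only for this example; they are not part of the attached code.

```python
def record_title(titles, title):
    """Procedural style: the caller owns the list and passes it in each time."""
    if title not in titles:
        titles.append(title)
    return titles


class TitleRecorder:
    """Object-oriented style: the object keeps its own list of titles."""

    def __init__(self):
        self.titles = []

    def record(self, title):
        # Skip titles that were already recorded.
        if title not in self.titles:
            self.titles.append(title)


if __name__ == "__main__":
    seen = []
    record_title(seen, "First post")
    record_title(seen, "First post")

    recorder = TitleRecorder()
    recorder.record("First post")
    recorder.record("First post")

    print(seen, recorder.titles)
```

Both versions do the same work; the class simply bundles the data with the code that updates it, which tends to pay off as a program grows.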


Note in particular that the RePage class uses regular expressions to process the information obtained from the web pages. Each pattern is written as a string and passed to re.findall, as in the GetReText calls in the attached code.

Regular expressions match the content to be crawled, and Python and many other tools support them. Regular expressions have many rules, and each tool applies them in similar ways. Using regular expressions well is an important part of crawling and text mining.
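To make the matching step concrete, here is a minimal sketch of `re.findall` with two capture groups and the `re.S` flag, assuming Python 3. The HTML in `SAMPLE`, the pattern `LINK_TITLE`, and the function `extract_links` are all hypothetical; the markup is invented for illustration and is not taken from a real CSDN page.

```python
import re

# Invented markup for illustration only.
SAMPLE = """
<span class="link_title"><a href="/u013088062/article/details/1">
    First post
</a></span>
<span class="link_title"><a href="/u013088062/article/details/2">
    Second post
</a></span>
"""

# Group 1 captures the link, group 2 the title text.
# re.S lets "." match newlines, so a title may span several lines.
LINK_TITLE = re.compile(
    r'<span class="link_title"><a href="(.+?)">(.+?)</a></span>', re.S
)


def extract_links(page):
    """Return (href, title) pairs, with surrounding whitespace stripped from titles."""
    return [(href, title.strip()) for href, title in LINK_TITLE.findall(page)]


if __name__ == "__main__":
    for href, title in extract_links(SAMPLE):
        print(href, title)
```

With more than one group in the pattern, `re.findall` returns a list of tuples, which is why the attached code can index each result with `ref[0]` and `ref[1]`. The lazy quantifier `.+?` stops at the first closing tag instead of swallowing everything up to the last one.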


The SaveText class saves the information locally, writing each article as an HTML file in a blog folder.
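A minimal sketch of the saving step, assuming Python 3. The functions `safe_filename` and `save_article` are hypothetical and are not the article's SaveText class; cleaning the title before using it as a file name is an addition here, since blog titles can contain characters that file systems reject.

```python
import os
import re


def safe_filename(title):
    """Replace characters that are not allowed in file names with spaces."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', " ", title).strip()
    return cleaned or "untitled"


def save_article(text, title, directory="blog"):
    """Append text to <directory>/<title>.html and return the file path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, safe_filename(title) + ".html")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text)
    return path


if __name__ == "__main__":
    print(save_article("<p>Hello</p>\n", "First post"))
```

Opening the file in append mode mirrors the attached code, where several extracted fragments of the same article are written to one file.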


Crawler code written in Python is concise and efficient. This article covers only the most basic usage of a crawler. Interested readers can download the code and take a look; I hope you get something out of it.

Attached is the relevant Python code:


    # -*- coding: UTF-8 -*-
    # Purpose: read blog articles, record titles, and save article content in HTML format
    # Version: Python 2.7.13
    import os
    import re
    import sys
    import urllib2


    # Function: read web page content
    class GetHtmlPage():
        # Note the case
        def __init__(self, strPage):
            self.strPage = strPage

        # Get the web page; returns the decoded text, or None if the request fails
        def GetPage(self):
            req = urllib2.Request(self.strPage)  # Create page request
            # Send a browser-like User-Agent header
            req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
            try:
                cn = urllib2.urlopen(req)     # Request the page
                page = cn.read()              # Read the page
                uPage = page.decode("utf-8")  # Web page encoding
                cn.close()
                return uPage
            except urllib2.HTTPError, e:      # Catch exception; HTTPError subclasses URLError, so it comes first
                print 'HTTP Error:', e.code
            except urllib2.URLError, e:       # Catch exception
                print 'URLError:', e.reason
            return None


    # Regular expression, get the desired content
    class RePage():
        # Extract content with a regular expression and return a list
        def GetReText(self, page, recode):
            rePage = re.findall(recode, page, re.S)
            return rePage


    # Save text
    class SaveText():
        def Save(self, text, title):
            # Create the output folder if it does not exist yet
            if not os.path.isdir("blog"):
                os.makedirs("blog")
            t = os.path.join("blog", title + ".html")
            # Append, because one article may be written in several pieces
            with open(t, "a") as f:
                f.write(text)


    if __name__ == "__main__":
        s = SaveText()
        # File encoding: make implicit str/unicode conversions use UTF-8
        reload(sys)
        sys.setdefaultencoding("utf-8")  # Set the system default encoding
        # Get the web page
        page = GetHtmlPage("http://blog.csdn.net/u013088062/article/list/1")
        htmlPage = page.GetPage()
        if htmlPage is None:
            sys.exit(1)
        # Extract content
        reServer = RePage()
        # ASSUMPTION: the tags inside this pattern were partly lost in the source listing;
        # the <a href="(.+?)"> ... </a> part is a reconstruction, adapt it to the real markup.
        reBlog = reServer.GetReText(htmlPage, r'<span class="link_title"><a href="(.+?)">.*?(\s.+?)</a></span>')  # Get the URL link and title
        # Go down to get the text
        for ref in reBlog:
            pageHead = "http://blog.csdn.net/"  # Add link header
            strPage = pageHead + ref[0]
            # Remove the pinned-post marker from the title ('[Top]' as printed in the
            # translated listing; the original page likely used the Chinese label)
            title = ref[1].replace('<span style="color:#FF0000;">[Top]</span>', "")
            title = title.replace("\r\n", "").strip()
            # Get the text
            articlePage = GetHtmlPage(strPage)
            htmlPageData = articlePage.GetPage()
            if htmlPageData is None:
                continue
            # ASSUMPTION: the tags around (.+?) were stripped from the source listing;
            # this container pattern is a placeholder, adapt it to the real markup.
            reBlogText = reServer.GetReText(htmlPageData, '<div id="article_content" class="article_content">(.+?)</div>')
            # Save file
            for s1 in reBlogText:
                s1 = '\n' + s1
                s.Save(s1, title)

