The content of this article is how to use Python crawler to obtain those valuable blog posts. Now I share it with everyone. Friends in need can refer to the content of this article
Author CDA Data Analyst
There are many wonderful technical blog articles on CSDN, We can crawl it and save it on the local disk, which is very convenient for reading and learning later. Now we will use python to write a crawler code to achieve this purpose.
What we want to do: automatically read blog articles, record titles, and save favorite articles to personal Save it on your computer hard drive for future study reference.
The process is roughly divided into the following steps:
1. Find the crawled target URL;
2. Analyze the web page, Find the information you want to save. Here we mainly save the content of blog articles;
3. Clean and organize the crawled information and save it on the local disk. .
Open the csdn web page. As an example, we randomly open a web page:
http://blog.csdn.net/u013088062/article/list/1.
It can be seen that the blogger is very interested in "C++ Convolutional Neural Network" and other articles about computer science. Well written.
##
The crawler code is divided into three categories (classes) according to the idea. The following three with "#" give the beginning of each class respectively (the specific code is attached for everyone to actually run and implement):
##The "class" method belongs to Python's object-oriented programming, which is sometimes better than the process-oriented programming we usually use. Convenient, object-oriented programming is often used in large projects. For beginners, object-oriented programming is not easy to master, but after learning and getting used to it, they will gradually transition from process-oriented to object-oriented programming.
Special note is that the RePage class mainly uses regular expressions to process information obtained from web pages. Regular expressions Set the string style as follows:
## Use regular expressions to match the content to be crawled, which can be achieved using Python and other software tools. There are many rules for regular expressions, and each software uses them in similar ways. Making good use of regular expressions is an important part of crawlers and text mining.
## Use python to write crawler code, which is concise and efficient. This article only explains the most basic usage of the crawler. Interested friends can download the code and take a look. I hope you will gain something from it. Attached is the relevant Python code: ##(.+?) ') Related recommendations: Development case of php implementing a simple crawler Python crawler browser identification library Record a simple Python crawler instance <span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 1</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#-*-coding:UTF-8-*-</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 2</span><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">import</span> re<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 3</span><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">import</span> urllib2<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 4</span><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">import</span> sys<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 5</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Purpose : Read blog articles, record titles, and save article content in Htnl format </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 6</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"># Version: python2.7.13</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 7</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Function: Read web page content</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 8</span><span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">class</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">GetHtmlPage</span><span style="font-size:inherit;color:inherit;line-height:inherit;">()</span>: </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 9</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Note the case</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">10</span> <span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">def</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">__init__</span> <span style="font-size:inherit;color:inherit;line-height:inherit;">(self,strPage)</span>:</span><br>##11<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>##13<br> = URLLIB2.Request (Self.StRPAPGE) <span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"># Create page request </span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span> 11 15 #Ate "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0"<br>)<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">16</span> to urllib2. ## <span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"># Web page encoding</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">20</span> #except<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"> urllib2.URLError, e: </span>#Catch exception<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span>23<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"> code</span><br>24<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span><span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">26</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">print</span> <br>'HTTP Error:'<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> + e.reason</span><br>27<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>##28<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> </span>return<br> rePage<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span>29<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"></span>#Regular expression, get the desired content<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">30</span><span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">class</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">RePage</span><span style="font-size:inherit;color:inherit;line-height:inherit;">()</span>:</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">31</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"># Regular expression extracts content and returns linked list </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">32</span> <span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">def</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">GetReText</span><span style="font-size:inherit;color:inherit;line-height:inherit;">(self,page,recode)</span>:</span><br>##33<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> rePage = re.findall(recode,page,re.S)</span><br>34<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>return<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> rePage</span> <br>35<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span>#Save text<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span><br>36<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span><span style="font-size:inherit;color:inherit;line-height:inherit;">class<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> </span>SaveText<span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);"></span>()<span style="font-size:inherit;color:inherit;line-height:inherit;">:</span></span><br>37<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span><span style="font-size:inherit;color:inherit;line-height:inherit;">def<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> </span>Save<span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);"></span>(self,text,tilte)<span style="font-size:inherit;color:inherit;line-height:inherit;">:</span></span><br>38<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>try<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">:</span><br>39<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> t=</span>"blog\\"<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">+tilte+</span>".html"<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);"></span><br>40<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> because 42</span> <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);"></span>45<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">if</span> __name__ == <br>"__main__"<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">:</span><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">46</span> s = SaveText()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">47</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#File encoding</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">48</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Character decoding correctly</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">49</span> reload(sys)<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">50</span> sys.setdefaultencoding( <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"utf-8"</span> ) <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the default encoding of the system </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">51</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the web page</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">52</span> page = GetHtmlPage(<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"http://blog.csdn.net/ u013088062/article/list/1"</span>)<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">53</span> htmlPage = page.GetPage()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">54</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Extract content</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">55</span> reServer = RePage()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">56</span> reBlog = reServer.GetReText(htmlPage,<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">r'<span class="link_title">.*?(\s.+?)</span>'</span>) <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the URL link and title</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">57</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Go down to get the text</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">58</span> <span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">for</span> ref <span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">in</span> reBlog:<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">59</span> pageHeard = <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"http://blog.csdn.net/"</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Add link header</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">60</span> strPage = pageHeard+ref[<span style="font-size:inherit;line-height:inherit;color:rgb(140,208,211);">0</span>]<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">61</span> tilte=ref[<span style="font-size:inherit;line-height:inherit;color:rgb(140,208,211);">1 </span>].replace(<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">'<span style="color:#FF0000;">[Top]</span>'</span>, <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">""</span>) <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Use the replacement function to remove complicated English</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">62</span> tilte=tilte.replace(<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"\r\n"</span>,<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">""</span>).lstrip().rstrip()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">63</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the text</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">64</span> htmlPage = GetHtmlPage(strPage)<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">65</span> htmlPageData = htmlPage.GetPage()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">66</span> reBlogText = reServer.GetReText(htmlPageData,<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">'</span></span>
<br>67<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>#Save file<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span><br>68<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>for<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> s1 </span>in<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> reBlogText :</span><br>69<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> s1=</span>'\n'<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">+s1</span><br>70<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> s.Save(s1,tilte)</span><br>
The above is the detailed content of How to use Python crawler to get those valuable blog posts. For more information, please follow other related articles on the PHP Chinese website!

TomergelistsinPython,youcanusethe operator,extendmethod,listcomprehension,oritertools.chain,eachwithspecificadvantages:1)The operatorissimplebutlessefficientforlargelists;2)extendismemory-efficientbutmodifiestheoriginallist;3)listcomprehensionoffersf

In Python 3, two lists can be connected through a variety of methods: 1) Use operator, which is suitable for small lists, but is inefficient for large lists; 2) Use extend method, which is suitable for large lists, with high memory efficiency, but will modify the original list; 3) Use * operator, which is suitable for merging multiple lists, without modifying the original list; 4) Use itertools.chain, which is suitable for large data sets, with high memory efficiency.

Using the join() method is the most efficient way to connect strings from lists in Python. 1) Use the join() method to be efficient and easy to read. 2) The cycle uses operators inefficiently for large lists. 3) The combination of list comprehension and join() is suitable for scenarios that require conversion. 4) The reduce() method is suitable for other types of reductions, but is inefficient for string concatenation. The complete sentence ends.

PythonexecutionistheprocessoftransformingPythoncodeintoexecutableinstructions.1)Theinterpreterreadsthecode,convertingitintobytecode,whichthePythonVirtualMachine(PVM)executes.2)TheGlobalInterpreterLock(GIL)managesthreadexecution,potentiallylimitingmul

Key features of Python include: 1. The syntax is concise and easy to understand, suitable for beginners; 2. Dynamic type system, improving development speed; 3. Rich standard library, supporting multiple tasks; 4. Strong community and ecosystem, providing extensive support; 5. Interpretation, suitable for scripting and rapid prototyping; 6. Multi-paradigm support, suitable for various programming styles.

Python is an interpreted language, but it also includes the compilation process. 1) Python code is first compiled into bytecode. 2) Bytecode is interpreted and executed by Python virtual machine. 3) This hybrid mechanism makes Python both flexible and efficient, but not as fast as a fully compiled language.

Useaforloopwheniteratingoverasequenceorforaspecificnumberoftimes;useawhileloopwhencontinuinguntilaconditionismet.Forloopsareidealforknownsequences,whilewhileloopssuitsituationswithundeterminediterations.

Pythonloopscanleadtoerrorslikeinfiniteloops,modifyinglistsduringiteration,off-by-oneerrors,zero-indexingissues,andnestedloopinefficiencies.Toavoidthese:1)Use'i


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SublimeText3 Linux new version
SublimeText3 Linux latest version

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

Zend Studio 13.0.1
Powerful PHP integrated development environment

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft
