The content of this article is how to use Python crawler to obtain those valuable blog posts. Now I share it with everyone. Friends in need can refer to the content of this article
Author CDA Data Analyst
There are many wonderful technical blog articles on CSDN, We can crawl it and save it on the local disk, which is very convenient for reading and learning later. Now we will use python to write a crawler code to achieve this purpose.
What we want to do: automatically read blog articles, record titles, and save favorite articles to personal Save it on your computer hard drive for future study reference.
The process is roughly divided into the following steps:
1. Find the crawled target URL;
2. Analyze the web page, Find the information you want to save. Here we mainly save the content of blog articles;
3. Clean and organize the crawled information and save it on the local disk. .
Open the csdn web page. As an example, we randomly open a web page:
http://blog.csdn.net/u013088062/article/list/1.
It can be seen that the blogger is very interested in "C++ Convolutional Neural Network" and other articles about computer science. Well written.
##
The crawler code is divided into three categories (classes) according to the idea. The following three with "#" give the beginning of each class respectively (the specific code is attached for everyone to actually run and implement):
##The "class" method belongs to Python's object-oriented programming, which is sometimes better than the process-oriented programming we usually use. Convenient, object-oriented programming is often used in large projects. For beginners, object-oriented programming is not easy to master, but after learning and getting used to it, they will gradually transition from process-oriented to object-oriented programming.
Special note is that the RePage class mainly uses regular expressions to process information obtained from web pages. Regular expressions Set the string style as follows:
## Use regular expressions to match the content to be crawled, which can be achieved using Python and other software tools. There are many rules for regular expressions, and each software uses them in similar ways. Making good use of regular expressions is an important part of crawlers and text mining.
## Use python to write crawler code, which is concise and efficient. This article only explains the most basic usage of the crawler. Interested friends can download the code and take a look. I hope you will gain something from it. Attached is the relevant Python code: ##(.+?) ') Related recommendations: Development case of php implementing a simple crawler Python crawler browser identification library Record a simple Python crawler instance <span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 1</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#-*-coding:UTF-8-*-</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 2</span><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">import</span> re<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 3</span><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">import</span> urllib2<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 4</span><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">import</span> sys<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 5</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Purpose : Read blog articles, record titles, and save article content in Htnl format </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 6</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"># Version: python2.7.13</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 7</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Function: Read web page content</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 8</span><span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">class</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">GetHtmlPage</span><span style="font-size:inherit;color:inherit;line-height:inherit;">()</span>: </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> 9</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Note the case</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">10</span> <span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">def</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">__init__</span> <span style="font-size:inherit;color:inherit;line-height:inherit;">(self,strPage)</span>:</span><br>##11<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>##13<br> = URLLIB2.Request (Self.StRPAPGE) <span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"># Create page request </span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span> 11 15 #Ate "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0"<br>)<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">16</span> to urllib2. ## <span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"># Web page encoding</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">20</span> #except<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"> urllib2.URLError, e: </span>#Catch exception<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span>23<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"> code</span><br>24<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span><span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">26</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">print</span> <br>'HTTP Error:'<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> + e.reason</span><br>27<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>##28<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> </span>return<br> rePage<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span>29<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"></span>#Regular expression, get the desired content<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">30</span><span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">class</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">RePage</span><span style="font-size:inherit;color:inherit;line-height:inherit;">()</span>:</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">31</span><span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"># Regular expression extracts content and returns linked list </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">32</span> <span style="font-size:inherit;color:inherit;line-height:inherit;"><span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">def</span> <span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);">GetReText</span><span style="font-size:inherit;color:inherit;line-height:inherit;">(self,page,recode)</span>:</span><br>##33<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> rePage = re.findall(recode,page,re.S)</span><br>34<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>return<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> rePage</span> <br>35<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span>#Save text<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span><br>36<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"></span><span style="font-size:inherit;color:inherit;line-height:inherit;">class<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> </span>SaveText<span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);"></span>()<span style="font-size:inherit;color:inherit;line-height:inherit;">:</span></span><br>37<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span><span style="font-size:inherit;color:inherit;line-height:inherit;">def<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> </span>Save<span style="font-size:inherit;line-height:inherit;color:rgb(239,239,143);"></span>(self,text,tilte)<span style="font-size:inherit;color:inherit;line-height:inherit;">:</span></span><br>38<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>try<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">:</span><br>39<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> t=</span>"blog\\"<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">+tilte+</span>".html"<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);"></span><br>40<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> because 42</span> <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);"></span>45<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">if</span> __name__ == <br>"__main__"<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">:</span><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">46</span> s = SaveText()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">47</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#File encoding</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">48</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Character decoding correctly</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">49</span> reload(sys)<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">50</span> sys.setdefaultencoding( <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"utf-8"</span> ) <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the default encoding of the system </span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">51</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the web page</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">52</span> page = GetHtmlPage(<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"http://blog.csdn.net/ u013088062/article/list/1"</span>)<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">53</span> htmlPage = page.GetPage()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">54</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Extract content</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">55</span> reServer = RePage()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">56</span> reBlog = reServer.GetReText(htmlPage,<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">r'<span class="link_title">.*?(\s.+?)</span>'</span>) <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the URL link and title</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">57</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Go down to get the text</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">58</span> <span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">for</span> ref <span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);">in</span> reBlog:<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">59</span> pageHeard = <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"http://blog.csdn.net/"</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Add link header</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">60</span> strPage = pageHeard+ref[<span style="font-size:inherit;line-height:inherit;color:rgb(140,208,211);">0</span>]<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">61</span> tilte=ref[<span style="font-size:inherit;line-height:inherit;color:rgb(140,208,211);">1 </span>].replace(<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">'<span style="color:#FF0000;">[Top]</span>'</span>, <span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">""</span>) <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Use the replacement function to remove complicated English</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">62</span> tilte=tilte.replace(<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">"\r\n"</span>,<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">""</span>).lstrip().rstrip()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">63</span> <span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);">#Get the text</span><br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">64</span> htmlPage = GetHtmlPage(strPage)<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">65</span> htmlPageData = htmlPage.GetPage()<br><span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;">66</span> reBlogText = reServer.GetReText(htmlPageData,<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">'</span></span>
<br>67<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>#Save file<span style="font-size:inherit;line-height:inherit;color:rgb(127,159,127);"></span><br>68<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> </span>for<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> s1 </span>in<span style="font-size:inherit;line-height:inherit;color:rgb(227,206,171);"> reBlogText :</span><br>69<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> s1=</span>'\n'<span style="font-size:inherit;line-height:inherit;color:rgb(204,147,147);">+s1</span><br>70<span style="padding-right:20px;font-size:inherit;line-height:inherit;color:rgb(140,208,211);word-spacing:0px;"> s.Save(s1,tilte)</span><br>
The above is the detailed content of How to use Python crawler to get those valuable blog posts. For more information, please follow other related articles on the PHP Chinese website!

To maximize the efficiency of learning Python in a limited time, you can use Python's datetime, time, and schedule modules. 1. The datetime module is used to record and plan learning time. 2. The time module helps to set study and rest time. 3. The schedule module automatically arranges weekly learning tasks.

Python excels in gaming and GUI development. 1) Game development uses Pygame, providing drawing, audio and other functions, which are suitable for creating 2D games. 2) GUI development can choose Tkinter or PyQt. Tkinter is simple and easy to use, PyQt has rich functions and is suitable for professional development.

Python is suitable for data science, web development and automation tasks, while C is suitable for system programming, game development and embedded systems. Python is known for its simplicity and powerful ecosystem, while C is known for its high performance and underlying control capabilities.

You can learn basic programming concepts and skills of Python within 2 hours. 1. Learn variables and data types, 2. Master control flow (conditional statements and loops), 3. Understand the definition and use of functions, 4. Quickly get started with Python programming through simple examples and code snippets.

Python is widely used in the fields of web development, data science, machine learning, automation and scripting. 1) In web development, Django and Flask frameworks simplify the development process. 2) In the fields of data science and machine learning, NumPy, Pandas, Scikit-learn and TensorFlow libraries provide strong support. 3) In terms of automation and scripting, Python is suitable for tasks such as automated testing and system management.

You can learn the basics of Python within two hours. 1. Learn variables and data types, 2. Master control structures such as if statements and loops, 3. Understand the definition and use of functions. These will help you start writing simple Python programs.

How to teach computer novice programming basics within 10 hours? If you only have 10 hours to teach computer novice some programming knowledge, what would you choose to teach...

How to avoid being detected when using FiddlerEverywhere for man-in-the-middle readings When you use FiddlerEverywhere...


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 Chinese version
Chinese version, very easy to use

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

Dreamweaver Mac version
Visual web development tools

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.