
Notes from Writing a Web Crawler in Python

高洛峰 (original), 2016-11-21 17:05:00

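There are many ways to write a web crawler these days: Node.js, Go, or even PHP would work. I chose Python because it has plenty of tutorials and can be learned systematically. Knowing how to scrape a page with HTML selectors is not enough; I also wanted to learn the common pitfalls of crawling and a few practical tricks, such as changing the headers the client sends.

The code is commented in detail, so you can simply read the source.

The goal of this crawler is simple: from a real-estate website, scrape each property's name and price and download one image per property (purely to test file downloading), so that I can analyze housing price trends later. To avoid putting much load on the other side's server, I only crawl 3 pages.

Here are a few points worth noting: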

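# Remember to change the request headers
I have heard that the headers sent by default identify the client as Python, which makes it easy for the target site to detect the crawler as a bot and ban its IP. So it is best to make the crawler look a bit more like a human visitor. This code only provides basic camouflage, though; a site that really wants to block crawlers will not be fooled. Here is the code: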

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}

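To see why this matters, the sketch below uses only the standard library's urllib.request (which the full script also imports from) to show the User-Agent that a default opener advertises, and then builds a request carrying browser-like headers instead. The helper names default_user_agent and build_request and the example.com URL are illustrative assumptions, not part of the original script, and nothing is sent over the network.

```python
import urllib.request

# Browser-like headers, same values as in the article
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}


def default_user_agent():
    """Return the User-Agent a default urllib opener would send (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs
    return dict(opener.addheaders)["User-agent"]


def build_request(url, headers=BROWSER_HEADERS):
    """Build, but do not send, a request that carries the given headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # The default value names Python, which is what gives a crawler away
    print(default_user_agent())
    req = build_request("http://example.com/")
    # Request stores header names capitalized, hence "User-agent"
    print(req.get_header("User-agent"))
```

With requests, the same dictionary can be passed as the headers argument of session.get, which is what the full script does through its config module.
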
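# HTML selectors: I use pyquery instead of BeautifulSoup
Many books recommend BeautifulSoup, but for someone used to jQuery its syntax feels awkward. It also did not seem to support advanced CSS selectors such as :first-child; or perhaps it does and I just did not find it, since I did not read the documentation very carefully.

I then searched online and found many people recommending the pyquery library. I tried it, found it really comfortable to use, and adopted it without hesitation.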

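# Crawler approach
The approach is actually simple:
1. Find the listing page of a real-estate site and work out the URL structure of its second and third pages.
2. Collect the URLs of all entries on each listing page and store them in a Python set(); a set is used so that duplicate URLs are removed.
3. Follow each property URL to its detail page and scrape the valuable fields, such as images and text.
4. For now I only print the data and do not save it locally as JSON or CSV. That comes later; to be done.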
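
Steps 1 and 2 can be sketched without touching the network. The listing URL pattern below is the one used in the full script; the function names build_list_urls and collect_unique and the sample links are assumptions for illustration.

```python
def build_list_urls(first_page=1, last_page=3):
    """Build listing-page URLs from the paginated URL pattern (hypothetical helper)."""
    template = "http://sh.fang.anjuke.com/loupan/all/p{}/"
    return [template.format(n) for n in range(first_page, last_page + 1)]


def collect_unique(links):
    """Put links into a set so that duplicates collapse; skip missing values."""
    seen = set()
    for link in links:
        if link:  # a missing href would show up as None
            seen.add(link)
    return seen


if __name__ == "__main__":
    print(build_list_urls())
    # Made-up links: the same entry appearing twice is stored once
    print(collect_unique(["/loupan/1.html", "/loupan/2.html", "/loupan/1.html", None]))
```

A set does not preserve order, which is fine here because each detail page is scraped independently.
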

Here is the full code:

# Download files and parse pages
from urllib.request import urlretrieve
from pyquery import PyQuery as pq
# requests lets us set the request headers to mimic a real visitor
import requests
import time
# Operating system helpers (paths, directories)
import os

# Your own configuration: rename config-sample.py to config.py and fill in the values
import config

# Set of detail-page links, so that links are not repeated
pages = set()
session = requests.Session()
baseUrl = 'http://pic1.ajkimg.com'
downLoadDir = 'images'

# Build the URLs of all listing pages (only 3, to keep the load on the server low)
def getAllPages():
    pageList = []
    for i in range(1, 4):
        newLink = 'http://sh.fang.anjuke.com/loupan/all/p' + str(i) + '/'
        pageList.append(newLink)
    return pageList

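# Normalize a link or image source into an absolute URL under baseUrl
def getAbsoluteURL(baseUrl, source):
    # attr() returns None when the element or attribute is missing
    if not source:
        return None
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    # Skip anything that is not hosted under baseUrl
    if baseUrl not in url:
        return None
    return url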

# Adjust the paths in this function to your own setup, to ease importing the data later
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

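# Collect the detail-page links found on one listing page
def getItemLinks(url):
    # First check whether the page can be fetched
    try:
        req = session.get(url, headers=config.value['headers'])
        # Turn HTTP error statuses such as 404 or 500 into exceptions
        req.raise_for_status()
    # RequestException covers connection failures (including DNS errors) and HTTP errors
    except requests.RequestException as e:
        print('can not reach the page. ')
        print(e)
    else:
        h = pq(req.text)
        # All property blocks on this listing page
        houseItems = h('.item-mod')
        # Each block holds the detail-page URL, the price, a thumbnail, etc.
        # I prefer to take only the detail-page URL and get the rest from the detail page
        for houseItem in houseItems.items():
            houseUrl = houseItem.find('.items-name').attr('href')
            # Skip blocks without a link
            if houseUrl:
                pages.add(houseUrl)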
        
# Scrape the fields of a detail page; edit this to extract whatever you need
def getItemDetails(url):
    # First check whether the page can be fetched
    try:
        req = session.get(url, headers=config.value['headers'])
        # Turn HTTP error statuses such as 404 or 500 into exceptions
        req.raise_for_status()
    # RequestException covers connection failures (including DNS errors) and HTTP errors
    except requests.RequestException as e:
        print('can not reach the page. ')
        print(e)
    else:
        # Pause between requests to go easy on the server
        time.sleep(1)
        h = pq(req.text)

        # get title (an empty match gives a falsy value, so fall back to 'none')
        houseTitle = h('h1').text() or 'none'

        # get price
        housePrice = h('.sp-price').text() or 'none'
        print(houseTitle, housePrice)

        # get image url
        houseImage = h('.con a:first-child img').attr('src')
        houseImageUrl = getAbsoluteURL(baseUrl, houseImage)
        if houseImageUrl is not None:
            urlretrieve(houseImageUrl, getDownloadPath(baseUrl, houseImageUrl, downLoadDir))
        # Earlier BeautifulSoup attempt, kept for reference:
        # if bsObj.find('em',{'class','sp-price'}) == None:
        #     housePrice = 'None'
        # else:
        #     housePrice = bsObj.find('em',{'class','sp-price'}).text;
        # if bsObj.select('.con a:first-child .item img')== None:
        #     houseThumbnail = 'None'
        # else:
        #     houseThumbnail = bsObj.select('.con a:first-child .item img');
        


# start to run the code
allPages = getAllPages()

for i in allPages:
    getItemLinks(i)
# pages should now hold many detail-page URLs
for i in pages:
    getItemDetails(i)
# print(pages)
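
Step 4 of the plan leaves saving the data as future work. One possible shape for it, using only the standard library, is sketched below. The field names, the placeholder record, and the helper names write_csv and to_json are assumptions for illustration; the script above does not yet collect its results into records.

```python
import csv
import io
import json

# Assumed column names for one scraped property
FIELDS = ["title", "price", "image_url"]


def write_csv(records, fileobj):
    """Write a list of dicts as CSV rows, with a header line first."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def to_json(records):
    """Serialize the records as JSON, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder record, not real scraped data
    records = [{"title": "Example Garden", "price": "35000",
                "image_url": "http://pic1.ajkimg.com/example.jpg"}]
    buffer = io.StringIO()
    write_csv(records, buffer)
    print(buffer.getvalue())
    print(to_json(records))
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="", encoding="utf-8") to write_csv.
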
