
Python newbie wants to build a simple crawler, looking for a tutorial

I'm a Python newbie and want to build a simple crawler. Can anyone recommend a tutorial? P.S. What languages do companies generally use for crawling/scraping work?

PHP中文网 · 2742 days ago · 1284

All replies (21)

  • PHP中文网 (2017-04-17 14:29:26)

    • To crawl content you generally just make HTTP requests; +1 for the requests library.
    • Once you have the page downloaded, it's a matter of string processing to pull out the information you want; beautifulsoup, regular expressions, or str.find() all work.

    For ordinary web pages, those two points are enough (see the sketch below). For sites that load content via ajax you may not find what you want in the raw HTML; it is often easier to locate the underlying API and request that directly.
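
    To make that concrete, here is a minimal sketch of those two steps using requests and beautifulsoup (the URL and the "h2 a" selector are placeholders for illustration, not something from this thread):

    import requests
    from bs4 import BeautifulSoup

    def fetch_titles(url):
        # Step 1: make the HTTP request (a browser-like User-Agent avoids some trivial blocks)
        headers = {'User-Agent': 'Mozilla/5.0'}
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()

        # Step 2: parse the HTML and pull out the bits you care about
        soup = BeautifulSoup(resp.text, 'html.parser')
        return [a.get_text(strip=True) for a in soup.select('h2 a')]

    if __name__ == '__main__':
        for title in fetch_titles('http://example.com/'):
            print(title)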

  • 高洛峰 (2017-04-17 14:29:26)

    A tutorial I put together back when I was learning this:

    Python crawler tutorial

  • 高洛峰 (2017-04-17 14:29:26)

    Here is a scraping script the OP can use directly. It fetches the Douban ID and title of each movie currently showing on Douban. The script depends on the beautifulsoup library, which needs to be installed first: Beautifulsoup Chinese documentation

    Addendum: if the OP wants a real crawler that can crawl a whole site, or customise which pages get crawled, I'd recommend studying scrapy.

    Python sample code:

    #!/usr/bin/env python
    #coding:UTF-8
    
    import urllib2
    import traceback
    
    from bs4 import BeautifulSoup  # the "lxml" parser used below also requires lxml to be installed
    
    def fetchNowPlayingDouBanInfo():
        doubaninfolist = []
    
        try:
            # To use a proxy, uncomment the three lines below
    #         proxy_handler = urllib2.ProxyHandler({"http" : '172.23.155.73:8080'})
    #         opener = urllib2.build_opener(proxy_handler)
    #         urllib2.install_opener(opener)      
    
            url = "http://movie.douban.com/nowplaying/beijing/"
    
            # Set the HTTP User-Agent header
            useragent = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
            req = urllib2.Request(url, headers=useragent)    
    
            page = urllib2.urlopen(req, timeout=10)
            html_doc = page.read()
    
            soup = BeautifulSoup(html_doc, "lxml")
    
            try:
    
                nowplaying_ul = soup.find("div", id="nowplaying").find("ul", class_="lists")
    
                lilist = nowplaying_ul.find_all("li", class_="list-item")
                for li in lilist:
                    doubanid = li["id"]
                    title = li["data-title"]
    
                    doubaninfolist.append({"douban_id" : doubanid, "title" : title, "coverinfolist" : [] })
    
            except TypeError, e:
                print('(%s)TypeError: %s.!' % (url, traceback.format_exc()))
            except Exception:
                print('(%s)generic exception: %s.' % (url, traceback.format_exc()))
    
        except urllib2.HTTPError, e:
            print('(%s)http request error code - %s.' % (url, e.code))
        except urllib2.URLError, e:
            print('(%s)http request error reason - %s.' % (url, e.reason))
        except Exception:
            print('(%s)http request generic exception: %s.' % (url, traceback.format_exc()))
    
        return doubaninfolist
    
    if __name__ == "__main__":
        doubaninfolist = fetchNowPlayingDouBanInfo()
        print doubaninfolist
    

  • 巴扎黑 (2017-04-17 14:29:26)

    For simple crawlers that don't need a framework, check out the requests and beautifulsoup libraries. If you're familiar with Python syntax, after reading up on those two you can pretty much write a simple crawler.


    As for what companies generally use for crawlers: the ones I've seen mostly use Java or Python.

  • 大家讲道理 (2017-04-17 14:29:26)

    Just search Baidu for "python + crawler".

  • 高洛峰 (2017-04-17 14:29:26)

    For a simple crawler, pick the simplest framework that works in practice and look at the introductory posts online.
    I recommend scrapy; a minimal example is sketched below.
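
    For reference, a minimal scrapy spider looks roughly like this (the demo site and selectors come from the scrapy tutorial's quotes.toscrape.com example and are only an illustration):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # yield one item per quote block on the page
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                }
            # follow the "next page" link, if there is one
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    Save it as quotes_spider.py and run it with "scrapy runspider quotes_spider.py -o quotes.json"; scrapy takes care of scheduling, retries and output for you.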

  • PHP中文网 (2017-04-17 14:29:26)

    There are indeed plenty of articles online about writing a simple crawler in Python, but most of them are really just toy examples; very few can be applied as-is. The way I see it, a crawler just gets the content, analyses it, and stores it (a rough sketch of that loop follows below). If you're new to this, Googling is enough to get started; if you want to dig deeper, find some crawler code on GitHub and study it.

    I only know a little Python myself; I hope this helps.
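
    A rough sketch of that get / analyse / store loop, using requests, beautifulsoup and the standard-library sqlite3 module (the URL, selector and table layout here are made up for illustration):

    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    def crawl_and_store(url, dbfile='crawl.db'):
        # 1. get the content
        html = requests.get(url, timeout=10).text

        # 2. analyse it: here we just collect every link's text and href
        soup = BeautifulSoup(html, 'html.parser')
        rows = [(a.get_text(strip=True), a.get('href', '')) for a in soup.find_all('a')]

        # 3. store it
        conn = sqlite3.connect(dbfile)
        conn.execute('CREATE TABLE IF NOT EXISTS links (title TEXT, href TEXT)')
        conn.executemany('INSERT INTO links VALUES (?, ?)', rows)
        conn.commit()
        conn.close()

    if __name__ == '__main__':
        crawl_and_store('http://example.com/')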

  • 怪我咯 (2017-04-17 14:29:26)

    You can take a look at my scrapy materials.

  • 天蓬老师 (2017-04-17 14:29:26)

    Scrapy saves you a lot of time
    There are many examples on github

  • 迷茫 (2017-04-17 14:29:26)

    Here's some code that crawls Tmall (it's a method from a larger crawler class, so helpers such as writelog, writeInfoLog and insertSqlite are defined elsewhere):

    def areaFlow(self, parturl, tablename, date):
        while True:
            url = parturl + self.lzSession + '&days=' + str(date) + '..' + str(date)
            print url
            try:
                html = urllib2.urlopen(url, timeout=30)
            except Exception, ex:
                writelog(str(ex))
                writelog(str(traceback.format_exc()))
                break
            responegbk = html.read()
            try:
                # the response is GBK-encoded (hence the variable name),
                # so decode it first and re-encode as UTF-8
                respone = responegbk.decode('gbk').encode('utf8')
            except Exception, ex:
                writelog(str(ex))
            # if lzSession has expired, the server returns errcode:500
            if respone.find('"errcode":500') != -1:
                print 'nodata'
                break
            # if the date is wrong, the server returns errcode:100
            elif respone.find('"errcode":100') != -1:
                print 'login error'
                self.catchLzsession()
            else:
                try:
                    resstr = re.findall(r'(?<=\<)(.*?)(?=\/>)', respone, re.S)
                    writelog('地域名称    浏览量    访问量')  # area name / page views / visitor count
                    dictitems = []
                    for iarea in resstr:
                        items = {}
                        areaname = re.findall(r'(?<=name=\\")(.*?)(?=\\")', iarea, re.S)
                        flowamount = re.findall(r'(?<=浏览量:)(.*?)(?=&lt)', iarea, re.S)
                        visitoramount = re.findall(r'(?<=访客数:)(.*?)(?=\\")', iarea, re.S)
                        print '%s %s %s' % (areaname[0], flowamount[0], visitoramount[0])
                        items['l_date'] = str(self.nowDate)
                        items['vc_area_name'] = str(areaname[0])
                        items['i_flow_amount'] = str(flowamount[0].replace(',', ''))
                        items['i_visitor_amount'] = str(visitoramount[0].replace(',', ''))
                        items['l_catch_datetime'] = str(self.nowTime)
                        dictitems.append(items)
                    writeInfoLog(dictitems)
                    insertSqlite(self.sqlite, tablename, dictitems)
                    break
                except Exception, ex:
                    writelog(str(ex))
                    writelog(str(traceback.format_exc()))
            time.sleep(1)
    
