PHP中文网2017-04-17 14:29:26
str.find() and the like are all acceptable. For ordinary web pages, the above two points are enough. For sites that load content via AJAX you may not be able to scrape what you want from the HTML alone; finding the site's underlying API is usually more convenient.
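For example, a very rough str.find() sketch (the URL and the <title> markers are placeholders for illustration, not taken from any answer above):
# Rough illustration only: fetch a page and pull out one field with str.find();
# the URL and the <title> markers are placeholders, not tied to a particular site.
import urllib2

html = urllib2.urlopen('http://example.com/', timeout=10).read()
start = html.find('<title>')
end = html.find('</title>', start)
if start != -1 and end != -1:
    print html[start + len('<title>'):end]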
高洛峰2017-04-17 14:29:26
A tutorial I compiled back when I was learning this:
Python crawler tutorial
高洛峰2017-04-17 14:29:26
Here is a scraping script the OP can use directly. It fetches the Douban ID and title of every movie currently showing on Douban. The script depends on the BeautifulSoup library, which needs to be installed first: BeautifulSoup Chinese documentation
Supplement: if the OP wants to build a real crawler that can crawl a whole site, or crawl specified pages in a customizable way, I recommend studying Scrapy.
Python sample code:
#!/usr/bin/env python
#coding:UTF-8

import urllib2
import traceback

from bs4 import BeautifulSoup


def fetchNowPlayingDouBanInfo():
    doubaninfolist = []

    try:
        # Uncomment the following lines to go through a proxy
        # proxy_handler = urllib2.ProxyHandler({"http" : '172.23.155.73:8080'})
        # opener = urllib2.build_opener(proxy_handler)
        # urllib2.install_opener(opener)

        url = "http://movie.douban.com/nowplaying/beijing/"

        # Set the HTTP User-Agent header
        useragent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        req = urllib2.Request(url, headers=useragent)

        page = urllib2.urlopen(req, timeout=10)
        html_doc = page.read()

        soup = BeautifulSoup(html_doc, "lxml")

        try:
            # The currently-showing movies live in <div id="nowplaying"><ul class="lists">...
            nowplaying_ul = soup.find("div", id="nowplaying").find("ul", class_="lists")
            lilist = nowplaying_ul.find_all("li", class_="list-item")
            for li in lilist:
                doubanid = li["id"]
                title = li["data-title"]
                doubaninfolist.append({"douban_id": doubanid, "title": title, "coverinfolist": []})
        except TypeError, e:
            print('(%s)TypeError: %s.!' % (url, traceback.format_exc()))
        except Exception:
            print('(%s)generic exception: %s.' % (url, traceback.format_exc()))

    except urllib2.HTTPError, e:
        print('(%s)http request error code - %s.' % (url, e.code))
    except urllib2.URLError, e:
        print('(%s)http request error reason - %s.' % (url, e.reason))
    except Exception:
        print('(%s)http request generic exception: %s.' % (url, traceback.format_exc()))

    return doubaninfolist


if __name__ == "__main__":
    doubaninfolist = fetchNowPlayingDouBanInfo()
    print doubaninfolist
巴扎黑2017-04-17 14:29:26
For simple crawlers that don't need a framework, look at the requests and BeautifulSoup libraries. If you are familiar with Python syntax, after reading those two you can pretty much write a simple crawler.
As for the crawlers companies actually run, the ones I have seen are mostly written in Java or Python.
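For instance, a minimal requests + BeautifulSoup sketch; it reuses the Douban now-playing page from the answer above, and the #nowplaying / li.list-item selectors are assumptions borrowed from that script:
# Minimal requests + BeautifulSoup sketch; the selectors are assumptions
# borrowed from the Douban script earlier in this thread.
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://movie.douban.com/nowplaying/beijing/',
                    headers={'User-Agent': 'Mozilla/5.0'},
                    timeout=10)
soup = BeautifulSoup(resp.text, 'lxml')
for li in soup.select('#nowplaying li.list-item'):
    print li.get('id'), li.get('data-title')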
高洛峰2017-04-17 14:29:26
For a simple crawler built on the simplest practical framework, take a look at the introductory posts online.
I recommend Scrapy.
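A minimal Scrapy spider sketch, again reusing the Douban now-playing page from the earlier answer (the selectors are assumptions about that page's markup):
# Minimal Scrapy spider sketch; run with: scrapy runspider douban_nowplaying.py -o out.json
import scrapy

class NowPlayingSpider(scrapy.Spider):
    name = 'douban_nowplaying'
    start_urls = ['http://movie.douban.com/nowplaying/beijing/']

    def parse(self, response):
        # the #nowplaying / li.list-item markup is assumed from the script above
        for li in response.css('#nowplaying li.list-item'):
            yield {
                'douban_id': li.css('::attr(id)').extract_first(),
                'title': li.css('::attr(data-title)').extract_first(),
            }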
PHP中文网2017-04-17 14:29:26
There are indeed many articles online about writing a simple crawler in Python, but most of them are only toy examples, and very few can actually be applied as-is. I think crawling boils down to fetching the content, parsing it, and then storing it. If you are new to it, just Google around; if you want to dig deeper, look for crawler code on GitHub and read it.
I only know a little Python; I hope this helps.
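In that spirit, a minimal fetch / parse / store sketch (the URL, table name, and columns are made up purely for illustration):
# Fetch, parse, store in three small steps; the URL, table name and columns
# are made up purely for illustration.
import sqlite3
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://example.com/', timeout=10).read()     # 1. fetch
soup = BeautifulSoup(html, 'lxml')                                    # 2. parse
rows = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]

conn = sqlite3.connect('crawl.db')                                    # 3. store
conn.execute('CREATE TABLE IF NOT EXISTS links (text TEXT, href TEXT)')
conn.executemany('INSERT INTO links (text, href) VALUES (?, ?)', rows)
conn.commit()
conn.close()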
迷茫2017-04-17 14:29:26
Posting some code that scrapes Tmall:
# Method excerpt from a larger crawler class: it relies on helpers (writelog,
# writeInfoLog, insertSqlite) and attributes (self.lzSession, self.nowDate,
# self.nowTime, self.sqlite) defined elsewhere in that class.
def areaFlow(self, parturl, tablename, date):
    while True:
        url = parturl + self.lzSession + '&days=' + str(date) + '..' + str(date)
        print url
        try:
            html = urllib2.urlopen(url, timeout=30)
        except Exception, ex:
            writelog(str(ex))
            writelog(str(traceback.format_exc()))
            break
        responegbk = html.read()
        try:
            # The response is GBK-encoded; convert it to UTF-8
            respone = responegbk.decode('gbk').encode('utf-8')
        except Exception, ex:
            writelog(str(ex))
        # If lzSession has expired, the response contains errcode:500
        if respone.find('"errcode":500') != -1:
            print 'nodata'
            break
        # If the date is wrong, the response contains errcode:100
        elif respone.find('"errcode":100') != -1:
            print 'login error'
            self.catchLzsession()
        else:
            try:
                resstr = re.findall(r'(?<=\<)(.*?)(?=\/>)', respone, re.S)
                writelog('地域名称 浏览量 访问量')
                dictitems = []
                for iarea in resstr:
                    items = {}
                    # In the returned markup, 浏览量 = page views and 访客数 = visitors
                    areaname = re.findall(r'(?<=name=\\")(.*?)(?=\\")', iarea, re.S)
                    flowamount = re.findall(r'(?<=浏览量:)(.*?)(?=<)', iarea, re.S)
                    visitoramount = re.findall(r'(?<=访客数:)(.*?)(?=\\")', iarea, re.S)
                    print '%s %s %s' % (areaname[0], flowamount[0], visitoramount[0])
                    items['l_date'] = str(self.nowDate)
                    items['vc_area_name'] = str(areaname[0])
                    items['i_flow_amount'] = str(flowamount[0].replace(',', ''))
                    items['i_visitor_amount'] = str(visitoramount[0].replace(',', ''))
                    items['l_catch_datetime'] = str(self.nowTime)
                    dictitems.append(items)
                writeInfoLog(dictitems)
                insertSqlite(self.sqlite, tablename, dictitems)
                break
            except Exception, ex:
                writelog(str(ex))
                writelog(str(traceback.format_exc()))
        time.sleep(1)