Home  >  Article  >  Backend Development  >  Python使用urllib2模块抓取HTML页面资源的实例分享

Python使用urllib2模块抓取HTML页面资源的实例分享

WBOY
WBOYOriginal
2016-06-10 15:05:051086browse

先把要抓取的网络地址列在单独的list文件中

http://www.jb51.net/article/83440.html
http://www.jb51.net/article/83437.html
http://www.jb51.net/article/83430.html
http://www.jb51.net/article/83449.html

然后我们来看程序操作,代码如下:

#!/usr/bin/python

import os
import sys
import urllib2
import re

def Cdown_data(fileurl, fpath, dpath):
 if not os.path.exists(dpath):
  os.makedirs(dpath)
 try:
  getfile = urllib2.urlopen(fileurl) 
  data = getfile.read()
  f = open(fpath, 'w')
  f.write(data)
  f.close()
 except:
 print 

with open('u1.list') as lines:
 for line in lines:
  URI = line.strip()
  if '?' and '%' in URI:
   continue
 elif URI.count('/') == 2:
   continue
  elif URI.count('/') > 2:
   #print URI,URI.count('/')
  try:
    dirpath = URI.rpartition('/')[0].split('//')[1]
    #filepath = URI.split('//')[1].split('/')[1]
    filepath = URI.split('//')[1]
   if filepath:
     print URI,filepath,dirpath
     Cdown_data(URI, filepath, dirpath)
   except:
    print URI,'error'

原文网址为:http://www.diyoms.com/python/1806.html
Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn