Writing a Simple Weibo Crawler in Python


WBOY

Jun 10, 2016 pm 03:05 PM

python

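A quick aside first: I originally wanted to use the Sina Weibo API to fetch posts, but it turned out to be far too restrictive. Judge for yourself:

It can only fetch posts of the currently authorized user (that is, yourself), and it returns only the latest 5. WTF!
So I dropped that route and switched to scraping the pages directly. The PC version of Weibo loads content dynamically via Ajax, which makes it harder to crawl, so I backed off and crawled the mobile version instead. The mobile version is paginated, so all of a user's posts can be collected in one run by walking through the pages, which simplifies the job considerably.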
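The paged approach can be sketched as follows. This is only an illustration, written in Python 3 syntax (the full listing later in the article targets Python 2.7). The helper name `page_urls` is hypothetical; the URL pattern is the one the script itself uses.

```python
def page_urls(user_id, page_count):
    """Yield one URL per result page, numbered from 1 to page_count."""
    for page in range(1, page_count + 1):
        # filter=1 is the query parameter the article's script passes
        yield "http://weibo.cn/u/%d?filter=1&page=%d" % (user_id, page)


# 1234567890 is a made-up user_id used only for this demonstration
for url in page_urls(1234567890, 3):
    print(url)
```

Because every page has a predictable address, the crawler only needs the total page count before it can visit all of them in a plain loop.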

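What the finished script does:

1. Given the user_id of the Weibo user to crawl, it fetches all of that user's posts.
2. The text is saved to a text file named after the user_id, and all full-resolution original images are saved to the weibo_image folder.
How to use it:
First, obtain your own cookie. Only the Chrome procedure is described here.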

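1. Open the mobile version of Sina Weibo in Chrome.
2. Press Option+Command+I to open the developer tools (this is the macOS shortcut; on Windows or Linux use Ctrl+Shift+I).
3. Open the Network tab and tick the Preserve log option.
4. Enter your account name and password and log in to Sina Weibo.

5. Under m.weibo.cn -> Headers -> Cookie, copy the cookie and paste it into the code at the #your cookie placeholder.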
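The value copied in step 5 is a single string of `name=value` pairs separated by semicolons. If individual cookies are wanted rather than one opaque string, it can be split as in the sketch below. This is Python 3, the function name `parse_cookie_header` is hypothetical, and the cookie names in the example are placeholders rather than real Weibo cookies.

```python
def parse_cookie_header(raw):
    """Split a raw Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        # Split at the first "=" only, because values may contain "=" themselves
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Placeholder cookie string, not a real session
print(parse_cookie_header("name_a=abc; name_b=123=456"))
```

A dict of this shape is the form that the `cookies=` argument of `requests.get` accepts.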

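Next, get the user_id of the user you want to crawl. There is not much to explain: open the user's profile page, and the number in the address bar is the user_id.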
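If the profile address has the form the script requests, `http://weibo.cn/u/<number>`, the user_id can also be pulled out programmatically. A minimal sketch in Python 3, assuming that URL form; the function name `extract_user_id` is hypothetical, and profile addresses in other forms are not handled.

```python
import re

# Matches the digits that follow "/u/" in a profile address
_USER_ID_RE = re.compile(r"/u/(\d+)")


def extract_user_id(profile_url):
    """Return the numeric user_id from a /u/<digits> URL, or None if absent."""
    match = _USER_ID_RE.search(profile_url)
    return int(match.group(1)) if match else None


# Made-up address used only for this demonstration
print(extract_user_id("http://weibo.cn/u/1234567890"))
```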

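Save the Python code to a file named weibo_spider.py.
Change to that directory and run python weibo_spider.py user_id on the command line.
If you forget to append the user_id, the script prompts you to enter it when it runs.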
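The "argument or prompt" behaviour can be isolated in a small function. This is a sketch in Python 3, not the article's own code: the name `read_user_id` and the injectable `ask` parameter are assumptions added here so the logic can be exercised without a terminal.

```python
def read_user_id(argv, ask=input):
    """Take user_id from argv[1] if present, otherwise prompt for it."""
    raw = argv[1] if len(argv) >= 2 else ask("Please enter user_id: ")
    return int(raw.strip())


# Simulated command line: python weibo_spider.py 1234567890
print(read_user_id(["weibo_spider.py", "1234567890"]))
```

In a real script the call would be `read_user_id(sys.argv)`, which falls back to the interactive prompt when no argument was given.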

The script then runs to completion.

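A minor issue: in my tests, image downloads sometimes failed. I have not pinned down the cause. It may be network speed, since the connection in my dorm is very unstable, but it could be something else. For that reason the script also writes a text file named userid_imageurls (the user_id followed by _imageurls) to the program's root directory, containing the download links of all crawled images. If many downloads fail, you can import the whole list of links into a download tool such as Xunlei and fetch the images there.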
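Another way to cope with intermittent failures is to retry each download a few times and keep a list of the links that still fail. The sketch below is in Python 3 and is not what the article's script does (it tries each image once). The name `download_all` and its parameters are hypothetical, and the default `fetch` is the standard-library `urllib.request.urlretrieve`.

```python
import time
import urllib.request


def download_all(urls, fetch=urllib.request.urlretrieve, attempts=3, delay=1.0):
    """Try each URL up to `attempts` times; return the URLs that never succeeded."""
    failed = []
    for index, url in enumerate(urls, start=1):
        target = "%d.jpg" % index  # files are named 1.jpg, 2.jpg, ...
        for attempt in range(attempts):
            try:
                fetch(url, target)
                break  # success, move on to the next URL
            except OSError:
                # urllib's network errors derive from OSError
                if attempt + 1 < attempts:
                    time.sleep(delay)
        else:
            # The loop finished without a break: every attempt failed
            failed.append(url)
    return failed
```

The returned list could then be written to a file and handed to a download tool, in the same spirit as the imageurls file described above.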

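Also, my system is OS X El Capitan 10.11.2 and the Python version is 2.7. The dependencies can be installed with sudo pip install XXXX. For specific setup problems, search Stack Overflow; I won't go into them here.

The implementation follows.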

# -*- coding: utf-8 -*-

import re
import sys
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree

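# Python 2 only: make utf-8 the default encoding so unicode text can be written to files
reload(sys)
sys.setdefaultencoding('utf-8')

# Take user_id from the command line, or prompt for it if it was not given
if len(sys.argv) >= 2:
    user_id = int(sys.argv[1])
else:
    user_id = int(raw_input(u"Please enter user_id: "))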

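# Paste the cookie copied from the browser in place of "#your cookie"
cookie = {"Cookie": "#your cookie"}
url = 'http://weibo.cn/u/%d?filter=1&page=1' % user_id

# Read the total number of pages from the input named "mp" on the first page
html = requests.get(url, cookies=cookie).content
selector = etree.HTML(html)
pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])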

result = "" 
urllist_set = set()
word_count = 1
image_count = 1

print u'Crawler is ready...'

for page in range(1, pageNum + 1):

    # Fetch the HTML of this page
    url = 'http://weibo.cn/u/%d?filter=1&page=%d' % (user_id, page)
    page_html = requests.get(url, cookies=cookie).content

    # Extract post text
    selector = etree.HTML(page_html)
    content = selector.xpath('//span[@class="ctt"]')
    for each in content:
        text = each.xpath('string(.)')
        # The first three matches overall are left unnumbered; numbering starts at the fourth
        if word_count >= 4:
            text = "%d :" % (word_count - 3) + text + "\n\n"
        else:
            text = text + "\n\n"
        result = result + text
        word_count += 1

    # Collect image links
    soup = BeautifulSoup(page_html, "lxml")
    urllist = soup.find_all('a', href=re.compile(r'^http://weibo.cn/mblog/oripic', re.I))
    for imgurl in urllist:
        # Request the link and keep the final URL it resolves to (after any redirects)
        urllist_set.add(requests.get(imgurl['href'], cookies=cookie).url)
        image_count += 1

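# Save the text to a file named after user_id in the current directory
word_path = os.getcwd() + '/%d' % user_id
fo = open(word_path, "wb")
fo.write(result)
fo.close()
print u'Text posts crawled'

# Save all image links to <user_id>_imageurls in the current directory
link = ""
fo2 = open(word_path + "_imageurls", "wb")
for eachlink in urllist_set:
    link = link + eachlink + "\n"
fo2.write(link)
fo2.close()
print u'Image links crawled'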


# Images are saved in the weibo_image folder under the current directory
image_path = os.getcwd() + '/weibo_image'
if not urllist_set:
    print u'No images found on these pages'
else:
    if not os.path.exists(image_path):
        os.mkdir(image_path)
    x = 1
    for imgurl in urllist_set:
        temp = image_path + '/%s.jpg' % x
        print u'Downloading image %s' % x
        try:
            urllib.urlretrieve(urllib2.urlopen(imgurl).geturl(), temp)
        except Exception:
            print u"Failed to download image: %s" % imgurl
        x += 1

print u'Original posts crawled: %d in total, saved to %s' % (word_count - 4, word_path)
print u'Images crawled: %d in total, saved to %s' % (image_count - 1, image_path)

That completes a simple Weibo crawler. I hope it helps with your learning.
