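I actually finished this program quite a while ago but never got around to posting it. Since I haven't been too busy lately, I'm sharing it now.
It is built with the BeautifulSoup and urllib2 modules, and the Word output is produced with the python docx module. Installation instructions are all over the web, so I won't repeat them here.
The main feature: log in to Zhihu, fetch the questions and answers in your personal collections, and save them as a Word document so you can read them offline. Images in answers are fetched as well, although that part still has some problems. I'll fix it up when I have time.
Then there are the regular expressions, which are about as bad as they get… shame on me…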
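As one possible alternative to the regular expressions and chained replace() calls used for tag cleanup, here is a minimal sketch based on a parser. It is not what the script does: it assumes Python 3 and its standard-library html.parser module (the script itself targets Python 2), and the names TextExtractor and html_to_text are hypothetical, made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skip <noscript> content, turn <br> into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a <noscript> element

    def handle_starttag(self, tag, attrs):
        if tag == 'noscript':
            self.skip_depth += 1
        elif tag == 'br' and self.skip_depth == 0:
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag == 'noscript' and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <noscript>
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(fragment):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.parts)
```

With something like this, the cleaned text could be passed straight to add_paragraph, and tags such as b, strong, em and blockquote would not each need their own replace() call.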
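Also, for now every answer to a collected question gets saved. When I have time I may change it to save only the first answer, or only the answer shown on the collection page. Otherwise, if you have collected a lot, the resulting Word file will give you a fright. O(∩_∩)O haha~
Logging in may require a captcha. If you are prompted for one, the captcha image will appear in the program's folder; just type in what it shows.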
# -*- coding: utf-8 -*-
# Log in to Zhihu, fetch personal collections, and save them as a Word document
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib
import urllib2
import cookielib
import string
import re
from bs4 import BeautifulSoup
from docx import Document
from docx import *
from docx.shared import Inches
from sys import exit
import os

# A SOCKS proxy is needed when going online from the office
#import socks
#import socket
#socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5,"127.0.0.1",8088)
#socket.socket =socks.socksocket

loginurl='http://www.zhihu.com/login'

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36',}

postdata={
    '_xsrf': 'acab9d276ea217226d9cc94a84a231f7',
    'email': '',
    'password': '',
    'rememberme':'y'
}

if not os.path.exists('myimg'):
    os.mkdir('myimg')
if os.path.exists('123.docx'):
    os.remove('123.docx')
if os.path.exists('checkcode.gif'):
    os.remove('checkcode.gif')

mydoc=Document()
questiontitle=''
#----------------------------------------------------------------------
def dealimg(imgcontent):
    # Download every image in the fragment and add it to the document
    soup=BeautifulSoup(imgcontent)
    try:
        for imglink in soup.findAll('img'):
            if imglink is not None :
                myimg= imglink.get('src')
                #print myimg
                if myimg.find('http')>=0:
                    imgsrc=urllib2.urlopen(myimg).read()
                    imgnamere=re.compile(r'http\S*/')
                    imgname=imgnamere.sub('',myimg)
                    #print imgname
                    with open(u'myimg'+'/'+imgname,'wb') as code:
                        code.write(imgsrc)
                    mydoc.add_picture(u'myimg/'+imgname,width=Inches(1.25))
    except:
        pass
    # Strip the leftover markup so that only text remains
    strinfo=re.compile(r'<noscript>[\s\S]*</noscript>')
    imgcontent=strinfo.sub('',imgcontent)
    strinfo=re.compile(r'<img class[\s\S]*</img>')
    imgcontent=strinfo.sub('',imgcontent)
    #show all
    strinfo=re.compile(r'<a class="toggle-expand[\s\S]*</a>')
    imgcontent=strinfo.sub('',imgcontent)
    strinfo=re.compile(r'<a class=" wrap external"[\s\S]*rel="nofollow noreferrer" target="_blank">')
    imgcontent=strinfo.sub('',imgcontent)
    imgcontent=imgcontent.replace('<i class="icon-external"></i></a>','')
    imgcontent=imgcontent.replace('</b>','').replace('</p>','').replace('<p>','').replace('<p>','').replace('<br>','')
    return imgcontent

def enterquestionpage(pageurl):
    # Open a question page and write its title and answers to the document
    html=urllib2.urlopen(pageurl).read()
    soup=BeautifulSoup(html)
    questiontitle=soup.title.string
    mydoc.add_heading(questiontitle,level=3)
    for div in soup.findAll('div',{'class':'fixed-summary zm-editable-content clearfix'}):
        #print div
        conent=str(div).replace('<div class="fixed-summary zm-editable-content clearfix">','').replace('</div>','')
        conent=conent.decode('utf-8')
        conent=conent.replace('<br/>','\n')
        conent=dealimg(conent)
        # This part got too complicated; when there's time, look for a module that handles HTML
        conent=conent.replace('<div class="fixed-summary-mask">','').replace('<blockquote>','').replace('<b>','').replace('<strong>','').replace('</strong>','').replace('<em>','').replace('</em>','').replace('</blockquote>','')
        mydoc.add_paragraph(conent,style='BodyText3')
        """file=open('222.txt','a')
        file.write(str(conent))
        file.close()"""

def entercollectpage(pageurl):
    # Open one collection page and visit every question listed on it
    html=urllib2.urlopen(pageurl).read()
    soup=BeautifulSoup(html)
    for div in soup.findAll('div',{'class':'zm-item'}):
        h2content=div.find('h2',{'class':'zm-item-title'})
        #print h2content
        if h2content is not None:
            link=h2content.find('a')
            mylink=link.get('href')
            quectionlink='http://www.zhihu.com'+mylink
            enterquestionpage(quectionlink)
            print quectionlink

def loginzhihu():
    postdatastr=urllib.urlencode(postdata)
    '''
    cj = cookielib.LWPCookieJar()
    cookie_support = urllib2.HTTPCookieProcessor(cj)
    opener = urllib2.build_opener(cookie_support,urllib2.HTTPHandler)
    urllib2.install_opener(opener)
    '''
    h = urllib2.urlopen(loginurl)
    request = urllib2.Request(loginurl,postdatastr,headers)
    request.get_origin_req_host
    response = urllib2.urlopen(request)
    #print response.geturl()
    text = response.read()

    collecturl='http://www.zhihu.com/collections'
    req=urllib2.urlopen(collecturl)
    if str(req.geturl())=='http://www.zhihu.com/?next=%2Fcollections':
        print 'login fail!'
        return
    txt=req.read()

    soup=BeautifulSoup(txt)
    count=0
    divs =soup.findAll('div',{'class':'zm-item'})
    # findAll returns an empty list, never None, when nothing matches
    if not divs:
        print 'login fail!'
        return
    print 'login ok!\n'
    for div in divs:
        link=div.find('a')
        mylink=link.get('href')
        collectlink='http://www.zhihu.com'+mylink
        entercollectpage(collectlink)
        print collectlink
        # Used for testing at the time: fetch only one collection
        #count+=1
        #if count==1:
        #    return

def getcheckcode(thehtml):
    # Save the captcha image to checkcode.gif if the page asks for one
    soup=BeautifulSoup(thehtml)
    div=soup.find('div',{'class':'js-captcha captcha-wrap'})
    if div is not None:
        #print div
        imgsrc=div.find('img')
        imglink=imgsrc.get('src')
        if imglink is not None:
            imglink='http://www.zhihu.com'+imglink
            imgcontent=urllib2.urlopen(imglink).read()
            with open('checkcode.gif','wb') as code:
                code.write(imgcontent)
            return True
        else:
            return False
    return False

if __name__=='__main__':
    import getpass
    username=raw_input('input username:')
    password=getpass.getpass('Enter password: ')

    postdata['email']=username
    postdata['password']=password
    postdatastr=urllib.urlencode(postdata)
    cj = cookielib.LWPCookieJar()
    cookie_support = urllib2.HTTPCookieProcessor(cj)
    opener = urllib2.build_opener(cookie_support,urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    h = urllib2.urlopen(loginurl)
    request = urllib2.Request(loginurl,postdatastr,headers)
    response = urllib2.urlopen(request)
    txt = response.read()

    if getcheckcode(txt):
        checkcode=raw_input('input checkcode:')
        postdata['captcha']=checkcode
        loginzhihu()
        mydoc.save('123.docx')
    else:
        loginzhihu()
        mydoc.save('123.docx')

    print 'the end'
    raw_input()
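In dealimg the local file name comes from stripping everything up to the last slash with the regular expression http\S*/, which can misbehave if the image URL carries a query string. The following is a small sketch of another way to derive the name, assuming Python 3's urllib.parse (not the Python 2 modules the script uses). The function name image_name_from_url and the fallback name 'image' are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def image_name_from_url(url, default='image'):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a fixed name when the URL path ends with a slash
    return name or default
```

The returned name could then be joined onto the myimg folder with os.path.join before the image is written and passed to add_picture.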
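That's about it. If you have any suggestions, leave a comment below and I'll reply as soon as I can. My contact details are also on the site's About page, so you can reach me directly.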
