python3 crawls WeChat articles
Prerequisites:
Python 3.4
Windows
Function: Search for WeChat articles through Sogou's WeChat search interface, and export the titles and their links to an Excel spreadsheet.
Note: The xlsxwriter module is required. The program was written on 2017/7/11; if it stops working later, the likely cause is a change to the website. The program is fairly simple: a little over 40 lines, not counting comments.
Title:
Idea: Open the initial URL --> Extract the titles and links with a regular expression --> Loop over the result pages, repeating the second step for each page --> Write the collected titles and links to Excel
The first step in writing a crawler is to go through the process by hand (a bit of an aside).
Open the search page mentioned above, enter a keyword such as "image recognition", and search. The URL becomes "". The part marked in red holds the important parameters. type=1 searches for official accounts; ignore it here. query= carries the search keyword, which has already been URL-encoded. There is also a hidden parameter, page=1.
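When you jump to the second page, you can see "", where page=2. So the URL for any page can be built as:
    url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)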
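As a minimal sketch of that URL pattern, the helper below assembles the address for one result page. The function name build_search_url and its search_type parameter are hypothetical names chosen for illustration; the endpoint and the type, query and page parameters come from the article. It calls urllib.parse.quote, which is the same quote() function the article reaches through urllib.request.

```python
from urllib.parse import quote

# Endpoint used throughout the article.
BASE_URL = 'http://weixin.sogou.com/weixin'


def build_search_url(keyword, page=1, search_type=2):
    """Return the search URL for one result page (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the keyword,
    # so spaces and Chinese characters are safe inside the URL.
    return (BASE_URL + '?type=' + str(search_type)
            + '&query=' + quote(keyword)
            + '&page=' + str(page))


print(build_search_url('image recognition', page=2))
# http://weixin.sogou.com/weixin?type=2&query=image%20recognition&page=2
```

Keeping the URL construction in one function means the page loop only has to pass in the page number.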
search is the keyword to search for. Encode it with quote() before inserting it into the URL:
    search = urllib.request.quote(search)
page is used for looping:
    for page in range(1,pagenum+1):
        url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)
The complete URL is now built. Next, request the URL and read the data (create an opener object and add a header):
    import urllib.request
    header = ('User-Agent','Mozilla/5.0')
    opener = urllib.request.build_opener()
    opener.addheaders = [header]
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode()
With the page content in hand, use a regular expression to extract the relevant data:
    import re
    finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
    #finddata = [('',''),('','')]
The data extracted by the regular expression contains an interfering item (in the link: 'amp;') and irrelevant items (in the title: '<...><....>'). Use replace() to remove them:
    title = title.replace('<em><!--red_beg-->','')
    title = title.replace('<!--red_end--></em>','')
    link = link.replace('amp;','')
Save the processed titles and links in a list:
    title_link.append(link)
    title_link.append(title)
That gives us the titles and links from the search. Next, write them to Excel. Create the Excel file first:
    import xlsxwriter
    workbook = xlsxwriter.Workbook(search+'.xlsx')
    worksheet = workbook.add_worksheet('微信')
Write the data in title_link to Excel:
    for i in range(0,len(title_link),2):
        worksheet.write('A'+str(i+1),title_link[i+1])
        worksheet.write('C'+str(i+1),title_link[i])
    workbook.close()
Complete code:
    '''
    python3.4 + windows
    羽凡-2017/7/11-
    Searches WeChat articles and saves the titles and links to Excel
    10-second delay per page to avoid being rate-limited
    import urllib.request,xlsxwriter,re,time
    '''
    import re
    import time
    import urllib.request

    import xlsxwriter

    search = str(input("Search WeChat articles: "))
    pagenum = int(input('Number of pages to search: '))
    workbook = xlsxwriter.Workbook(search+'.xlsx')
    search = urllib.request.quote(search)
    title_link = []

    # Install an opener that sends a browser-like User-Agent header
    header = ('User-Agent','Mozilla/5.0')
    opener = urllib.request.build_opener()
    opener.addheaders = [header]
    urllib.request.install_opener(opener)

    for page in range(1,pagenum+1):
        url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)
        data = urllib.request.urlopen(url).read().decode()
        finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
        #finddata = [('',''),('','')]
        for i in range(len(finddata)):
            title = finddata[i][1]
            title = title.replace('<em><!--red_beg-->','')
            title = title.replace('<!--red_end--></em>','')
            # The title may contain curly quotes
            title = title.replace('“','"')
            title = title.replace('”','"')
            link = finddata[i][0]
            link = link.replace('amp;','')
            title_link.append(link)
            title_link.append(title)
        print('Page '+str(page))
        time.sleep(10)

    worksheet = workbook.add_worksheet('微信')
    worksheet.set_column('A:A',70)
    worksheet.set_column('C:C',100)
    bold = workbook.add_format({'bold':True})
    worksheet.write('A1','Title',bold)
    worksheet.write('C1','Link',bold)
    for i in range(0,len(title_link),2):
        # Data starts on row 2 so the header in row 1 is not overwritten
        row = i//2 + 2
        worksheet.write('A'+str(row),title_link[i+1])
        worksheet.write('C'+str(row),title_link[i])
    workbook.close()
    print('Excel export finished!')
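The extraction and clean-up step can also be tried offline, without requesting the site. The sketch below reuses the article's regular expression but makes two substitutions of its own: it returns (title, link) tuples instead of one flat list, and it decodes entities with html.unescape() rather than replace('amp;',''). The function name parse_results and the SAMPLE markup are hypothetical, written only to match the pattern; the real page markup may differ.

```python
import html
import re

# Same pattern as in the article: group 1 is the link, group 2 the title.
ARTICLE_RE = re.compile(
    '<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>')


def parse_results(page_html):
    """Return a list of (title, link) tuples found in one result page."""
    results = []
    for link, title in ARTICLE_RE.findall(page_html):
        # Drop the highlight markers wrapped around the matched keyword.
        title = title.replace('<em><!--red_beg-->', '')
        title = title.replace('<!--red_end--></em>', '')
        # html.unescape() turns entities such as '&amp;' back into '&'.
        results.append((html.unescape(title), html.unescape(link)))
    return results


# Hypothetical markup, made up to match the pattern above.
SAMPLE = ('<a target="_blank" href="http://example.com/s?a=1&amp;b=2" '
          'id="x" uigs="article_title_0">'
          '<em><!--red_beg-->image<!--red_end--></em> recognition</a>')

print(parse_results(SAMPLE))
# [('image recognition', 'http://example.com/s?a=1&b=2')]
```

With tuples, the Excel loop can iterate over the results with enumerate() instead of stepping through a flat list two items at a time.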