
python3 crawls WeChat articles

巴扎黑 (original), 2017-07-21 13:46:32

Prerequisites:

Python 3.4

Windows

Function: Search for related WeChat articles through Sogou's WeChat search interface and write their titles and links to an Excel workbook.

Note: The xlsxwriter module is required. The program was written on 2017/7/11; if it stops working later, the likely cause is a change made to the website. The program is fairly simple: a little over 40 lines, not counting comments.

Main content:

Approach: Open the initial URL --> extract the titles and links with a regular expression --> repeat the second step for each result page --> write the collected titles and links to Excel

The first step of writing any crawler is to go through the process by hand (a brief aside)

Open Sogou's WeChat search page, enter a keyword such as "image recognition", and search. The URL changes to "", and the part marked in red holds the important parameters. type=1 searches official accounts, which can be ignored here. query= carries the search keywords, already URL-encoded, and there is also a hidden parameter, page=1.

When you jump to the second page you can see "", so the URL for any page is built by appending +search+'&page='+str(page) to the base URL.

search is the keyword to search for. Encode it with quote() before inserting it into the URL:

search = urllib.request.quote(search)
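As a small illustration of what the encoding does, the sketch below calls quote() from urllib.parse, the module where the function is defined (urllib.request only re-exports it). The sample keywords are illustrative and not taken from the original program.

```python
from urllib.parse import quote, unquote

# quote() percent-encodes the UTF-8 bytes of every character that is not URL-safe.
keyword = "图像识别"  # "image recognition", used here only as a sample keyword
encoded = quote(keyword)
print(encoded)

# unquote() reverses the encoding, which helps when checking a URL by eye.
assert unquote(encoded) == keyword

# Spaces become %20; ASCII letters and digits pass through unchanged.
print(quote("image recognition"))  # image%20recognition
```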

page is the loop variable:
for page in range(1,pagenum+1):
    url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)
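The same URLs could also be assembled with urlencode(), which handles the escaping and the "&" separators in one step. This is only a sketch: build_search_urls and BASE_URL are hypothetical names, and type=2 is simply the value the article's own URL uses.

```python
from urllib.parse import urlencode

BASE_URL = "http://weixin.sogou.com/weixin"  # endpoint used in this article

def build_search_urls(keyword, pagenum):
    """Return one search URL per result page, with pages numbered from 1."""
    urls = []
    for page in range(1, pagenum + 1):
        # urlencode() percent-encodes the raw keyword, so it must not be quoted first.
        params = {"type": 2, "query": keyword, "page": page}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls

for url in build_search_urls("python", 2):
    print(url)
```

Note that urlencode() encodes spaces as "+" rather than "%20"; both forms are commonly accepted in query strings.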

The complete URL is now available. Next, request the URL and read the response (create an opener object and attach a User-Agent header through addheaders):

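import urllib.request
# Pretend to be a browser; the default urllib User-Agent may be rejected
header = ('User-Agent','Mozilla/5.0')
opener = urllib.request.build_opener()
opener.addheaders = [header]
# Make this opener the default for every later urlopen() call
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read().decode()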
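An alternative that avoids installing a global opener is to attach the header to each request. The sketch below is one possible way to do that; make_request and fetch are hypothetical helper names, and the 10-second timeout is an assumed value, not something the original program sets.

```python
import urllib.request

def make_request(url, user_agent="Mozilla/5.0"):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, timeout=10):
    """Download a page and return its text, decoded as UTF-8."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

# Building the request does not touch the network, so it can be inspected safely.
req = make_request("http://weixin.sogou.com/weixin?type=2&query=python&page=1")
print(req.get_header("User-agent"))  # Request stores header names in this capitalization
```

Calling fetch(url) would then return the page text in the same way as the opener-based snippet.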
With the page content in hand, use a regular expression to extract the relevant data:

import re
# Group 1 captures the link, group 2 captures the title
finddata = re.compile(r'<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
# finddata is a list of (link, title) tuples: [('',''),('','')]
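To see what the pattern returns, it can be run against a small hand-written snippet. The markup below is hypothetical, shaped to look like one search result per line; it is not real Sogou output, and ARTICLE_RE is a name chosen only for this sketch.

```python
import re

# The pattern from the article, written as a raw string and compiled once.
ARTICLE_RE = re.compile(
    r'<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>'
)

# Hypothetical markup: two results, one per line.
sample = "\n".join([
    '<a target="_blank" href="http://example.com/a?x=1&amp;y=2" id="t0" '
    'uigs="article_title_0"><em><!--red_beg-->Python<!--red_end--></em> tips</a>',
    '<a target="_blank" href="http://example.com/b" uigs="article_title_1">Second title</a>',
])

# findall() returns one (link, title) tuple per match because the pattern has two groups.
matches = ARTICLE_RE.findall(sample)
for link, title in matches:
    print(link, "|", title)
```

The first result still contains "&amp;" in the link and highlight markup in the title, which is why a cleanup step is needed afterwards.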
The data extracted by the regular expression contains an interfering item (in the link: 'amp;') and irrelevant items (in the title: '<...><....>'). Use replace() to remove them:

title = title.replace('<em><!--red_beg-->','')
title = title.replace('<!--red_end--></em>','')
link = link.replace('amp;','')
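A more general cleanup, offered here as an alternative to the chained replace() calls rather than as the author's method, is to strip all markup with a regular expression and decode entities with html.unescape() from the standard library. clean_title, clean_link, and MARKUP are hypothetical names.

```python
import html
import re

# Matches HTML comments such as <!--red_beg--> and tags such as <em> or </em>.
MARKUP = re.compile(r"<!--.*?-->|<[^>]+>")

def clean_title(raw_title):
    """Strip highlight markup, then decode entities such as &ldquo; or &amp;."""
    return html.unescape(MARKUP.sub("", raw_title))

def clean_link(raw_link):
    """Decode entities in the href, turning '&amp;' back into '&'."""
    return html.unescape(raw_link)

print(clean_title("<em><!--red_beg-->Python<!--red_end--></em> &ldquo;tips&rdquo;"))
print(clean_link("http://example.com/a?x=1&amp;y=2"))
```

One difference from the original code: html.unescape() turns &ldquo; and &rdquo; into curly quotation marks, whereas the article's program replaces them with a straight double quote.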

Save the processed titles and links in a list:
title_link.append(link)
title_link.append(title)
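The flat list above alternates link, title, link, title, so later code has to step through it two items at a time. A possible alternative, sketched here with made-up sample data, is to store one (title, link) tuple per article; the names cleaned and articles are hypothetical.

```python
# Hypothetical sample data standing in for the cleaned regex matches.
cleaned = [
    ("http://example.com/a", "First title"),
    ("http://example.com/b", "Second title"),
]

articles = []  # one (title, link) tuple per article
for link, title in cleaned:
    articles.append((title, link))

# Each article can now be unpacked directly, with no index arithmetic.
for number, (title, link) in enumerate(articles, start=1):
    print(number, title, link)
```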
The titles and links from the search have now been collected. Next, write them to Excel.

Create the Excel workbook first:
import xlsxwriter
workbook = xlsxwriter.Workbook(search+'.xlsx')
worksheet = workbook.add_worksheet('微信')

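Write the data in title_link to Excel:

for i in range(0,len(title_link),2):
    row = i//2 + 2  # one article per row; row 1 is left for the headers written in the complete code
    worksheet.write('A'+str(row),title_link[i+1])  # title
    worksheet.write('C'+str(row),title_link[i])    # link
workbook.close()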
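xlsxwriter's write() also accepts zero-based (row, column) numbers, which avoids building cell names such as 'A'+str(i+1) by hand. The sketch below separates the row layout from the file writing so the layout can be checked on its own; sheet_cells and write_workbook are hypothetical names, and the English header text and sheet name are assumptions.

```python
def sheet_cells(articles, header=("Title", "Link")):
    """Return (row, col, value) triples: headers in row 0, then one article per row."""
    cells = [(0, 0, header[0]), (0, 2, header[1])]
    for row, (title, link) in enumerate(articles, start=1):
        cells.append((row, 0, title))  # column A
        cells.append((row, 2, link))   # column C, as in the article
    return cells

def write_workbook(path, articles):
    """Write the cells to an .xlsx file; needs the third-party xlsxwriter package."""
    import xlsxwriter  # imported here so sheet_cells() works without it

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet("WeChat")
    for row, col, value in sheet_cells(articles):
        worksheet.write(row, col, value)  # zero-based row and column
    workbook.close()

print(sheet_cells([("First title", "http://example.com/a")]))
```

Calling write_workbook('result.xlsx', articles) would then produce the file, with the headers in the first row and the articles starting on the second.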
Complete code:

'''
python3.4 + windows
羽凡-2017/7/11-
Searches WeChat articles and saves their titles and links to Excel.
Waits 10 seconds after each page to avoid being rate-limited.
Requires: urllib.request, xlsxwriter, re, time
'''
import re
import time
import urllib.request

import xlsxwriter

search = str(input("Search WeChat articles: "))
pagenum = int(input('Number of pages to search: '))
workbook = xlsxwriter.Workbook(search+'.xlsx')
search = urllib.request.quote(search)
title_link = []

# Send a browser-like User-Agent with every request
header = ('User-Agent','Mozilla/5.0')
opener = urllib.request.build_opener()
opener.addheaders = [header]
urllib.request.install_opener(opener)

# Group 1 captures the link, group 2 captures the title
pattern = re.compile(r'<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>')

for page in range(1,pagenum+1):
    url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)
    data = urllib.request.urlopen(url).read().decode()
    finddata = pattern.findall(data)
    #finddata = [('',''),('','')]
    for link, title in finddata:
        title = title.replace('<em><!--red_beg-->','')
        title = title.replace('<!--red_end--></em>','')
        # Titles may contain quotation marks
        title = title.replace('&ldquo;','"')
        title = title.replace('&rdquo;','"')
        link = link.replace('amp;','')
        title_link.append(link)
        title_link.append(title)
    print('Page '+str(page))
    time.sleep(10)

worksheet = workbook.add_worksheet('微信')
worksheet.set_column('A:A',70)
worksheet.set_column('C:C',100)
bold = workbook.add_format({'bold':True})
worksheet.write('A1','Title',bold)
worksheet.write('C1','Link',bold)
for i in range(0,len(title_link),2):
    row = i//2 + 2  # row 1 holds the headers, so data starts at row 2
    worksheet.write('A'+str(row),title_link[i+1])
    worksheet.write('C'+str(row),title_link[i])
workbook.close()
print('Finished writing to Excel!')
