Home  >  Article  >  Backend Development  >  Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

黄舟
黄舟Original
2017-10-09 10:25:352157browse

This article mainly introduces to you the relevant information about Python3 practical crawler to capture NetEase Cloud music hot reviews. The article introduces it in detail through the example code. It has certain reference learning value for everyone's study or work. It needs Friends, please follow the editor to learn together.

Preface

# I just got started with python crawler, and I haven’t written python for about half a month, and I almost forgot about it. So I was going to write a simple crawler to practice. I felt that the best features of NetEase Cloud Music were its accurate song recommendations and unique user reviews, so I wrote this method to capture the hot reviews in NetEase Cloud Music’s hot song list. reptile. I am also just getting started with crawling. If you have any comments or questions, please feel free to raise them. Let’s make progress together.

No more nonsense~ Let’s take a look at the detailed introduction.

Our goal is to crawl the popular comments of all songs in the hot song rankings in NetEase Cloud.

This can not only reduce the workload we need to crawl, but also save high-quality comments.

Implementation Analysis

First, we open the NetEase Cloud web version, as shown in the figure:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Click on the ranking list, and then click on the cloud music hot song list on the left, as shown in the picture:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Let’s first open a song at random and find out how to grab the specified The method of popular song reviews of songs is as shown in the picture. I chose a song that I like recently as an example:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

After entering, we will see the song review in Below this page, next we have to find a way to get these comments.

Next open the web console (open the developer tools for chrom, it should be similar for other browsers), press F12 under chrom, as shown in the picture:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Select Network, and then we press F5 to refresh. The data obtained after refreshing is as shown below:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

You can see that the browser sent a lot of information , so which one do we want? Here we can make a preliminary judgment through the status code. The status code marks the status of the server request. The status code here is 200, which means the request is normal, and 304, which means it is abnormal (there are many types of status codes. If you want If you want to know more about it, you can search it yourself. I won’t talk about the specific meaning of 304 here). So we generally only need to look at requests with status code 200. Also, we can roughly observe what information the server returns (or view the response) through the preview in the right column. By combining these two methods, we can quickly find the request we want to analyze. After repeated searches, I finally found a request containing a song review, as shown in the picture:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Maybe the screenshot is not very clear on CSDN, we found a request with the name R_SO_4_489998494 A song review containing this song was found in the POST request of ?csrf_token=. We send this block screenshot so that we can see it more clearly:

Request basic information:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Request header:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Form data in the request:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

We can see that it contains this The request URL for song reviews is http://music.163.com/weapi/v1/resource/comments/R_SO_4_489998494?csrf_token=. After changing a few songs, we found that the first part of the request is the same. It's just that the string of numbers immediately following R_SO_4_ is different. We can infer that each song has a specified id, and what follows R_SO_4_ is the id of the song.

Let's look at the submitted form data again. We will find that two data need to be filled in the form, named params and encSecKey. What follows is a large string of characters. If you change a few songs, you will find that the params and encSecKey of each song are different. Therefore, these two data may have been encrypted by a specific algorithm.

The data related to comments returned by the server is in json format, which contains very rich information (such as information about the commentator, comment date, number of likes, comment content, etc.), among which hotComments is our There are a total of 15 popular comments we are looking for, as shown in the picture:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

#At this point, we have determined the direction, that is, we only need to determine the two parameters params and encSecKey. Just value. But these two parameters are encrypted through a specific algorithm, what should we do? I discovered a pattern, http://music.163.com/weapi/v1/resource/comments/R_SO_4_489998494?csrf_token= The number after R_SO_4_ is the id value of this song, and for different songs, the param and encSecKey value, if you pass these two parameter values ​​​​of a song such as A to song B, then for the same number of pages, this parameter is universal, that is, the two parameter values ​​​​of the first page of A are passed to song B. For the two parameters of any other song, you can get the comments on the first page of the corresponding song. The same is true for the second page, third page, etc.

In fact, we only need to get the 15 popular comments on the first page, so we just need to find a song and add the params and encSecKey in the request on the first page of the song Copy these two parameter values ​​and you can use them.

Regarding how to decrypt these two parameters, there is actually an answer on the powerful Zhihu. Friends who are interested can go in and take a look (https://www.zhihu.com/question/ 36081767), we only need to use our lazy method to complete the requirements here, xixi.

So far, we have analyzed how to capture the popular comments of NetEase Cloud Music. Let’s analyze how to obtain the information of all songs in the cloud music hot song list.

We need to obtain the song names and corresponding id values ​​of all songs in the cloud music hot song list.

Similar to the above analysis steps, we first enter the URL of the hot song list, as shown in the figure:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Press F12 to enter the WEB workbench , as shown in the figure:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

# We found all the song information of the list in a GET request named toplist?id=3778678.

The information corresponding to the request is as shown in the figure:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Let’s preview the result returned by the request, as shown in the figure:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

We found the code containing song information at line 524 of the code, as shown in the figure:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Therefore, we only need to add the request code to , filter out codes containing information.

We use regular expressions for data filtering here.

By observing the characteristics, we can extract the song information we need through two regular expression filters.

For the first regular expression, we extracted the 525th line of code from all the codes returned by the request.

The first regular expression is as follows:

<ul class="f-hide">
<li>
<a href="/song\?id=\d*?" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >.*</a>
</li>
</ul>

The second regular expression we extract the song information we need in line 524, we need the song of the song name and id, the corresponding regular expressions are as follows:

Get the song name:

<li><a href="/song\?id=\d*?" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >(.*?)</a></li>

Get the song id:

<li><a href="/song\?id=(\d*?)" rel="external nofollow" rel="external nofollow" >.*?</a></li>

At this point, our entire process has been completed After the analysis, go to the code to see the specific details~~

##The code is as follows:


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import urllib.request
import urllib.error
import urllib.parse
import json
def get_all_hotSong():  #获取热歌榜所有歌曲名称和id
 url=&#39;http://music.163.com/discover/toplist?id=3778678&#39; #网易云云音乐热歌榜url
 html=urllib.request.urlopen(url).read().decode(&#39;utf8&#39;) #打开url
 html=str(html)  #转换成str
 pat1=r&#39;<ul class="f-hide"><li><a href="/song\?id=\d*?">.*</a></li></ul>&#39; #进行第一次筛选的正则表达式
 result=re.compile(pat1).findall(html)  #用正则表达式进行筛选
 result=result[0]  #获取tuple的第一个元素

 pat2=r&#39;<li><a href="/song\?id=\d*?">(.*?)</a></li>&#39; #进行歌名筛选的正则表达式
 pat3=r&#39;<li><a href="/song\?id=(\d*?)">.*?</a></li>&#39; #进行歌ID筛选的正则表达式
 hot_song_name=re.compile(pat2).findall(result) #获取所有热门歌曲名称
 hot_song_id=re.compile(pat3).findall(result) #获取所有热门歌曲对应的Id

 return hot_song_name,hot_song_id

def get_hotComments(hot_song_name,hot_song_id):
 url=&#39;http://music.163.com/weapi/v1/resource/comments/R_SO_4_&#39; + hot_song_id + &#39;?csrf_token=&#39; #歌评url
 header={ #请求头部
 &#39;User-Agent&#39;:&#39;Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36&#39;
}
 #post请求表单数据
 data={&#39;params&#39;:&#39;zC7fzWBKxxsm6TZ3PiRjd056g9iGHtbtc8vjTpBXshKIboaPnUyAXKze+KNi9QiEz/IieyRnZfNztp7yvTFyBXOlVQP/JdYNZw2+GRQDg7grOR2ZjroqoOU2z0TNhy+qDHKSV8ZXOnxUF93w3DA51ADDQHB0IngL+v6N8KthdVZeZBe0d3EsUFS8ZJltNRUJ&#39;,&#39;encSecKey&#39;:&#39;4801507e42c326dfc6b50539395a4fe417594f7cf122cf3d061d1447372ba3aa804541a8ae3b3811c081eb0f2b71827850af59af411a10a1795f7a16a5189d163bc9f67b3d1907f5e6fac652f7ef66e5a1f12d6949be851fcf4f39a0c2379580a040dc53b306d5c807bf313cc0e8f39bf7d35de691c497cda1d436b808549acc&#39;}
 postdata=urllib.parse.urlencode(data).encode(&#39;utf8&#39;) #进行编码
 request=urllib.request.Request(url,headers=header,data=postdata)
 reponse=urllib.request.urlopen(request).read().decode(&#39;utf8&#39;)
 json_dict=json.loads(reponse) #获取json
 hot_commit=json_dict[&#39;hotComments&#39;] #获取json中的热门评论


 num=0
 fhandle=open(&#39;./song_comments&#39;,&#39;a&#39;) #写入文件
 fhandle.write(hot_song_name+&#39;:&#39;+&#39;\n&#39;)

 for item in hot_commit:
  num+=1
  fhandle.write(str(num)+&#39;.&#39;+item[&#39;content&#39;]+&#39;\n&#39;)
 fhandle.write(&#39;\n==============================================\n\n&#39;)
 fhandle.close()




hot_song_name,hot_song_id=get_all_hotSong() #获取热歌榜所有歌曲名称和id

num=0
while num < len(hot_song_name): #保存所有热歌榜中的热评
 print(&#39;正在抓取第%d首歌曲热评...&#39;%(num+1))
 get_hotComments(hot_song_name[num],hot_song_id[num])
 print(&#39;第%d首歌曲热评抓取成功&#39;%(num+1))
 num+=1

The results of running the code are as follows:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Compare the song reviews of the song "If I Love You" on the web page with the song reviews we saved:

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture)

The information is correct~

Summary

The above is the detailed content of Python3 implements crawler to capture popular comments analysis of NetEase Cloud Music (picture). For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn