Share a Python method to crawl popular comments on NetEase Cloud Music

This article walks through a Python example that fetches the popular comments on NetEase Cloud Music. It should be a useful reference, so let's take a look.

I have recently been studying text mining. As the saying goes, even the cleverest cook cannot make a meal without rice: if you want to analyze text, you first need text. There are many ways to get it, such as downloading ready-made documents from the Internet or pulling data through APIs provided by third parties. Sometimes, though, the data we want has no direct download channel and no API. What then? A good option is a web crawler, that is, a program that pretends to be a user and fetches the data we want. Because computers are fast, a crawler can collect data easily and quickly.

So how do you write a crawler? Many languages can be used, such as Java, PHP, and Python. I personally prefer Python, because it not only ships with capable networking libraries but also has many excellent third-party libraries. Others have already built the wheel, so we can simply use it, which makes writing crawlers very convenient. It is no exaggeration to say that a small crawler can be written in fewer than 10 lines of Python, while other languages usually need considerably more code. Being concise and easy to read is a huge advantage of Python.

Okay, without further ado, let's get to today's topic. NetEase Cloud Music has become very popular in recent years. I have been a user for several years; before that I used QQ Music and Kugou. From my own experience, the best features of NetEase Cloud Music are its accurate song recommendations and its distinctive user comments (for the record, this is neither a sponsored post nor an advertisement, just my personal opinion, so please go easy on me!). Under a song there are often comments that have received a great many likes. On top of that, NetEase Cloud Music put selected user comments on display in the subway a few days ago, which made its comments popular all over again. So I want to analyze NetEase Cloud Music's comments and look for patterns, especially the common characteristics of hot comments. With this goal in mind, I started crawling them.

Python 2 has two built-in network libraries, urllib and urllib2, but they are not particularly convenient to use, so here we use a well-regarded third-party library, requests. With requests, more complex crawler tasks such as setting up proxies and simulating logins take only a few lines of code. If pip is already installed, just run pip install requests. The Chinese documentation is here: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html. If you have questions, refer to the official documentation, which gives a very detailed introduction. urllib and urllib2 are also quite useful, and I will introduce them if I get the chance.

Before officially introducing the crawler, let's first talk about how a crawler basically works. When we open a browser and visit a URL, we are essentially sending a request to the server. The server receives the request and returns data accordingly, and the browser parses that data and presents it to us. With code, we skip the browser: we send the request data directly to the server, take the data the server returns, and extract the information we want.

The problem is that the server sometimes verifies the requests it receives. If it considers a request illegitimate, it returns no data, or the wrong data. To avoid this, we sometimes need to disguise the program as a normal user in order to get a proper response. How? That depends on the difference between a user visiting a page through a browser and a program doing the same. Generally, when a browser accesses a page, it sends not only the URL but also additional information such as headers. These act as the request's identity papers: when the server sees them, it knows that the request comes from a normal browser and obediently returns the data. So our program has to behave like a browser and carry this identifying information when it sends a request, so that we can get the data smoothly.

Sometimes we must be logged in to get certain data, so we have to simulate a login. In essence, logging in through the browser means posting some form data (user name, password, and so on) to the server; once the server verifies it, we are logged in. The same holds for a program: whatever data the browser posts, we send as is. I will cover simulated login separately later.

Of course, things do not always go this smoothly, because some websites have anti-crawling measures. For example, if you access a site too quickly, your IP address may be blocked (Douban is a typical case). In that case we have to set up a proxy server, that is, change our IP address: if one IP is blocked, switch to another. How to do this will be discussed later.
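
The two counter-measures mentioned above, slowing down and switching IP addresses, could be sketched roughly as follows. This is only an illustration in Python 3 (the full script later in this article targets Python 2.7): the proxy addresses are placeholders from the range reserved for documentation, and `next_proxy` and `polite_pause` are hypothetical helper names, not part of any library.

```python
import itertools
import random
import time

# Placeholder proxies (192.0.2.0/24 is reserved for documentation); replace with real ones.
PROXY_POOL = [
    {"http": "http://192.0.2.10:8080"},
    {"http": "http://192.0.2.11:8080"},
]
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxy():
    """Return the next proxy mapping in round-robin order."""
    return next(_proxy_cycle)


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Wait for a random interval so that requests are not sent too quickly."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

A crawler loop might call `polite_pause()` before each request and pass `next_proxy()` as the `proxies` argument of `requests.get` or `requests.post`.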

Finally, let me introduce a little trick that I find very useful when writing crawlers. If you use Firefox or Chrome, you may have noticed the developer tools (Chrome) or the web console (Firefox). This tool is very useful because it shows exactly what the browser sends and what the server returns when you visit a website. That information is the key to writing a crawler. Below you will see how useful it can be.

---------- The main part starts here ----------

First open the web version of NetEase Cloud Music, pick any song, and open its page. Here I take Jay Chou's "Sunny Day" as an example, as shown in Figure 1.

Figure 1

Next, open the web console (in Chrome, open the developer tools; other browsers should be similar), as shown in Figure 2.

Figure 2

Then click the Network tab, clear all entries, and click Resend (equivalent to refreshing the page), so that we can see directly what the browser sent and what the server responded with, as shown in Figure 3.

Figure 3

The data obtained after refreshing is as shown in Figure 4 below:

Figure 4

You can see that the browser sends a lot of requests, so which one do we want? We can make a first cut using the status code, which indicates the outcome of each request. Here 200 means the request succeeded and the server returned the content, while 304 (Not Modified) means the browser's cached copy is still valid, so the server sent no new body. There are many other status codes; search for them if you want to know more. So we generally only need to look at requests with status code 200. We can also use the preview in the right-hand panel (or view the response) to get a rough idea of what the server returned, as shown in Figure 5 below:
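
When reading the status column, Python's standard library can help to look up what a code means. A minimal Python 3 sketch (`describe_status` is a hypothetical helper name):

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return "%d %s" % (status.value, status.phrase)


print(describe_status(200))  # 200 OK
print(describe_status(304))  # 304 Not Modified
```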

Figure 5

Combining these two methods, we can quickly find the request we want to analyze. Note that the request URL field in Figure 5 is the URL we need to request, and the request method is either GET or POST. The other thing to focus on is the request headers, which include User-Agent (client information), Referer (the page the request came from), and other fields. We generally send the header information with both GET and POST requests. The headers are shown in Figure 6 below:
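
To make the role of the headers concrete, here is a small Python 3 sketch that builds a request carrying browser-like headers without sending it. It uses the standard library's `urllib.request` only so that it runs without extra packages; the User-Agent string is made up, and `build_request` is a hypothetical helper.

```python
import urllib.request

URL = "http://music.163.com/"
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",  # made-up value
    "Referer": "http://music.163.com/",
}


def build_request(url, form_data=None):
    """Build (but do not send) a request that carries browser-like headers."""
    return urllib.request.Request(url, data=form_data, headers=BROWSER_HEADERS)


get_request = build_request(URL)
post_request = build_request(URL, form_data=b"key=value")
print(get_request.get_method())   # GET, because no body was supplied
print(post_request.get_method())  # POST, because a body was supplied
print(get_request.get_header("Referer"))
```

With requests, the same dictionary would be passed as the `headers` argument.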

Figure 6

Also note that GET requests generally put the parameters directly in the URL, in the form ?parameter1=value1&parameter2=value2, so no additional request parameters are needed. POST requests generally send their parameters separately in the request body instead of in the URL, so sometimes we also need to look at the parameters panel. After a careful search, we finally found the comment-related request at http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token=, as shown in Figure 7 below:
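
The difference between the two request styles can be illustrated with the standard library's URL helpers. This is a Python 3 sketch; the endpoint `http://example.com/search` and the parameter names are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://example.com/search"  # hypothetical endpoint
query = {"parameter1": "value1", "parameter2": "value2"}

# GET: the encoded parameters are appended to the URL after a question mark.
get_url = base_url + "?" + urlencode(query)
print(get_url)

# POST: the same encoding is used, but the result travels in the request body.
post_body = urlencode(query).encode("ascii")
print(post_body)

# Going the other way: recover the parameters from a URL.
parsed = parse_qs(urlsplit(get_url).query)
print(parsed)
```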

Figure 7

Click on this request and we find that it is a POST request with two parameters, params and encSecKey. The values of both parameters are very long and look encrypted, as shown in Figure 8 below:

Figure 8

The comment data returned by the server is in JSON format and contains very rich information (such as the commenter, the comment date, the number of likes, and the comment content), as shown in Figure 9 below. (In fact, hotComments holds the hot comments and comments is the array of regular comments.)
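
As a rough illustration of that structure, the Python 3 sketch below parses a made-up sample. The field names follow the ones used by the script later in this article; the values are invented, and a real response carries more fields.

```python
import json

# Made-up sample that mimics the shape described above.
SAMPLE_RESPONSE = """
{
  "total": 2,
  "hotComments": [
    {"content": "Great song", "likedCount": 120, "time": 1490677541180,
     "user": {"userId": 1, "nickname": "alice", "avatarUrl": "http://example.com/a.jpg"}}
  ],
  "comments": [
    {"content": "First!", "likedCount": 3, "time": 1490677542000,
     "user": {"userId": 2, "nickname": "bob", "avatarUrl": "http://example.com/b.jpg"}}
  ]
}
"""


def summarize(comment):
    """Pick a few fields out of one comment object."""
    user = comment["user"]
    return (user["userId"], user["nickname"], comment["likedCount"], comment["content"])


data = json.loads(SAMPLE_RESPONSE)
hot = [summarize(c) for c in data["hotComments"]]
regular = [summarize(c) for c in data["comments"]]
print(data["total"], hot, regular)
```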

Figure 9

At this point the direction is clear: we only need to determine the values of params and encSecKey. This turned out to be troublesome. I spent an afternoon without figuring out how the two parameters are encrypted, but I did find a pattern. In http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016?csrf_token=, the number after R_SO_4_ is the song's id. The params and encSecKey values can be reused across songs for the same page number: if the two values captured for the first page of song A are sent with the URL of any other song B, we get the first page of comments for B, and likewise for the second page, the third page, and so on. Unfortunately, the values differ from page to page, so this trick can only fetch a limited number of pages (which is still enough to get the total number of comments and the hot comments). To capture all the data, you have to understand how the two values are encrypted. I thought I would not be able to work it out, but last night I searched Zhihu with this question and actually found the answer. With that, we have covered everything needed to capture all of the comments on NetEase Cloud Music.
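
Two pieces of this are easy to show in isolation: the comment URL for a given song id, and the plain-text payload for a given page before it is encrypted. The Python 3 sketch below follows the URL pattern above and the payload format and offset rule used by the script later in this article; `comment_url` and `page_payload` are hypothetical helper names, and no encryption is performed here.

```python
COMMENT_URL = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_%s?csrf_token="
PAGE_SIZE = 20  # comments per page, as used by the script in this article


def comment_url(song_id):
    """Build the comment URL for a song id (the number after R_SO_4_)."""
    return COMMENT_URL % song_id


def page_payload(page):
    """Plain-text payload for a 1-based page number, before any encryption."""
    offset = (page - 1) * PAGE_SIZE
    total = "true" if page == 1 else "false"
    return '{rid:"", offset:"%d", total:"%s", limit:"%d", csrf_token:""}' % (
        offset, total, PAGE_SIZE)


print(comment_url(186016))
print(page_payload(1))
print(page_payload(3))
```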

As usual, the code comes last. It worked in my own tests:

#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
# @Time : 2017/3/28 8:46
# @Author : Lyrichu
# @Email : 919987476@qq.com
# @File : NetCloud_spider3.py
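'''
@Description:
NetEase Cloud Music comment crawler; it can crawl the complete set of comments for a song.
Partly based on the answer by @平胸小仙女 (https://www.zhihu.com/question/36081767).
The POST encryption part is also given there; see the original post:
Author: 平胸小仙女
Link: https://www.zhihu.com/question/36081767/answer/140287795
Source: Zhihu
'''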
from Crypto.Cipher import AES
import base64
import requests
import json
import codecs
import time

# Request headers
headers = {
 'Host':"music.163.com",
 'Accept-Language':"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
 'Accept-Encoding':"gzip, deflate",
 'Content-Type':"application/x-www-form-urlencoded",
 'Cookie':"_ntes_nnid=754361b04b121e078dee797cdb30e0fd,1486026808627; _ntes_nuid=754361b04b121e078dee797cdb30e0fd; JSESSIONID-WYYY=yfqt9ofhY%5CIYNkXW71TqY5OtSZyjE%2FoswGgtl4dMv3Oa7%5CQ50T%2FVaee%2FMSsCifHE0TGtRMYhSPpr20i%5CRO%2BO%2B9pbbJnrUvGzkibhNqw3Tlgn%5Coil%2FrW7zFZZWSA3K9gD77MPSVH6fnv5hIT8ms70MNB3CxK5r3ecj3tFMlWFbFOZmGw%5C%3A1490677541180; _iuqxldmzr_=32; vjuids=c8ca7976.15a029d006a.0.51373751e63af8; vjlast=1486102528.1490172479.21; __gads=ID=a9eed5e3cae4d252:T=1486102537:S=ALNI_Mb5XX2vlkjsiU5cIy91-ToUDoFxIw; vinfo_n_f_l_n3=411a2def7f75a62e.1.1.1486349441669.1486349607905.1490173828142; P_INFO=m15527594439@163.com|1489375076|1|study|00&99|null&null&null#hub&420100#10#0#0|155439&1|study_client|15527594439@163.com; NTES_CMT_USER_INFO=84794134%7Cm155****4439%7Chttps%3A%2F%2Fsimg.ws.126.net%2Fe%2Fimg5.cache.netease.com%2Ftie%2Fimages%2Fyun%2Fphoto_default_62.png.39x39.100.jpg%7Cfalse%7CbTE1NTI3NTk0NDM5QDE2My5jb20%3D; usertrack=c+5+hljHgU0T1FDmA66MAg==; Province=027; City=027; _ga=GA1.2.1549851014.1489469781; __utma=94650624.1549851014.1489469781.1490664577.1490672820.8; __utmc=94650624; __utmz=94650624.1490661822.6.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; playerid=81568911; __utmb=94650624.23.10.1490672820",
 'Connection':"keep-alive",
 'Referer':'http://music.163.com/'
}
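# Proxy servers. The addresses are the examples from the original post and may no longer work.
# requests expects the bare scheme names "http" and "https" as keys; with a trailing colon
# the entries are never matched and the proxies are silently ignored.
proxies = {
    'http': 'http://121.232.146.184',
    'https': 'https://144.255.48.197'
}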

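# offset is (page number - 1) * 20; total is "true" for the first page and "false" for all other pages
# first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}' # first parameter
second_param = "010001" # second parameter (not used below, because encSecKey is hard-coded)
# third parameter (not used below, because encSecKey is hard-coded)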
third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
# fourth parameter (used as the first AES key)
forth_param = "0CoJUm6Qyw8W8jud"

# Build the encrypted "params" value
def get_params(page): # page is the 1-based page number
    iv = "0102030405060708"
    first_key = forth_param
    second_key = 16 * 'F'
    if page == 1: # first page
        first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
        h_encText = AES_encrypt(first_param, first_key, iv)
    else:
        offset = str((page - 1) * 20)
        first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'false')
        h_encText = AES_encrypt(first_param, first_key, iv)
    # Encrypt a second time with the fixed second key
    h_encText = AES_encrypt(h_encText, second_key, iv)
    return h_encText

# Return encSecKey (a fixed value, because second_key in get_params is fixed)
def get_encSecKey():
    encSecKey = "257348aecb5e556c066de214e531faadd1c55d814f9be95fd06d6bff9f4c7a41f831f6394d5a3fd2e3881736d94a02ca919d952872e7d0a50ebfa1769a7a62d512f5f1ca21aec60bc3819a9c3ffca5eca9a0dba6d6f7249b06f5965ecfff3695b54e1c28f3f624750ed39e7de08fc8493242e26dbc4484a01c76f739e135637c"
    return encSecKey

# AES encryption (CBC mode), followed by Base64 encoding
def AES_encrypt(text, key, iv):
    pad = 16 - len(text) % 16 # number of padding characters needed to fill the last 16-byte block
    text = text + pad * chr(pad)
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    encrypt_text = encryptor.encrypt(text)
    encrypt_text = base64.b64encode(encrypt_text)
    return encrypt_text

# Fetch the comment data as JSON text
def get_json(url, params, encSecKey):
    data = {
        "params": params,
        "encSecKey": encSecKey
    }
    response = requests.post(url, headers=headers, data=data, proxies=proxies)
    return response.content

# Fetch the hot comments and return them as a list of lines
def get_hot_comments(url):
    hot_comments_list = []
    hot_comments_list.append(u"userId nickname avatarUrl time likedCount content\n") # header row
    params = get_params(1) # first page
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    hot_comments = json_dict['hotComments'] # hot comments
    print("%d hot comments in total!" % len(hot_comments))
    for item in hot_comments:
        comment = item['content'] # comment text
        likedCount = item['likedCount'] # number of likes
        comment_time = item['time'] # comment time (timestamp)
        userID = item['user']['userId'] # commenter id (the key is 'userId', as in get_all_comments)
        nickname = item['user']['nickname'] # nickname
        avatarUrl = item['user']['avatarUrl'] # avatar URL
        # The numeric fields must be converted to text before concatenation
        comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
        hot_comments_list.append(comment_info)
    return hot_comments_list

# Fetch all comments of one song
def get_all_comments(url):
    all_comments_list = [] # holds all comments
    all_comments_list.append(u"userId nickname avatarUrl time likedCount content\n") # header row
    params = get_params(1)
    encSecKey = get_encSecKey()
    json_text = get_json(url, params, encSecKey)
    json_dict = json.loads(json_text)
    comments_num = int(json_dict['total'])
    # 20 comments per page, rounded up
    if comments_num % 20 == 0:
        page = comments_num // 20
    else:
        page = comments_num // 20 + 1
    print("%d pages of comments in total!" % page)
    for i in range(page): # fetch page by page
        params = get_params(i + 1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        json_dict = json.loads(json_text)
        if i == 0:
            print("%d comments in total!" % comments_num) # total number of comments
        for item in json_dict['comments']:
            comment = item['content'] # comment text
            likedCount = item['likedCount'] # number of likes
            comment_time = item['time'] # comment time (timestamp)
            userID = item['user']['userId'] # commenter id
            nickname = item['user']['nickname'] # nickname
            avatarUrl = item['user']['avatarUrl'] # avatar URL
            comment_info = unicode(userID) + u" " + nickname + u" " + avatarUrl + u" " + unicode(comment_time) + u" " + unicode(likedCount) + u" " + comment + u"\n"
            all_comments_list.append(comment_info)
        print("Page %d fetched!" % (i + 1))
    return all_comments_list

# Write the comments to a text file
def save_to_file(comment_lines, filename):
    with codecs.open(filename, 'a', encoding='utf-8') as f:
        f.writelines(comment_lines)
    print("File written successfully!")

if __name__ == "__main__":
    start_time = time.time() # start time
    url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_186016/?csrf_token="
    filename = u"晴天.txt" # "Sunny Day"
    all_comments_list = get_all_comments(url)
    save_to_file(all_comments_list, filename)
    end_time = time.time() # end time
    print("Elapsed time: %f seconds." % (end_time - start_time))

I ran the code above on two of Jay Chou's popular songs, "Sunny Day" (more than 1.3 million comments) and "Confession Balloon" (more than 200,000 comments). The former ran for about 20 minutes and the latter for more than 6,600 seconds (nearly 2 hours). The screenshots are as follows:

[Screenshot of the first run]

[Screenshot of the second run]

Note that the fields are separated by spaces. Each line contains the user ID, user nickname, user avatar URL, comment time, total number of likes, and comment content.
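
For later analysis, a saved line can be split back into its fields. The Python 3 sketch below assumes the comment time is a Unix timestamp in milliseconds (an assumption; the script only calls it a timestamp) and uses a made-up sample line. `format_comment_time` and `split_comment_line` are hypothetical helpers.

```python
from datetime import datetime, timezone


def format_comment_time(timestamp_ms):
    """Convert a millisecond Unix timestamp into a readable UTC string."""
    moment = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def split_comment_line(line):
    """Split one saved line into its six fields.

    Only the first five spaces are treated as separators, so spaces inside the
    comment text survive. A nickname containing spaces would still break this.
    """
    user_id, nickname, avatar_url, comment_time, liked_count, content = (
        line.rstrip("\n").split(" ", 5)
    )
    return {
        "user_id": user_id,
        "nickname": nickname,
        "avatar_url": avatar_url,
        "time": format_comment_time(int(comment_time)),
        "liked_count": int(liked_count),
        "content": content,
    }


sample_line = "2 bob http://example.com/b.jpg 1490677541180 3 Nice song, love it\n"  # made-up line
fields = split_comment_line(sample_line)
print(fields)
```

Since nicknames and comments can contain spaces, a tab or another delimiter that cannot occur in the data might be a safer choice than a space when saving.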
