
Python 3 crawler example: NetEase Cloud Music crawler

青灯夜游
2018-10-23 16:35:15

This article walks through a Python 3 crawler for NetEase Cloud Music as a worked example, for readers who need a reference.

The goal is to crawl all comments on a specified NetEase Cloud Music song and generate a word cloud from them.

Specific steps:

1. Implement the JS encryption

The ajax interface is not hard to find. The problem is that the data it sends is produced by JS encryption, so you need to read the JS code.

By setting breakpoints in the debugger, you can find that the data is encrypted by the window.asrsea function in core_8556f33641851a422ec534e33e6fa5a4.js?8556f33641851a422ec534e33e6fa5a4.js.

Through further search, you can find the following function:

function() {
    // Generate a random string (called with length 16)
    function a(a) {
        var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
        for (d = 0; a > d; d += 1)
            e = Math.random() * b.length,
            e = Math.floor(e),
            c += b.charAt(e);
        return c
    }
    // AES encryption
    function b(a, b) {
        var c = CryptoJS.enc.Utf8.parse(b)
          , d = CryptoJS.enc.Utf8.parse("0102030405060708")
          , e = CryptoJS.enc.Utf8.parse(a)
          , f = CryptoJS.AES.encrypt(e, c, {
            iv: d,
            mode: CryptoJS.mode.CBC
        });
        return f.toString()
    }
    // RSA encryption
    function c(a, b, c) {
        var d, e;
        return setMaxDigits(131),
        d = new RSAKeyPair(b,"",c),
        e = encryptedString(d, a)
    }
    // Produce the encrypted result
    function d(d, e, f, g) {
        var h = {}
          , i = a(16);
        return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
    }
    function e(a, b, d, e) {
        var f = {};
        return f.encText = c(a + e, b, d),
        f
    }
}()

So we need to implement the four functions a, b, c, and d above in Python. The first one, which generates a random string, is not difficult. The code is as follows:

from math import floor
from random import random


# Generate a random string
def generate_random_string(length):
    string = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    # Start with an empty string
    random_string = ""
    # Append `length` randomly chosen characters
    for i in range(length):
        random_string += string[int(floor(random() * len(string)))]
    return random_string
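
As a variation on the function above, the same kind of string can be built with random.choice from the standard library. This is only a sketch: the name generate_random_string_choice and the ALPHABET constant are made up for illustration and are not part of the original script.

```python
import random
import string

# Same 62-character alphabet as the JS function a(): a-z, A-Z, 0-9
ALPHABET = string.ascii_letters + string.digits


def generate_random_string_choice(length):
    # Hypothetical helper: pick one character at random, `length` times
    return "".join(random.choice(ALPHABET) for _ in range(length))


key = generate_random_string_choice(16)
print(len(key))  # 16
```

Either version produces a 16-character string that can serve as the second AES key.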

The second function implements AES encryption, which requires the Crypto library. If it is not installed, install the pycrypto library, which provides the Crypto package (pycrypto is no longer maintained; pycryptodome is a maintained fork that installs under the same Crypto name). If, after installation, the import only works as crypto and not Crypto, open the Lib\site-packages\crypto folder in the Python installation directory. If it contains a Cipher folder, go back to the Lib\site-packages directory and rename crypto to Crypto; the import should then succeed.

Since the plaintext length for AES encryption must be a multiple of 16, we pad the plaintext until its length is a multiple of 16. The AES mode is AES.MODE_CBC and the initialization vector is iv = '0102030405060708'.
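
The padding rule used here (append n bytes, each with value n) is the scheme commonly called PKCS#7, which is also the default padding in CryptoJS. Here is a small standard-library sketch of that rule on its own, without any encryption; the names pad, unpad, and BLOCK_SIZE are chosen for illustration.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pad(data: bytes) -> bytes:
    # Append n bytes, each with value n, so the length becomes a multiple of 16.
    # When the input is already aligned, a full extra block of 16 is appended.
    n = BLOCK_SIZE - len(data) % BLOCK_SIZE
    return data + bytes([n]) * n


def unpad(data: bytes) -> bytes:
    # The last byte says how many padding bytes to strip
    return data[:-data[-1]]


padded = pad(b"hello")
print(len(padded))    # 16
print(padded[-1])     # 11
print(unpad(padded))  # b'hello'
```

Because padding is always added, the receiver can strip it unambiguously even when the original message was already a multiple of 16 bytes long.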

The code to implement AES encryption is as follows:

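import base64

from Crypto.Cipher import AES


# AES encryption
def aes_encrypt(msg, key):
    # Work on bytes so the padding length is counted in bytes
    data = msg.encode('utf-8')
    # Always pad: n bytes of value n bring the length to a multiple of 16
    padding = 16 - len(data) % 16
    data += bytes([padding]) * padding
    # Initialization vector for encryption and decryption (must be 16 bytes)
    iv = b'0102030405060708'
    # AES in CBC mode; the key is passed as bytes as well
    cipher = AES.new(key.encode('utf-8'), AES.MODE_CBC, iv)
    # Encryption returns bytes
    encrypt_bytes = cipher.encrypt(data)
    # Base64-encode the ciphertext, which yields a bytes string
    encode_string = base64.b64encode(encrypt_bytes)
    # Decode the bytes string as UTF-8 and return the text
    return encode_string.decode('utf-8')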

The third function implements RSA encryption.

In RSA encryption, both the plaintext and the ciphertext are numbers. The ciphertext is the number representing the plaintext raised to the power E, modulo N. The hexadecimal string obtained after RSA encryption has length 256; if the result is shorter, we left-pad it with zeros.

The code to implement RSA encryption is as follows:

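# RSA encryption
def rsa_encrypt(random_string, key, f):
    # Reverse the random string
    string = random_string[::-1]
    # Convert the reversed string to bytes
    text = bytes(string, 'utf-8')
    # RSA: (text as an integer) ** key mod f; three-argument pow avoids
    # building the huge intermediate power
    sec_key = pow(int(text.hex(), 16), int(key, 16), int(f, 16))
    # Return lowercase hex, left-padded with zeros to 256 characters
    return format(sec_key, 'x').zfill(256)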
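
To see the arithmetic on a small scale, here is a toy example with tiny textbook numbers (n = 61 × 53 = 3233, e = 17, d = 2753). These values are chosen for illustration only and have nothing to do with the real key used by the site. The sketch also checks that Python's three-argument pow(m, e, n) agrees with m ** e % n.

```python
# Toy RSA parameters, for illustration only (not the site's key)
n = 61 * 53      # modulus N = 3233
e = 17           # public exponent E
d = 2753         # private exponent matching e for this n

m = 65                      # the "plaintext" as a number
c = pow(m, e, n)            # ciphertext = m^E mod N
assert c == m ** e % n      # same value as the two-step form
assert pow(c, d, n) == m    # decrypting recovers the plaintext

# Hex-encode and left-pad with zeros to a fixed width, as zfill(256) does
padded_hex = format(c, "x").zfill(8)
print(padded_hex)
```

With the real 1024-bit modulus the result has at most 256 hexadecimal digits, which is why the full function pads to 256.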

The fourth function returns the two encrypted parameters. It takes four arguments. The first, JSON.stringify(i3x), has the following content, in which the offset and limit parameters are required. The value of offset is (page number - 1) * 20, and the value of limit is 20:

'{"offset":'+str(offset)+',"total":"True","limit":"20","csrf_token":""}'
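
The same payload could also be built with json.dumps instead of string concatenation. A minimal sketch, assuming the server accepts any valid JSON with these fields; build_payload is a hypothetical name, and the compact separators are used only to mirror the string above.

```python
import json


def build_payload(page, limit=20):
    # Hypothetical helper: offset is (page - 1) * limit
    offset = (page - 1) * limit
    payload = {
        "offset": offset,
        "total": "True",
        "limit": str(limit),
        "csrf_token": "",
    }
    # Compact separators avoid the spaces json.dumps inserts by default
    return json.dumps(payload, separators=(",", ":"))


print(build_payload(1))
# {"offset":0,"total":"True","limit":"20","csrf_token":""}
```

Building the dict first makes it harder to produce malformed JSON when more fields are added later.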

The values of the second, third, and fourth parameters are obtained from Zj4n.emj. They appear as the constants e, f, and key, respectively, in the code below.


The value of encText is encrypted with AES twice, and encSecKey is obtained through RSA encryption. The code is as follows:

# Build the request parameters
def get_params(page):
    # Offset of the first comment on this page
    offset = (page - 1) * 20
    # offset and limit are required; the other parameters are optional
    msg = '{"offset":' + str(offset) + ',"total":"True","limit":"20","csrf_token":""}'
    key = '0CoJUm6Qyw8W8jud'
    f = '00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a87' \
        '6aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9' \
        'd05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b' \
        '8e289dc6935b3ece0462db0a22b8e7'
    e = '010001'
    # Generate a random string of length 16
    i = generate_random_string(16)
    # First AES encryption
    enc_text = aes_encrypt(msg, key)
    # The second AES encryption produces the value of params
    encText = aes_encrypt(enc_text, i)
    # RSA encryption produces the value of encSecKey
    encSecKey = rsa_encrypt(i, e, f)
    return encText, encSecKey

2. Parse and save comments


Looking at the preview of the response, you can see that the usernames and comment contents are all stored as JSON data, so parsing is very easy: just extract nickname and content directly. The data obtained is saved in a txt file named after the song. The code is as follows:

import requests


# Crawl the comments
def get_comments(data):
    # data = [song_id, song_name, page_num]
    # Assumes a module-level `headers` dict (e.g. a User-Agent) is defined elsewhere
    url = 'https://music.163.com/weapi/v1/resource/comments/R_SO_4_' + str(data[0]) + '?csrf_token='
    # Get the two encrypted parameters
    text, key = get_params(data[2])
    # Send the POST request
    res = requests.post(url, headers=headers, data={"params": text, "encSecKey": key})
    if res.status_code == 200:
        print("Crawling comments on page {}".format(data[2]))
        # Parse
        comments = res.json()['comments']
        # Save
        with open(data[1] + '.txt', 'a', encoding="utf-8") as f:
            for i in comments:
                f.write(i['content'] + "\n")
    else:
        print("Crawl failed!")
3. Generate word cloud

Before this step, you need to install the jieba and wordcloud modules. jieba is a module for Chinese word segmentation, and wordcloud is a module for generating word clouds; you can read up on both yourself.

I won’t go into details about this part. The specific code is as follows:

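import jieba
from wordcloud import WordCloud


# Generate the word cloud
def make_cloud(txt_name):
    with open(txt_name + ".txt", 'r', encoding="utf-8") as f:
        txt = f.read()
    # Segment with jieba; join with spaces so WordCloud can split the words
    text = ' '.join(jieba.cut(txt))
    # Define the word cloud
    wc = WordCloud(
        font_path="font.ttf",
        width=1200,
        height=800,
        max_words=100,
        max_font_size=200,
        min_font_size=10
    )
    # Generate the word cloud
    wc.generate(text)
    # Save as an image
    wc.to_file(txt_name + ".png")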
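
A word cloud sizes each word by how often it occurs. As a rough sketch of that counting step using only the standard library, the snippet below counts tokens in an already segmented, space-separated string. The sample text and the name top_words are made up for illustration, and this is not a description of how wordcloud works internally.

```python
from collections import Counter


def top_words(segmented_text, n=3):
    # Hypothetical helper: split on whitespace and count each token
    counts = Counter(segmented_text.split())
    return counts.most_common(n)


# Made-up sample standing in for segmented comment text
sample = "music love music night love music"
print(top_words(sample, 2))  # [('music', 3), ('love', 2)]
```
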

The complete code has been uploaded to GitHub (including the font.ttf file): https://github.com/QAQ112233/WangYiYun


Statement:
This article is reproduced from cnblogs.com. If there is any infringement, please contact admin@php.cn to have it removed.