
Example of crawling QQ Music with Python

零下一度 (Original) · 2017-07-18

1. Preface


There is still a lot of good music on QQ Music, and sometimes I want to download it, but downloading from the web page always involves an annoying login step. So, here comes a QQ Music crawler. In my view, the most important part of writing a crawler is finding the URLs of the elements to be crawled. Let's start looking (don't laugh at me if I get something wrong).


2. Crawling QQ Music singles with Python


A MOOC video I watched earlier gave a good explanation of the general steps for writing a crawler, and we will follow them here as well.

Crawler steps

1. Determine the target

First of all, we need to clarify the goal: this time we crawl the singles of the QQ Music singer Andy Lau.

The overall flow is: determine the target (in the course example this was Baidu Encyclopedia) -> analyze the target (strategy: URL format (range), data format, page encoding) -> write the code -> run the crawler.

2. Analyze the target

Song link:

From the screenshot on the left you can see that the singles are arranged with pagination: each page shows 30 items and there are 30 pages in total. Clicking a page number, or the ">" on the far right, jumps to the next page, and the browser sends an asynchronous ajax request to the server. From the request link you can see the begin and num parameters, which are the starting song index (the screenshot shows page 2, so the starting index is 30) and the number of items returned per page (30). The server responds with song information in JSON format (MusicJsonCallbacksinger_track({"code":0,"data":{"list":[{"Flisten_count1":......]})). If you only wanted the song information, you could splice such a request link directly and parse the returned JSON data (a small sketch of this alternative follows below). Here we do not parse the JSON directly; instead I use Python Selenium: after each page of singles is obtained and parsed, click ">" to jump to the next page and keep parsing until all single information has been recorded. Finally, each single's own link is requested to get its detailed information.
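If you did want to go the link-splicing route, a minimal sketch of requesting one page of singles directly and unwrapping the JSONP response might look like the following. The article does not give the full endpoint URL, so SINGER_TRACK_URL below is a placeholder, and the parameters are only those visible in the request above (begin, num, and the MusicJsonCallbacksinger_track callback name).

import json
import re
import urllib.parse
import urllib.request

# Placeholder: the real singer-track endpoint is not given in the article.
SINGER_TRACK_URL = 'https://example.invalid/singer_track'

def fetch_song_page(begin, num=30):
    """Request one page (30 songs) starting at index `begin` and parse the JSONP body."""
    params = urllib.parse.urlencode({
        'begin': begin,
        'num': num,
        'jsonpCallback': 'MusicJsonCallbacksinger_track',
    })
    with urllib.request.urlopen(SINGER_TRACK_URL + '?' + params) as resp:
        body = resp.read().decode('utf-8')
    # strip the MusicJsonCallbacksinger_track( ... ) wrapper to get plain JSON
    payload = re.search(r'^\w+\((.*)\)$', body, re.S).group(1)
    return json.loads(payload)['data']['list']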

The screenshot on the right is the page source. All the song information sits in a div layer with class name mod_songlist, under the ul list with class name songlist__list. Each li child element displays one single, and the a tag under the element with class name songlist__album contains the single's link, name and duration.

3. Write code

1) Download the web page content. Here we use Python's urllib standard library and wrap it in a download method:

import urllib.error
import urllib.request

def download(url, user_agent='wswp', num_retries=2):
    if url is None:
        return None
    print('Downloading:', url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # send the request with a browser User-Agent
    try:
        html = urllib.request.urlopen(request).read().decode('utf-8')
    except urllib.error.URLError as e:
        print('Downloading Error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry when the return code is a 5xx HTTP error (2 retries by default)
                return download(url, user_agent, num_retries - 1)
    return html
2) Parse the web page content. Here we use the third-party library BeautifulSoup; for details you can refer to the BeautifulSoup API documentation.


from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException

def music_scrapter(html, page_num=0):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        mod_songlist_div = soup.find_all('div', class_='mod_songlist')
        songlist_ul = mod_songlist_div[1].find('ul', class_='songlist__list')
        # parse the song information in each li
        lis = songlist_ul.find_all('li')
        for li in lis:
            a = li.find('div', class_='songlist__album').find('a')
            music_url = a['href']  # single link
            urls.add_new_url(music_url)  # save the single link
            # print('music_url:{0} '.format(music_url))
        print('total music link num:%s' % len(urls.new_urls))
        next_page(page_num + 1)
    except TimeoutException as err:
        print('Error parsing page:', err.args)
        return next_page(page_num + 1)
    return None
def get_music():
    try:
        while urls.has_new_url():
            # print('urls count:%s' % len(urls.new_urls))
            # visit the single link to get the song details
            new_music_url = urls.get_new_url()
            print('urls left count:%s' % str(len(urls.new_urls) - 1))
            html_data_info = download(new_music_url)
            # if the page download failed, move on to the next iteration so the program is not interrupted
            if html_data_info is None:
                continue
            soup_data_info = BeautifulSoup(html_data_info, 'html.parser')
            if soup_data_info.find('div', class_='none_txt') is not None:
                print(new_music_url, '   Sorry, this album cannot be viewed for copyright reasons!')
                continue
            mod_songlist_div = soup_data_info.find('div', class_='mod_songlist')
            songlist_ul = mod_songlist_div.find('ul', class_='songlist__list')
            lis = songlist_ul.find_all('li')
            del lis[0]  # drop the first li
            # print('len(lis):%s' % len(lis))
            for li in lis:
                a_songname_txt = li.find('div', class_='songlist__songname').find('span', class_='songlist__songname_txt').find('a')
                if 'https' not in a_songname_txt['href']:  # if the single link has no scheme, prepend it
                    song_url = 'https:' + a_songname_txt['href']
                else:
                    song_url = a_songname_txt['href']
                song_name = a_songname_txt['title']
                singer_name = li.find('div', class_='songlist__artist').find('a').get_text()
                song_time = li.find('div', class_='songlist__time').get_text()
                music_info = {}
                music_info['song_name'] = song_name
                music_info['song_url'] = song_url
                music_info['singer_name'] = singer_name
                music_info['song_time'] = song_time
                collect_data(music_info)
    except Exception as err:  # skip on any download or parsing exception
        print('Downloading or parse music information error continue:', err.args)
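The next_page() helper called by music_scrapter is not shown in the article; conceptually it clicks the ">" button with Selenium and feeds the freshly rendered page source back into music_scrapter. A minimal sketch, assuming a module-level Selenium driver; the element locator and the fixed sleep are assumptions, not taken from the article:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
TOTAL_PAGES = 30             # the singles span 30 pages in total

def next_page(page_num):
    """Click the '>' (next page) button and parse the newly rendered page."""
    if page_num >= TOTAL_PAGES:
        return
    # locator is an assumption; inspect the real page to find the actual next-page element
    driver.find_element(By.CSS_SELECTOR, '.mod_page_nav .next').click()
    time.sleep(2)  # crude wait for the ajax-rendered content; WebDriverWait would be more robust
    music_scrapter(driver.page_source, page_num)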
4. Execute the crawler

The crawler is now running, crawling the album links page by page and saving them into the set; finally, get_music() fetches each single's name, link, singer name and duration and saves them to an Excel file.

[Two screenshots of the crawler run]
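How the pieces are wired together at run time is not spelled out in the article; a rough sketch under the assumptions above (the singer page URL is not given in the article and is left as a placeholder, and collect_data and the Excel helper names are illustrative):

SINGER_URL = '...'  # Andy Lau's singles page (URL not given in the article)

urls = UrlManager()   # shared link manager used by music_scrapter() and get_music()
collected = []        # filled by collect_data(music_info)

def run():
    driver.get(SINGER_URL)                 # open the singles page in the browser
    music_scrapter(driver.page_source)     # parse page 1, then page through via next_page()
    get_music()                            # visit every collected link and gather song details
    excel_tool = ExcelTool()               # see the openpyxl sketch further below
    excel_tool.write_to_excel(collected)   # persist the results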
3. Summary of crawling QQ Music singles with Python

1. The singles are paginated; switching to the next page uses an asynchronous ajax request that fetches JSON data from the server and renders it on the page, while the address bar link stays unchanged, so the pages cannot be requested by simply splicing the address bar link. At first I thought about simulating the ajax request with Python's urllib library, but then I decided to use Selenium instead. Selenium simulates real browser operation well, and locating page elements is also very convenient. It simulates clicking the next-page button, switching through the single pages one by one, and the page source is then parsed with BeautifulSoup to obtain the single information.

2. The URL manager uses a set data structure to save the single links. Why use a set? Because several singles may come from the same album (the same album URL), so this reduces the number of requests.

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # use a set to filter out duplicate elements
        self.old_urls = set()  # use a set to filter out duplicate elements

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
3. Reading and writing Excel through the third-party library openpyxl is very convenient, and the single information can be saved nicely in an Excel file.

def write_to_excel(self, content):
    try:
        for row in content:
            self.workSheet.append([row['song_name'], row['song_url'], row['singer_name'], row['song_time']])
        self.workBook.save(self.excelName)  # save the single information to the Excel file
    except Exception as err:
        print('write to excel error', err.args)
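write_to_excel assumes self.workBook, self.workSheet and self.excelName are initialized elsewhere; a minimal sketch of what that setup might look like with openpyxl (the class name and file name are illustrative, not taken from the article):

from openpyxl import Workbook

class ExcelTool(object):
    def __init__(self, excel_name='qq_music.xlsx'):  # illustrative file name
        self.excelName = excel_name
        self.workBook = Workbook()
        self.workSheet = self.workBook.active
        # header row matching the columns appended in write_to_excel
        self.workSheet.append(['song_name', 'song_url', 'singer_name', 'song_time'])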
4. Postscript

Finally, something to celebrate: the single information was successfully crawled from QQ Music. Selenium was indispensable to that success, and only some of its simpler features were used this time. We will learn more about Selenium in the future, not only for crawlers but also for UI automation.

Points that need to be optimized in the future:

1. There are many download links, and downloading them one by one is slow. Multi-threaded concurrent downloading is planned for later (a sketch covering points 1 and 2 follows this list).

2. Requests are sent too quickly. To avoid the server banning the IP, and to avoid hitting the same domain too frequently later on, a waiting mechanism should be added so that there is an interval between each request.

3. Parsing web pages is an important step. Regular expressions, BeautifulSoup and lxml can all be used; the BeautifulSoup library is used at the moment. In terms of efficiency, BeautifulSoup is not as fast as lxml, so lxml will be tried later.
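For points 1 and 2 above, a hedged sketch of what concurrent downloading with a simple per-request delay could look like, using the standard library's ThreadPoolExecutor (the worker count and delay are arbitrary example values, not measured settings):

import time
from concurrent.futures import ThreadPoolExecutor

DELAY_SECONDS = 1  # pause after each request to avoid hammering the server

def polite_download(url):
    html = download(url)       # reuse the download() helper defined earlier
    time.sleep(DELAY_SECONDS)  # crude rate limiting; a per-domain throttle would be better
    return html

def download_all(url_list, max_workers=4):
    """Download several single pages concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(polite_download, url_list))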

