First, the preparation work:
Python 2.7.11: Download Python
PyCharm: Download PyCharm
Python 2 and Python 3 are currently released in parallel; I am using Python 2 as my environment here. PyCharm is a fairly efficient Python IDE, but it is paid software.
Basic ideas for implementation
First of all, our target website: the Android market (apk.hiapk.com)
Click [App] to enter the key page:
After jumping to the application page, we need to pay attention to three places, marked by the red boxes in the picture below:
First, the URL in the address bar; second, the "Free Download" button; and third, the page-turning options at the bottom. Clicking the "Free Download" button immediately downloads the corresponding APP, so our idea is to grab the link behind that button and download the APP directly.
Writing a crawler
The first problem to solve: how do we get the download link mentioned above? Here I have to introduce the basic principle of how a browser displays a web page. To put it simply, the browser is a tool similar to a parser: when it receives HTML and other code, it parses and renders it according to the corresponding rules, which is how we see the page.
I am using Google Chrome here. Right-click on the page and click "Inspect" to see the original HTML code of the webpage:
Don't worry if the HTML code looks dazzling at first. Chrome's Inspect Element panel has a handy little feature that helps us locate the HTML code corresponding to a page control.
Location:
As shown in the picture above, click the small arrow in the rectangular box above, click the corresponding position on the page, and the HTML code on the right will be automatically positioned and highlighted.
Next we locate the HTML code corresponding to the download button:
You can see that the code for the button contains the relative download link [/appdown/com.tecent.mm]; adding the site prefix gives the complete download link http://apk.hiapk.com/appdown/com.tecent.mm
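A one-line sketch of that step (the variable names here are just for illustration):

prefix = "http://apk.hiapk.com"              # site prefix
relative_link = "/appdown/com.tecent.mm"     # relative link taken from the button's HTML
full_link = prefix + relative_link           # -> http://apk.hiapk.com/appdown/com.tecent.mm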
First, use Python to get the HTML of the entire page. This is very simple: just call requests.get(url) with the corresponding URL filled in.
Next, when extracting the key information from the page, adopt the idea of "grab the big first, then the small". You can see that there are 10 APPs on one page, corresponding to 10 li items in the HTML code:
Each li tag contains the corresponding APP's attributes (name, download link, etc.). So the first step is to extract these 10 li tags:
def geteveryapp(self, source):
    everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
    #everyapp2 = re.findall('(<p class="button_bg button_1 right_mt">.*?</p>)', everyapp, re.S)
    return everyapp
Only simple regular expression knowledge is needed here.
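If re.findall with the re.S flag is unfamiliar, here is a tiny self-contained example on a made-up snippet:

import re

# re.S lets '.' also match newlines, so a tag that spans several lines is still captured
sample = '<li class="list_item">\napp one\n</li><li class="list_item">\napp two\n</li>'
items = re.findall('(<li class="list_item".*?</li>)', sample, re.S)
print len(items)   # 2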
Extract the download link in the li tag:
def getinfo(self, eachclass):
    info = {}
    str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
    app_url = re.search('"(.*?)"', str1).group(1)
    appdown_url = app_url.replace('appinfo', 'appdown')
    info['app_url'] = appdown_url
    print appdown_url
    return info
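A quick way to sanity-check this step is to run the same extraction on a simplified, made-up li snippet (real pages contain much more markup):

import re

sample_li = '<li class="list_item"><a href="/appinfo/com.tecent.mm">WeChat</a></li>'
str1 = str(re.search('<a href="(.*?)">', sample_li).group(0))   # the whole <a href="..."> tag
app_url = re.search('"(.*?)"', str1).group(1)                   # /appinfo/com.tecent.mm
print app_url.replace('appinfo', 'appdown')                     # /appdown/com.tecent.mm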
The next difficulty is turning pages. After clicking the page-turning button at the bottom, we can see that the URL in the address bar changes as follows:
Suddenly it all makes sense: we can turn pages by replacing the corresponding id value in the URL on each request.
def changepage(self, url, total_page):
    now_page = int(re.search('pi=(\d)', url).group(1))
    page_group = []
    for i in range(now_page, total_page + 1):
        link = re.sub('pi=\d', 'pi=%s' % i, url)
        page_group.append(link)
    return page_group
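For example, assuming a hypothetical list URL that ends in pi=1 (substitute the real one from the address bar) and the spider class from the full code at the end, changepage expands it into one link per page:

crawler = spider()   # prints the startup message from __init__
for link in crawler.changepage('http://apk.hiapk.com/apps?pi=1', 3):   # hypothetical URL
    print link
# http://apk.hiapk.com/apps?pi=1
# http://apk.hiapk.com/apps?pi=2
# http://apk.hiapk.com/apps?pi=3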
Crawler results
Now that the key parts have been covered, let's look at the final result of the crawler:
The results saved in a TXT file look like this:
Just copy them into Thunder (迅雷) to download everything in batches at high speed.
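Alternatively, instead of Thunder, one could download an APK directly from Python with requests; a minimal sketch (the link and file name below are placeholders):

import requests

link = 'http://apk.hiapk.com/appdown/com.tecent.mm'   # placeholder download link
resp = requests.get(link, stream=True)                # stream=True avoids loading the whole file into memory
with open('com.tecent.mm.apk', 'wb') as f:
    for chunk in resp.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)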
The full code is attached below:
# -*- coding:utf8 -*-
import requests
import re
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

class spider(object):
    def __init__(self):
        print u'开始爬取内容'   # "starting to crawl"

    def getsource(self, url):
        html = requests.get(url)
        return html.text

    def changepage(self, url, total_page):
        now_page = int(re.search('pi=(\d)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            link = re.sub('pi=\d', 'pi=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryapp(self, source):
        everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
        return everyapp

    def getinfo(self, eachclass):
        info = {}
        str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
        app_url = re.search('"(.*?)"', str1).group(1)
        appdown_url = app_url.replace('appinfo', 'appdown')
        info['app_url'] = appdown_url
        print appdown_url
        return info
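The listing above stops before the saving step and the driver code. A minimal sketch of how those remaining pieces might look, continuing the script above (the saveinfo helper, the info.txt file name, the start URL, and the page count are all assumptions, not from the original):

def saveinfo(classinfo):
    # Write each collected link to a TXT file, prepending the site prefix
    # so the lines can be pasted straight into Thunder
    prefix = 'http://apk.hiapk.com'
    f = open('info.txt', 'a')
    for each in classinfo:
        f.write(prefix + each['app_url'] + '\n')
    f.close()

if __name__ == '__main__':
    appinfo = []
    url = 'http://apk.hiapk.com/apps?pi=1'   # hypothetical start URL; use the real list URL
    appurl = spider()
    for link in appurl.changepage(url, 5):   # crawl pages 1 to 5
        html = appurl.getsource(link)
        for each in appurl.geteveryapp(html):
            appinfo.append(appurl.getinfo(each))
    saveinfo(appinfo)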
Summary
The target web page chosen here has a relatively clear and simple structure, so this is a fairly basic crawler. Please excuse the somewhat messy code. That is all for this article; I hope it is of some help for your study or work. If you have any questions, feel free to leave a comment.
For more articles on crawling APP download links with Python, please follow the PHP Chinese website (PHP中文网)!
