I used Python to crawl data on more than 4,000 Taobao products and found these patterns!

Original | 2018-03-07 16:07:58 | 3758 views

This article walks through the whole process of crawling data for one Taobao product category with Python, then mining and analyzing that data to draw conclusions.

Project content

Product category chosen for this case: sofas.

Quantity: 100 pages, 4400 products in total.

Filter conditions: Tmall listings, sorted by sales volume from high to low, priced above 500 yuan.

Project Purpose

Text analysis of product titles, with word cloud visualization

Total sales for each title keyword

Price distribution of products

Sales distribution of products

Average sales of products in different price ranges

Impact of product price on sales volume

Impact of product price on sales revenue

Number of products in different provinces or cities

Average sales of products in different provinces

Note: The analyses above serve only as basic examples for this project.

Project steps

Data collection: Python crawls Taobao product data

Clean and process the data

Text analysis: jieba word segmentation, wordcloud visualization

Bar chart visualization: barh

Histogram visualization: hist

Data scatter plot visualization: scatter

Data regression analysis visualization: regplot

Tools & Modules

Tools: Spyder, the code editor bundled with Anaconda, is used in this case.

Modules: requests, retrying, missingno, jieba, matplotlib, wordcloud, imread, seaborn, etc.

Crawling data

Because Taobao uses anti-crawler measures, multi-threading and modified request headers still cannot guarantee that every page is fetched successfully on every run. I therefore added a retry loop that re-crawls the failed pages on each pass until all pages have been crawled successfully.

Note: The Taobao product page returns its data as JSON, and regular expressions are used to parse it here.

The code is as follows:
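
What follows is a minimal sketch of the retry loop described above, not the original listing. It assumes a caller-supplied fetch function (hypothetical) that returns the page text or raises an exception when the request is blocked. The max_rounds cap and the exact regular expression are also assumptions; the four field names are the columns used later in this article.

```python
import re

# Field names taken from the columns analysed later in the article.
FIELDS = ("raw_title", "view_price", "item_loc", "view_sales")


def parse_items(page_text):
    """Extract the four fields from the JSON text with regular expressions."""
    # Assumption: each field appears as "name":"value" in the page text.
    columns = [re.findall(r'"%s":"(.*?)"' % name, page_text) for name in FIELDS]
    return [dict(zip(FIELDS, row)) for row in zip(*columns)]


def crawl_all(pages, fetch, max_rounds=10):
    """Fetch every page, re-queuing failed pages until none remain.

    `fetch` is a hypothetical caller-supplied function: it returns the page
    text, or raises an exception when the request fails.
    `max_rounds` is an assumed safety cap so the loop cannot run forever.
    """
    results = {}
    pending = list(pages)
    for _ in range(max_rounds):
        if not pending:
            break
        failed = []
        for page in pending:
            try:
                results[page] = parse_items(fetch(page))
            except Exception:
                failed.append(page)  # retry this page in the next round
        pending = failed
    return results, pending
```

A real fetch function would typically wrap requests.get with custom headers; the second return value lists any pages that still failed after the last round.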

Data cleaning and processing

The cleaning and processing steps can also be done in Excel, with the cleaned data then read back in.

The code is as follows:

Note: For the needs of this case, only four columns are used: item_loc, raw_title, view_price, and view_sales. They cover region, title, price, and sales volume respectively.

The code is as follows:
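
What follows is a minimal sketch of this step with pandas, not the original listing. The sample rows, the extra nid column, the "人付款" suffix on view_sales, and the derived price, sales, and province columns are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical crawl result; real data would come from the crawler output.
raw = pd.DataFrame({
    "nid": ["1001", "1002", "1003"],
    "item_loc": ["广东 佛山", "上海", "江苏 苏州"],
    "raw_title": ["简约布艺沙发组合", "北欧小户型沙发", "真皮沙发大户型"],
    "view_price": ["1299.00", "3580.00", "8800.00"],
    "view_sales": ["120人付款", "8人付款", "35人付款"],
})

# Keep only the four columns used in the analysis.
data = raw[["item_loc", "raw_title", "view_price", "view_sales"]].copy()

# Numeric price; view_price is assumed to be a plain decimal string.
data["price"] = data["view_price"].astype(float)

# Sales count; view_sales is assumed to be digits followed by a text suffix.
data["sales"] = data["view_sales"].str.extract(r"(\d+)", expand=False).astype(int)

# Province is assumed to be the first whitespace-separated token of item_loc.
data["province"] = data["item_loc"].str.split().str[0]
```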

Data Mining and Analysis

Text analysis of the titles in the raw_title column

Use the jieba ("stutter") word segmentation tool to split each title into words. Install the module with pip install jieba.

Filter the string elements of each list in title_s (a list of lists) to remove unneeded words, that is, every word that appears in the stopwords list.

Because the next step counts how often each word occurs, each list in the filtered data title_clean is deduplicated first, so that every title contributes each word at most once.
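
The filtering, deduplication, and counting steps can be sketched with the standard library alone. The sample words are invented, title_clean_dist is a hypothetical name for the deduplicated lists, and word_count is built here as a Counter rather than the table used in the article.

```python
from collections import Counter

# Hypothetical segmented titles (a list of lists of str) and stop words.
title_s = [
    ["simple", "fabric", "sofa", "sofa", "free", "shipping"],
    ["nordic", "fabric", "sofa", "free", "shipping"],
]
stopwords = {"free", "shipping"}

# Step 1: drop every word that appears in the stopwords collection.
title_clean = [[w for w in words if w not in stopwords] for words in title_s]

# Step 2: deduplicate within each title, keeping first-seen order.
title_clean_dist = [list(dict.fromkeys(words)) for words in title_clean]

# Step 3: count in how many titles each word occurs.
word_count = Counter(w for words in title_clean_dist for w in words)
```

Because of step 2, a word repeated inside one title is counted once for that title, so word_count gives the number of titles containing each word.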

Inspecting the words in the word_count table shows that jieba's default dictionary is not sufficient here: some words (such as "removable" and "non-removable") are split apart. New words are therefore added to the dictionary as needed. Alternatively, add or delete entries directly in the dictionary file dict.txt and then load the modified dict.txt.

Word cloud visualization requires the wordcloud module.

There are two ways to install it:

Directly from the package index: pip install wordcloud

From a downloaded package file: pip install followed by the package file name

Note: Place the downloaded package file in the Python installation path.

The code is as follows:

Analytical conclusion:

Products sold as combinations or complete sets account for a high proportion.

Looking at the sofa material: fabric sofas account for a high proportion, more than leather sofas.

Looking at sofa styles: simple style is the most popular, followed by Nordic style, and the other styles rank in this order: American, Chinese, Japanese, French, etc.

Looking at apartment sizes: small apartments account for the highest proportion, followed by titles that target both large and small apartments, with large apartments the least.

Statistical analysis of total sales for different keywords

Explanation: Take the word "simple" as an example. The sales of every product whose title contains "simple" are summed, giving the total sales of "simple"-style products.

The code is as follows:
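
What follows is a minimal sketch of the per-keyword sum, not the original listing. The table name df_word_sum and its columns word and w_s_sum come from the article; the sample titles, sales figures, and keyword list are invented.

```python
import pandas as pd

# Hypothetical data: segmented, deduplicated titles and the sales of each product.
title_clean_dist = [
    ["simple", "fabric", "sofa"],
    ["nordic", "fabric", "sofa"],
    ["simple", "leather", "sofa"],
]
sales = [120, 80, 30]
words = ["simple", "nordic", "fabric", "leather"]

# For each keyword, add up the sales of every product whose title contains it.
w_s_sum = []
for word in words:
    total = sum(s for title, s in zip(title_clean_dist, sales) if word in title)
    w_s_sum.append(total)

df_word_sum = pd.DataFrame({"word": words, "w_s_sum": w_s_sum})
df_word_sum = df_word_sum.sort_values("w_s_sum", ascending=False).reset_index(drop=True)
```

A product whose title contains several keywords contributes its sales to each of them, so the per-keyword totals overlap and do not add up to overall sales.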

Visualize the word and w_s_sum columns of the table df_word_sum. (This example plots the 30 words with the highest total sales.)
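
A horizontal bar chart of this kind might be drawn as in the sketch below, which assumes matplotlib and uses an invented four-row df_word_sum. The figure size, labels, and off-screen Agg backend are illustrative choices, not the author's settings.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical keyword totals; the real table would hold many more rows.
df_word_sum = pd.DataFrame({
    "word": ["fabric", "simple", "nordic", "leather"],
    "w_s_sum": [200, 150, 80, 30],
})

# Keep the 30 largest totals, then reverse so the largest bar is drawn on top.
top = df_word_sum.sort_values("w_s_sum", ascending=False).head(30).iloc[::-1]

fig, ax = plt.subplots(figsize=(8, 6))
bars = ax.barh(top["word"].tolist(), top["w_s_sum"].tolist())
ax.set_xlabel("total sales")
ax.set_ylabel("word")
ax.set_title("Total sales by title keyword")
fig.tight_layout()
```

Calling fig.savefig with a file name writes the chart to disk. Chinese labels usually need a font with Chinese glyphs configured in matplotlib.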

It can be seen from the chart:

Combination products have the highest sales volume.

From a category perspective: fabric sofa sales are very high, far exceeding leather sofas.

Looking at apartment sizes: sofas for small apartments sell the most, followed by sofas for both large and small apartments, and sofas for large apartments sell the least.

In terms of style: simple style has the highest sales volume, followed by Nordic style, followed by Chinese style, American style, Japanese style, etc.

Removable and washable and corner sofas have considerable sales volume and are also very popular among consumers.

Analysis of price distribution of commodities

The analysis shows that some prices are extremely large. To make the chart easier to read, and given the price range of our own products, only products priced below 20,000 yuan are selected here.

The code is as follows:
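
What follows is a minimal sketch of the filter and histogram, not the original listing. It assumes matplotlib and a numeric price column; the sample prices, the name data_p, and the choice of 15 bins are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices in yuan, including one extreme value.
data = pd.DataFrame({"price": [599.0, 899.0, 1299.0, 1450.0, 2800.0, 9800.0, 45000.0]})

# Keep only products priced below 20,000 yuan.
data_p = data[data["price"] < 20000]

fig, ax = plt.subplots(figsize=(8, 5))
# hist returns the count per bin, the bin edges, and the drawn patches.
counts, edges, _ = ax.hist(data_p["price"], bins=15)
ax.set_xlabel("price (yuan)")
ax.set_ylabel("number of products")
ax.set_title("Price distribution of products")
fig.tight_layout()
```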

It can be seen from the chart:

Overall, the number of products falls in steps as the price rises: the higher the price, the fewer products are on sale.

Low-priced products dominate. The 500-1500 range has the most products, followed by 1500-3000, and few products are priced above 10,000.

Above 10,000 yuan, the number of products on sale differs little from one price range to the next.

Sales distribution analysis of goods

Similarly, to make the chart easier to read, only products with a sales volume greater than 100 are selected here.

The code is as follows:
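
What follows is a minimal sketch of the threshold filter and the share it keeps, not the original listing. The sample sales figures, the name data_s, and the 100-wide bins are assumptions.

```python
import pandas as pd

# Hypothetical sales volumes for ten products.
data = pd.DataFrame({"sales": [3, 0, 150, 12, 260, 45, 7, 120, 1, 30]})

# Keep only products that sold more than 100 units.
data_s = data[data["sales"] > 100]

# Share of all products that pass the filter.
share = len(data_s) / len(data)

# Count the remaining products in 100-wide sales ranges (assumed bin edges).
bins = [100, 200, 300, 400, 500]
counts = pd.cut(data_s["sales"], bins=bins).value_counts().sort_index()
```

Passing data_s["sales"] to a histogram call then draws the distribution of the filtered products.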

It can be seen from the chart and data:

Only 3.4% of the products have a sales volume of more than 100. Among them, products with sales of 100-200 are the most numerous, followed by those with sales of 200-300.

For sales between 100 and 500, the number of products drops steeply as sales rise, so low-selling products dominate.

There are very few products with sales of more than 500.

The average sales volume distribution of goods in different price ranges

The code is as follows:
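
What follows is a minimal sketch of averaging sales per price range, not the original listing. It assumes the ranges are quantile-based, which pd.qcut produces; the uneven range boundaries quoted in this section suggest that but do not confirm it. The sample data, the number of bins, and the price_bin column name are invented.

```python
import pandas as pd

# Hypothetical prices (yuan) and sales volumes.
data = pd.DataFrame({
    "price": [600, 800, 1000, 1200, 1500, 2000, 5000, 9000],
    "sales": [10, 20, 50, 70, 30, 20, 5, 1],
})

# Split products into 4 price ranges holding equal numbers of products.
data["price_bin"] = pd.qcut(data["price"], q=4)

# Average sales volume inside each price range, ordered from low to high price.
avg_sales = data.groupby("price_bin", observed=True)["sales"].mean()
```

With equal-width ranges instead, pd.cut would take the place of pd.qcut.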

It can be seen from the chart:

The average sales volume of products priced between 1331 and 1680 yuan is the highest, followed by those priced between 951 and 1331 yuan, and products priced above 9684 yuan have the lowest.

The overall trend rises first and then falls, with the peak in a relatively low price range.

This suggests that consumer demand for sofas is concentrated in the lower price ranges. Above 1,680 yuan, the higher the price, the lower the average sales volume.

Analysis of the impact of commodity prices on sales volume

As above, to make the chart easier to read, and given the price range of our own products, only products priced below 20,000 yuan are selected here.

The code is as follows:

It can be seen from the chart:

Overall trend: as the price of a product rises, its sales volume falls, so price has a strong effect on sales volume.

A few products priced between 500 and 2500 have very high sales. Most products priced between 2500 and 5000 sell poorly, with a few selling relatively well. Products priced above 5000 all have very low sales, and none stands out.

Analysis of the impact of commodity prices on sales revenue

The code is as follows:
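
What follows is a minimal sketch of the straight-line fit behind a regression plot, not the original listing. It uses numpy.polyfit in place of seaborn's regplot, and it assumes the quantity on the vertical axis is sales revenue, computed as price times sales volume. The revenue column name and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices (yuan) and sales volumes.
data = pd.DataFrame({
    "price": [600.0, 1200.0, 2500.0, 5000.0, 9000.0],
    "sales": [300, 200, 90, 50, 40],
})

# Assumed definition: revenue = price * sales volume.
data["revenue"] = data["price"] * data["sales"]

# Least-squares straight line: revenue is approximated by slope * price + intercept.
slope, intercept = np.polyfit(data["price"], data["revenue"], deg=1)

# Fitted revenue at each observed price, e.g. for drawing the line.
fitted = slope * data["price"] + intercept
```

A positive slope corresponds to a fitted line that rises with price. seaborn's regplot draws the scatter points, a line of this kind, and a confidence band in a single call.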

It can be seen from the chart:

Overall trend: the linear regression line shows that sales revenue rises as price increases.

Most products have both a low price and low sales revenue.

Among products priced from 0 to 20,000, only a few have high sales revenue. Among those priced from 20,000 to 60,000, only 3 have high revenue. One product priced between 60,000 and 100,000 has very high revenue, and it is the maximum.

The distribution of commodity quantity in different provinces

The code is as follows:
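
What follows is a minimal sketch of counting products per province, not the original listing. It assumes item_loc holds a province followed by an optional city, separated by a space; the sample rows and the province column name are invented.

```python
import pandas as pd

# Hypothetical locations in the assumed "province city" layout.
data = pd.DataFrame({
    "item_loc": ["广东 佛山", "广东 深圳", "上海", "江苏 苏州", "广东 佛山"],
})

# Province is assumed to be the first whitespace-separated token.
data["province"] = data["item_loc"].str.split().str[0]

# Number of products per province, largest first.
province_counts = data["province"].value_counts()
```

The resulting counts can be passed to a bar chart call to reproduce a chart of product quantity by province.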

It can be seen from the chart:

Guangdong has the most products, followed by Shanghai, with Jiangsu third. Guangdong in particular far exceeds Jiangsu, Zhejiang, Shanghai, and other regions, which shows that Guangdong stores dominate the sofa sub-category.

Jiangsu, Zhejiang, and Shanghai have roughly the same number of products.

Average sales distribution of goods in different provinces

The code is as follows:
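
What follows is a minimal sketch of the per-province average, not the original listing. It assumes a province column and a numeric sales column already exist; the sample rows are invented, and the map rendering itself is not covered.

```python
import pandas as pd

# Hypothetical provinces and sales volumes.
data = pd.DataFrame({
    "province": ["Guangdong", "Guangdong", "Shanghai", "Jiangsu", "Jiangsu"],
    "sales": [100, 50, 90, 30, 10],
})

# Mean sales volume per province, highest first.
avg_by_province = (
    data.groupby("province")["sales"].mean().sort_values(ascending=False)
)
```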

Heat map