本文轉載自微信公眾號「尤而小屋」,作者尤而小屋 。轉載本文請聯絡尤而小屋公眾號。
大家好,我是Peter~
最近取得了一份IC電子產品電商數據,後面會進行3個主題的數據分析與挖掘:
本文是第一個階段,主要內容包含:
import pandas as pd import numpy as np import time import os from datetime import datetime import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline #设置中文编码和负号的正常显示 plt.rcParams['font.sans-serif']=['SimHei'] plt.rcParams['axes.unicode_minus']=False import plotly_express as px import plotly.graph_objects as go import missingno as ms from sklearn.cluster import KMeans from sklearn.preprocessing import MinMaxScaler資料基本資訊讀取資料
df = pd.read_csv( "ic_sale.csv", encoding="utf-8",# 指定编码 cnotallow={"order_id":str,"product_id":str,"category_id":str,"user_id":str} # 指定字段类型 ) df.head()
# 1、数据shape df.shapeOut[3]:
(564169, 11)In [4]:
# 2、数据字段类型 df.dtypesOut[4]:
event_timeobject order_idobject product_idobject category_id object category_code object brand object pricefloat64 user_id object ageint64 sex object local object dtype: objectIn [5]:描述統計資訊是針對數值類型的欄位:
# 3、数据描述统计信息 df.describe()Out[5]:
price | #age | |
count | 564169.000000 | #564169.000000 |
mean | 208.269324 | #33.184388 |
std### | 304.559875 |
#10.122088 |
0.000000 |
||
25% |
# 23.130000 |
|
##50% |
#87.940000 |
|
#75% |
277.750000 |
In [6]:
# 4、总共多少个不同客户 df["user_id"].nunique()
Out[6]:
6908
In [7]:
# 5、总共多少个不同品牌 df["brand"].nunique()
Out[7]:
868
In [8]:
# 6、总共多少个订单 df["order_id"].nunique()
Out[8]:
234232
In [9]:
# 7、总共多少个产品 df["product_id"].nunique()
Out[9]:
3756
从描述统计信息中发现price字段的最小值是0,应该是没有成交的数据;我们选择price大于0的信息:
In [10]:
df = df[df["price"] > 0]
In [11]:
df.isnull().sum()
Out[11]:
event_time0 order_id0 product_id0 category_id 0 category_code129344 brand 27215 price 0 user_id 0 age 0 sex 0 local 0 dtype: int64
可以看到缺失值体现在字段:
In [12]:
ms.bar(df,color="blue")# 缺失值可视化 plt.show()
In [13]:
df.fillna("missing",inplace=True)
In [14]:
df.isnull().sum()# 填充之后无缺失值
Out[14]:
event_time 0 order_id 0 product_id 0 category_id0 category_code0 brand0 price0 user_id0 age0 sex0 local0 dtype: int64
读进来的数据中时间字段是object类型,需要将其转成时间格式的类型
In [15]:
df["event_time"][:5] # 处理前
Out[15]:
02020-04-24 11:50:39 UTC 12020-04-24 11:50:39 UTC 22020-04-24 14:37:43 UTC 32020-04-24 14:37:43 UTC 42020-04-24 19:16:21 UTC Name: event_time, dtype: object
In [16]:
# 去掉最后的UTC df["event_time"] = df["event_time"].apply(lambda x: x[:19])
In [17]:
# 时间数据类型转化:字符类型---->指定时间格式 df['event_time'] = pd.to_datetime(df['event_time'], format="%Y-%m-%d %H:%M:%S")
In [18]:
# 提取多个时间相关字段 df['month']=df['event_time'].dt.month df['day'] = df['event_time'].dt.day df['dayofweek']=df['event_time'].dt.dayofweek df['hour']=df['event_time'].dt.hour
In [19]:
df["event_time"][:5] # 处理后
Out[19]:
0 2020-04-24 11:50:39 1 2020-04-24 11:50:39 2 2020-04-24 14:37:43 3 2020-04-24 14:37:43 4 2020-04-24 19:16:21 Name: event_time, dtype: datetime64[ns]
可以看到字段类型已经发生了变化
In [20]:
amount_by_month = df.groupby("month")["price"].sum().reset_index() amount_by_month
Out[20]:
month |
price |
|
0 |
1 |
1953358.17 |
1 |
2 |
2267809.88 |
2 |
3 |
2897486.26 |
3 |
4 |
1704422.41 |
#4 |
#5 |
#7768637.79 |
#5 |
6 |
#7691244.33 |
6 | ||
7 | 16354029.27############7#### ########8############27982605.44##################8########## # ###9### | 17152310.57 |
#9 |
##10 | #19765680.76 |
#10 | 11 | 11961511.52 |
In [21]:
fig = px.scatter(amount_by_month,x="month",y="price",size="price",color="price") fig.update_layout(height=500, width=1000, title_text="每月成交金额") fig.show()
可以看到:
In [22]:
order_by_month = df.groupby("month")["order_id"].nunique().reset_index() order_by_month
Out[22]:
month |
order_id |
|||
0 |
1 |
10353 |
||
1 |
2 |
11461 |
||
2 |
3 |
12080 |
||
3 |
4 |
9001 |
||
4 |
5 |
#30460 |
||
5 |
6 |
#28978 |
||
7 | #57659 | |||
## 7 | ###8############73897###################8## ## ########9############345####################9##### ##### ###10### | ##14 | ||
10 | 11 | 6 |
month |
nunique |
count |
|
0 |
1 |
1388 |
15575 |
1 |
2 |
1508 |
17990 |
2 |
3 |
1597 |
##18687 |
#3 | 4 | 1525 | 11867 |
#4 | 5 | 3168 | 40332 |
5 | #6 | 3966 | #41355 |
7 | 5159 | 76415 | |
##7 | 8 | #6213 | 100006 |
8 | #9 | ##549770496 | |
10 | 10 |
| 4597|
104075 |
10 | 11 | 3134 |
In [25]:
fig = px.line(people_by_month,x="month",y="nunique") fig.update_layout(height=500, width=1000, title_text="每月成交人数") fig.show()
fig = px.line(people_by_month,x="month",y="count") fig.update_layout(height=500, width=1000, title_text="每月成交人次") fig.show()
In [27]:
amount_by_month# 每月成交金额
Out[27]:
month |
price |
|
0 |
1 |
1953358.17 |
1 |
2 |
2267809.88 |
2 |
3 |
2897486.26 |
3 |
4 |
1704422.41 |
#4 |
#5 |
#7768637.79 |
#5 |
6 |
#7691244.33 |
6 | ||
7 | 16354029.27############7#### ########8############27982605.44##################8########## # ###9### | 17152310.57 |
#9 |
##10 | #19765680.76 |
#10 | 11 | 11961511.52 |
In [28]:
order_by_month# 每月订单数
Out[28]:
month |
order_id |
|||
0 |
1 |
10353 |
||
1 |
2 |
11461 |
||
2 |
3 |
12080 |
||
3 |
4 |
9001 |
||
4 |
5 |
#30460 |
||
5 |
6 |
#28978 |
||
7 | #57659 | |||
## 7 | ###8############73897###################8## ## ########9############345####################9##### ##### ###10### | ##14 | ||
10 | 11 | 6 |
month |
price |
order_id |
||
0 |
1 |
1953358.17 |
10353 |
|
1 |
2 |
2267809.88 |
11461 |
|
2 |
3 |
2897486.26 |
12080 |
|
3 |
##4 | 1704422.41 | 9001 | |
4 | #5 | 7768637.79 | ##30460
|
|
6 | #7691244.33 | # #28978 | ||
#7 | ##16354029.27 | ##57659 |
#7 | |
##8 | ##27982605.44 | #73897 | ||
8 | #9 | 17152310.57 | #345 | |
#9 | 10 | 19765680.76 | 14 | |
#10 | 11 | #11961511.52 | #6
|
order_id |
product_id |
|
234208 |
2388440981134640000 |
15021 |
234210 |
2388440981134660000 |
14891 |
234211 |
2388440981134670000 |
14845 |
234212 |
##2388440981134680000 | #14765 |
#234202 | #2388440981134580000 | #14587 |
234205 |
##234205 | |
## 2388440981134610000 |
#14571 |
|
234207 | ##238844098113463000 | 14443 |
#234204 |
2388440981134600000 |
################# ## ##14416##################234206############2388440981134620000############14414 ## # |
234203 |
#2388440981134590000 |
##14194 |
In [33]:
fig = px.bar(product_by_order[:20], x="order_id", y="product_id", text="product_id" ) fig.show()
一个订单下包含的产品数量是不同;上万的订单可能是小型的ic元器件产品。
不同省份下的订单量、用户量和成交金额对比
In [34]:
local = df.groupby("local").agg({"order_id":"nunique","user_id":"nunique","price":sum}).reset_index() local.head()
Out[34]:
local |
order_id |
user_id |
price |
|
0 |
上海 |
39354 |
5680 |
19837942.20 |
1 |
北京 |
38118 |
5702 |
19137748.75 |
2 |
四川 |
##13396 | ##35896770891.28 | |
天津 | 13058 | 3497 | 6433736.85 | |
##4 | 廣東 | #51471 | 6085 | 26013770.86 |
In [35]:
df1 = local.sort_values("order_id",ascending=True)# 订单量升序 df1
Out[35]:
local |
order_id |
user_id |
price |
|
6 |
浙江 |
12790 |
3485 |
6522657.59 |
8 |
湖北 |
12810 |
3488 |
5993820.57 |
3 |
天津 |
13058 |
3497 |
6433736.85 |
##10 |
重慶 |
#13058 |
3496 |
#6479488.14 |
#7 |
海南 |
|||
# #13076 | 3587 | #6968674.41 | ##########2### #########四川#############13396#############3589### | 6770891.28 |
#5 |
Pf |
|||
#13575 | 3598 | #6357286.87 | #9 |
|
13879 | 3481 | #6983078.88 | 1 |
|
###38118### #########5702############19137748.75##################0### | 上海 |
39354 |
5680 |
19837942.20 |
4 |
廣東 |
51471 |
##6085 | 26013770.86 |
dayofweek |
order_id |
|
0 |
0 |
35690 |
1 |
1 |
34256 |
2 |
2 |
#31249 |
3 |
#3 |
#31555 |
4 | 4 | 33010 |
5 | 5 | #34772 |
6 | 6 | 33922 |
hour |
order_id |
|
0 |
0 |
2865 |
1 |
1 |
2711 |
2 |
2 |
3981 |
3 |
3 |
6968 |
4 |
##4 | 12176 |
5 | 5 | 16411 |
#6 | #618667 | |
7 |
| #20034|
8 | 20261 | |
#9 | ##20507 |
以上是基於機器學習的電商資料探勘 | 資料探索篇的詳細內容。更多資訊請關注PHP中文網其他相關文章!