Home  >  Article  >  Backend Development  >  Implementation method of cdn log analysis using pandas library

Implementation method of cdn log analysis using pandas library

Y2J
Y2JOriginal
2017-05-16 13:20:431847browse

This article mainly introduces the relevant information about using the pandas library in Python for cdn log analysis. The article shares the complete sample code of pandas for cdn log analysis, and then introduces the relevant content about the pandas library in detail. Friends who need it You can use it as a reference, let’s take a look below.

Preface

I recently encountered a need at work, which is to filter some data based on CDN logs, such as traffic and status code statistics. TOP IP, URL, UA, Referer, etc. In the past, the bash shell was used to implement this. However, when the log volume is large, the number of log files is gigabytes, and the number of lines reaches tens of billions, processing through the shell is not enough and the processing time is too long. So I studied the use of Python pandas, a data processing library. Ten million lines of logs are processed in about 40 seconds.

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
author = 'Loya Chen'
import sys
import pandas as pd
from collections import OrderedDict
"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
日志格式
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
日志示例
 [0] [1][2]  [3]  [4]   [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -" 
[6] [7] [8]    [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
if len(sys.argv) != 2:
 print('Usage:', sys.argv[0], 'file_of_log')
 exit() 
else:
 log_file = sys.argv[1] 
# 需统计字段对应的日志位置 
ip  = 0
url  = 5
status_code = 6
size = 7
referer = 8
ua  = 9
# 将日志读入DataFrame
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
 try:
 chunk = reader.get_chunk(chunkSize)
 chunks.append(chunk)
 except StopIteration:
 #Iteration is stopped.
 loop = False
df = pd.concat(chunks, ignore_index=True)
byte_sum = df[size].sum()        #流量统计
top_status_code = pd.DataFrame(df[6].value_counts())      #状态码统计
top_ip  = df[ip].value_counts().head(10)      #TOP IP
top_referer = df[referer].value_counts().head(10)      #TOP Referer
top_ua  = df[ua].value_counts().head(10)      #TOP User-Agent
top_status_code['persent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100)
top_url  = df[url].value_counts().head(10)      #TOP URL
top_url_byte = df[[url,size]].groupby(url).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最大的URL
top_ip_byte = df[[ip,size]].groupby(ip).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) #请求流量最多的IP
# 将结果有序存入字典
result = OrderedDict([("流量总计[单位:GB]:"   , byte_sum/1024/1024/1024),
   ("状态码统计[次数|百分比]:"  , top_status_code),
   ("IP TOP 10:"    , top_ip),
   ("Referer TOP 10:"   , top_referer),
   ("UA TOP 10:"    , top_ua),
   ("URL TOP 10:"   , top_url),
   ("请求流量最大的URL TOP 10[单位:MB]:" , top_url_byte), 
   ("请求流量最大的IP TOP 10[单位:MB]:" , top_ip_byte)
])
# 输出结果
for k,v in result.items():
 print(k)
 print(v)
 print('='*80)

pandas study notes

There are two basic data structures in Pandas, Series and Dataframe. A Series is an object similar to a one-dimensional array, consisting of a set of data and an index . Dataframe is a table type data structure with both row and column indexes.

from pandas import Series, DataFrame
import pandas as pd

Series

In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]: 
0 4
1 7
2 -5
3 3

The string representation of Series is: index on the left, value on the right. When no index is specified, an integer type index from 0 to N-1 (N is the length of the data) will be automatically created. The array representation and index object can be obtained through the values ​​and index properties of the Series:

In [3]: obj.values
Out[3]: array([ 4, 7, -5, 3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)

Usually the index is specified when creating the Series:

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]: 
d 4
b 7
a -5
c 3

Obtain the Series through the index A single or a set of values:

In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]: 
c 3
d 4

Sort

In [9]: obj2.sort_index()
Out[9]: 
a -5
b 7
c 3
d 4
In [10]: obj2.sort_values()
Out[10]: 
a -5
c 3
d 4
b 7

Filter operation

In [11]: obj2[obj2 > 0]
Out[11]: 
d 4
b 7
c 3
In [12]: obj2 * 2
Out[12]: 
d 8
b 14
a -10
c 6

Member

In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False

Create Series from dictionary

In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]: 
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000

If only one dictionary is passed in, the index in the resulting Series is the key of the original dictionary (ordered arrangement)

In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]: 
Beijing 40000.0
Hangzhou 30000.0
Shanghai 35000.0
Suzhou  NaN

When index is specified, the three values ​​in sdata that match the states index will be found. And put it in the response position, but because the sdata value corresponding to 'Suzhou' cannot be found, the result is NaN (not a number), which is used in pandas to represent missing or NA values

pandas isnull and notnull

Function can be used to detect missing data:

In [21]: pd.isnull(obj4)
Out[21]: 
Beijing False
Hangzhou False
Shanghai False
Suzhou True
In [22]: pd.notnull(obj4)
Out[22]: 
Beijing True
Hangzhou True
Shanghai True
Suzhou False

Series also has similar instance methods

In [23]: obj4.isnull()
Out[23]: 
Beijing False
Hangzhou False
Shanghai False
Suzhou True

An important function of Series is to automatically Align data with different indexes

In [24]: obj3
Out[24]: 
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000
In [25]: obj4
Out[25]: 
Beijing 40000.0
Hangzhou 30000.0
Shanghai 35000.0
Suzhou  NaN
In [26]: obj3 + obj4
Out[26]: 
Beijing 80000.0
Hangzhou 60000.0
Nanjing  NaN
Shanghai 70000.0
Suzhou  NaN

The index of Series can be modified in-place by copying

In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [28]: obj
Out[28]: 
Bob 4
Steve 7
Jeff -5
Ryan 3

DataFrame

pandas

Read the file

In [29]: df = pd.read_table('pandas_test.txt',sep=' ', names=['name', 'age'])
In [30]: df
Out[30]: 
 name age
0 Bob 26
1 Loya 22
2 Denny 20
3 Mars 25

DataFrame column selection

df[name]
In [31]: df['name']
Out[31]: 
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object

DataFrame row selection

df.iloc[0,:] #第一个参数是第几行,第二个参数是列。这里指第0行全部列
df.iloc[:,0] #全部行,第0列
In [32]: df.iloc[0,:]
Out[32]: 
name Bob
age 26
Name: 0, dtype: object
In [33]: df.iloc[:,0]
Out[33]: 
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object

To obtain an element, you can use iloc, the faster way is iat

In [34]: df.iloc[1,1]
Out[34]: 22
In [35]: df.iat[1,1]
Out[35]: 22

DataFrame block selection

In [36]: df.loc[1:2,['name','age']]
Out[36]: 
 name age
1 Loya 22
2 Denny 20

Filter rows based on conditions

Add judgment conditions in square brackets to filter rows. The conditions must return True or False

In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]: 
 name age city
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
In [38]: df[df[&#39;age&#39;] > 22]
Out[38]: 
 name age city
0 Bob 26 Beijing
3 Mars 25 Nanjing

Add columns

In [39]: df[&#39;city&#39;] = [&#39;Beijing&#39;, &#39;Shanghai&#39;, &#39;Hangzhou&#39;, &#39;Nanjing&#39;]
In [40]: df
Out[40]: 
 name age city
0 Bob 26 Beijing
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing

Sort


Sort by specified column

In [41]: df.sort_values(by=&#39;age&#39;)
Out[41]: 
 name age city
2 Denny 20 Hangzhou
1 Loya 22 Shanghai
3 Mars 25 Nanjing
0 Bob 26 Beijing
# 引入numpy 构建 DataFrame
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=[&#39;three&#39;, &#39;one&#39;], columns=[&#39;d&#39;, &#39;a&#39;, &#39;b&#39;, &#39;c&#39;])
In [43]: df
Out[43]: 
 d a b c
three 0 1 2 3
one 4 5 6 7
# 以索引排序
In [44]: df.sort_index()
Out[44]: 
 d a b c
one 4 5 6 7
three 0 1 2 3
In [45]: df.sort_index(axis=1)
Out[45]: 
 a b c d
three 1 2 3 0
one 5 6 7 4
# 降序
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]: 
 d c b a
three 0 3 2 1
one 4 7 6 5

View

# 查看表头5行 
df.head(5)
# 查看表末5行
df.tail(5) 
# 查看列的名字
In [47]: df.columns
Out[47]: Index([&#39;name&#39;, &#39;age&#39;, &#39;city&#39;], dtype=&#39;object&#39;)
# 查看表格当前的值
In [48]: df.values
Out[48]: 
array([[&#39;Bob&#39;, 26, &#39;Beijing&#39;],
 [&#39;Loya&#39;, 22, &#39;Shanghai&#39;],
 [&#39;Denny&#39;, 20, &#39;Hangzhou&#39;],
 [&#39;Mars&#39;, 25, &#39;Nanjing&#39;]], dtype=object)

Transpose

df.T
Out[49]: 
  0  1  2 3
name Bob Loya Denny Mars
age 26 22 20 25
city Beijing Shanghai Hangzhou Nanjing

Use isin

In [50]: df2 = df.copy()
In [51]: df2[df2[&#39;city&#39;].isin([&#39;Shanghai&#39;,&#39;Nanjing&#39;])]
Out[52]: 
 name age city
1 Loya 22 Shanghai
3 Mars 25 Nanjing

Operation operation:

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], 
 ...:    index=[&#39;a&#39;, &#39;b&#39;, &#39;c&#39;, &#39;d&#39;], columns=[&#39;one&#39;, &#39;two&#39;])
In [54]: df
Out[54]: 
 one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
#按列求和
In [55]: df.sum()
Out[55]: 
one 9.25
two -5.80
# 按行求和
In [56]: df.sum(axis=1)
Out[56]: 
a 1.40
b 2.60
c NaN
d -0.55

group


group refers to the following steps:


    ##Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
  • See the Grouping section
In [57]: df = pd.DataFrame({&#39;A&#39; : [&#39;foo&#39;, &#39;bar&#39;, &#39;foo&#39;, &#39;bar&#39;,
 ....:    &#39;foo&#39;, &#39;bar&#39;, &#39;foo&#39;, &#39;foo&#39;],
 ....:   &#39;B&#39; : [&#39;one&#39;, &#39;one&#39;, &#39;two&#39;, &#39;three&#39;,
 ....:    &#39;two&#39;, &#39;two&#39;, &#39;one&#39;, &#39;three&#39;],
 ....:   &#39;C&#39; : np.random.randn(8),
 ....:   &#39;D&#39; : np.random.randn(8)})
 ....: 
In [58]: df
Out[58]: 
 A B  C  D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033

Group it, and then apply the sum function

In [59]: df.groupby(&#39;A&#39;).sum()
Out[59]: 
  C D
A   
bar -2.802588 2.42611
foo 3.146492 -0.63958
In [60]: df.groupby([&#39;A&#39;,&#39;B&#39;]).sum()
Out[60]: 
   C  D
A B   
bar one -1.814470 2.395985
 three -0.595447 0.166599
 two -0.392670 -0.136473
foo one -1.195665 -0.616981
 three 1.928123 -1.623033
 two 2.414034 1.600434

[Related recommendations]

1.

Special recommendation : "php Programmer Toolbox" V0.1 version download2.

Python free video tutorial

3.

Python Basics Introduction Tutorial

The above is the detailed content of Implementation method of cdn log analysis using pandas library. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn