
Getting started with Python data processing library pandas

Apr 19, 2018

pandas is a Python software package and a very commonly used basic library for machine learning programming in Python. This article is an introductory tutorial on the Python data processing library pandas; friends who are interested should take a look.


pandas provides fast, flexible and expressive data structures designed to make working with "relational" or "labeled" data simple and intuitive. It is intended to be a high-level building block for practical data analysis in Python.

Getting Started

pandas is suitable for many different types of data, including:

  • Tabular data with heterogeneously-typed columns, such as SQL tables or Excel spreadsheets

  • Ordered and unordered (not necessarily fixed-frequency) time series data

  • Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels

  • Any other form of observational/statistical data sets


Since pandas is a Python package, you first need a Python environment on your machine. Please search online for how to set this up.

For how to obtain pandas, please refer to the instructions on the official website: pandas Installation.

Normally, we can perform installation through pip:

sudo pip3 install pandas


or install pandas through conda:

conda install pandas


Currently (February 2018) the latest version of pandas is v0.22.0 (release time: December 29, 2017).

I have put the source code and test data for this article on Github: pandas_tutorial. Readers can get them there.

In addition, pandas is often used together with NumPy, and NumPy is also used in the source code in this article.

It is recommended that readers have some familiarity with NumPy before learning pandas. I have also written a basic tutorial on NumPy; see here: Python Machine Learning Library NumPy Tutorial

Core Data Structure

At the core of pandas are two data structures: Series and DataFrame.

The two data structures compare as follows:

  • Series: a one-dimensional labeled array

  • DataFrame: a two-dimensional labeled, table-like structure

A DataFrame can be regarded as a container of Series; that is, a DataFrame can contain several Series.
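
Jumping ahead a little, here is a minimal sketch (my own illustration, not from the original article) showing that a column pulled out of a DataFrame is itself a Series:

# additional illustration, not in the original source
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(type(df["a"]))   # <class 'pandas.core.series.Series'>
print(df["a"])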

Note: before version 0.20.0 there was also a three-dimensional data structure named Panel (this is also where the name pandas comes from: pan-da-s). However, that structure has been deprecated because it was rarely used.

Series

Since a Series is a one-dimensional structure, we can create one directly from an array, like this:

# data_structure.py
import pandas as pd
import numpy as np
series1 = pd.Series([1, 2, 3, 4])
print("series1:\n{}\n".format(series1))

The output of this code is as follows:

series1:
0 1
1 2
2 3
3 4
dtype: int64

The output is explained as follows:

  • The last line of the output shows the type of the data in the Series; here the data are all of type int64.

  • The data are printed in the second column; the first column is the index of the data, which is called the Index in pandas.


We can print out the data and index in the Series respectively:

# data_structure.py
print("series1.values: {}\n".format(series1.values))
print("series1.index: {}\n".format(series1.index))

The output of these two lines of code is as follows:

series1.values: [1 2 3 4]
series1.index: RangeIndex(start=0, stop=4, step=1)

If an index is not specified (as above), an index of the form [0, N-1] is created automatically. However, we can also specify the index when creating the Series. The index does not need to be an integer; it can be any type of data, such as a string. For example, here we use seven letters to map the seven musical notes. The purpose of the index is to look up the corresponding data, as in the following:

# data_structure.py
series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
 index=["C", "D", "E", "F", "G", "A", "B"])
print("series2:\n{}\n".format(series2))
print("E is {}\n".format(series2["E"]))

The output of this code is as follows:

series2:
C 1
D 2
E 3
F 4
G 5
A 6
B 7
dtype: int64
E is 3

DataFrame

Let's take a look at creating a DataFrame. We can build a 4x4 matrix through the NumPy interface and then use it to create a DataFrame, like this:

# data_structure.py
df1 = pd.DataFrame(np.arange(16).reshape(4,4))
print("df1:\n{}\n".format(df1))

The output of this code is as follows:

df1:
 0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

We can see from this output that the default index and column names are in the form [0, N-1].

We can specify column names and indexes when creating a DataFrame, like this:

# data_structure.py
df2 = pd.DataFrame(np.arange(16).reshape(4,4),
 columns=["column1", "column2", "column3", "column4"],
 index=["a", "b", "c", "d"])
print("df2:\n{}\n".format(df2))

The output of this code is as follows:

df2:
 column1 column2 column3 column4
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15

We can also directly specify column data to create a DataFrame:

# data_structure.py

df3 = pd.DataFrame({"note" : ["C", "D", "E", "F", "G", "A", "B"],
 "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})
print("df3:\n{}\n".format(df3))

The output of this code is as follows:

df3:
 note weekday
0 C Mon
1 D Tue
2 E Wed
3 F Thu
4 G Fri
5 A Sat
6 B Sun


Please note:

  • Different columns of a DataFrame can have different data types

  • If you create a DataFrame from an array of Series, each Series becomes a row rather than a column

For example:

# data_structure.py
noteSeries = pd.Series(["C", "D", "E", "F", "G", "A", "B"],
 index=[1, 2, 3, 4, 5, 6, 7])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
 index=[1, 2, 3, 4, 5, 6, 7])
df4 = pd.DataFrame([noteSeries, weekdaySeries])
print("df4:\n{}\n".format(df4))

The output of df4 is as follows:

df4:
 1 2 3 4 5 6 7
0 C D E F G A B
1 Mon Tue Wed Thu Fri Sat Sun

We can add or delete column data of a DataFrame in the following way:

# data_structure.py
df3["No."] = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("df3:\n{}\n".format(df3))
del df3["weekday"]
print("df3:\n{}\n".format(df3))


This code outputs the following:

df3:
 note weekday No.
0 C Mon 1
1 D Tue 2
2 E Wed 3
3 F Thu 4
4 G Fri 5
5 A Sat 6
6 B Sun 7
df3:
 note No.
0 C 1
1 D 2
2 E 3
3 F 4
4 G 5
5 A 6
6 B 7

Index Objects and Data Access

pandas' Index objects contain the metadata that describes an axis. When a Series or DataFrame is created, the array or sequence of labels is converted into an Index. The Index objects for the columns and rows of a DataFrame can be obtained as follows:

# data_structure.py
print("df3.columns\n{}\n".format(df3.columns))
print("df3.index\n{}\n".format(df3.index))

These two lines of code output the following:

df3.columns
Index(['note', 'No.'], dtype='object')
df3.index
RangeIndex(start=0, stop=7, step=1)


Please note:

  • An Index is not a set, so it can contain duplicate values

  • The values of an Index object are immutable, so data can be accessed through it safely (see the sketch after this list)
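
Below is a minimal sketch of that immutability (my own illustration, not part of the original source code): assigning to an element of an Index raises a TypeError, while replacing the whole index is allowed.

# additional illustration, not in the original source
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
try:
    s.index[0] = "x"           # Index elements cannot be modified in place
except TypeError as e:
    print("Cannot modify an Index: {}".format(e))

s.index = ["x", "y", "z"]      # replacing the whole index is fine
print(s)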


DataFrame provides the following two accessors for its data:

  • loc: access data through row and column labels (indexes)

  • iloc: access data through row and column positions (integer offsets)

For example:

# data_structure.py
print("Note C, D is:\n{}\n".format(df3.loc[[0, 1], "note"]))
print("Note C, D is:\n{}\n".format(df3.iloc[[0, 1], 0]))

The first line of code accesses the elements whose row label is 0 or 1 and whose column label is "note". The second line accesses the elements whose row position is 0 or 1 and whose column position is 0 (for df3 the row labels and row positions happen to be the same, so both are 0 and 1 here, but they mean different things).

These two lines of code output the following:

Note C, D is:
0 C
1 D
Name: note, dtype: object

Note C, D is:
0 C
1 D
Name: note, dtype: object
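
The difference is easier to see with df2 from earlier, whose row labels are letters, so labels and integer positions no longer coincide. A minimal sketch (my own addition, assuming df2 defined above is still in scope):

# additional illustration, not in the original source
print("df2.loc['a', 'column1']: {}\n".format(df2.loc["a", "column1"]))   # by label
print("df2.iloc[0, 0]: {}\n".format(df2.iloc[0, 0]))                     # by position
print("df2.loc[['a', 'c'], 'column2']:\n{}\n".format(df2.loc[["a", "c"], "column2"]))
print("df2.iloc[[0, 2], 1]:\n{}\n".format(df2.iloc[[0, 2], 1]))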

File Operations

The pandas library provides a series of read_ functions for reading files in various formats, listed below:

  • read_csv

  • read_table

  • read_fwf

  • read_clipboard

  • read_excel

  • read_hdf

  • read_html

  • read_json

  • read_msgpack

  • read_pickle

  • read_sas

  • read_sql

  • read_stata

  • read_feather


Reading Excel Files

Note: to read Excel files, you also need to install another library: xlrd

It can be installed through pip like this:

sudo pip3 install xlrd


After installation, you can view information about the library through pip:

$ pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: sjmachin@lexicon.net
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:


Next, let's look at a simple example of reading an Excel file:

# file_operation.py
import pandas as pd
import numpy as np
df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))


The content of this Excel file, as read and printed by the code above, is as follows:

df1:
 C Mon
0 D Tue
1 E Wed
2 F Thu
3 G Fri
4 A Sat
5 B Sun
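
Note that in the output above, the first row of the spreadsheet ('C', 'Mon') was taken as the column header. If the file actually has no header row, one way to keep that row as data is sketched below (my own addition; the column names "note" and "weekday" are my own choice, not from the original article):

# additional illustration, not in the original source
df1b = pd.read_excel("data/test.xlsx", header=None, names=["note", "weekday"])
print("df1b:\n{}\n".format(df1b))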

Note: the code and data files for this article can be obtained from the Github repository mentioned at the beginning.

Reading CSV Files

Next, let's look at examples of reading CSV files.

The content of the first CSV file is as follows:

$ cat test1.csv 
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat

Reading it is also very simple:

# file_operation.py
df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))

Let's look at a second example. The content of this file is as follows:

$ cat test2.csv 
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat


Strictly speaking, this is no longer a CSV file, because its data is not separated by commas. In this case, we can read the file by specifying the separator, like this:

# file_operation.py
df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))

In fact, read_csv supports a great many parameters for adjusting how data is read, as shown in the table below:


Parameter: Description

  • path: file path
  • sep or delimiter: field delimiter
  • header: row number to use as the column names; the default is 0 (the first row)
  • index_col: column number or name to use as the row index of the result
  • names: list of column names for the result
  • skiprows: number of rows to skip from the start of the file
  • na_values: sequence of values to replace with NA
  • comment: character used to split off comments at the end of a line
  • parse_dates: attempt to parse data as datetime; the default is False
  • keep_date_col: if columns are joined to parse a date, keep the joined columns; the default is False
  • converters: converters for columns
  • dayfirst: when parsing dates that could be ambiguous, treat the day as coming first; the default is False
  • date_parser: function used to parse dates
  • nrows: number of rows to read from the beginning of the file
  • iterator: return a TextParser object for reading the file piece by piece
  • chunksize: size of the chunks to read
  • skip_footer: number of lines to ignore at the end of the file
  • verbose: print various parsing information
  • encoding: file encoding
  • squeeze: if the parsed data contains only one column, return a Series
  • thousands: thousands separator
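
As a quick illustration of a few of these parameters, here is a hedged sketch that reads test2.csv with an explicit separator, supplies its own column names (so no data row is consumed as a header) and limits the number of rows read. The column names are my own choice, not from the original article:

# additional illustration, not in the original source
df3b = pd.read_csv("data/test2.csv", sep="|", names=["note", "weekday"], nrows=3)
print("df3b:\n{}\n".format(df3b))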

For a detailed description of the read_csv function, see here: pandas.read_csv

Handling Invalid Values

The real world is not perfect, and the data we read often contains invalid values. If these invalid values are not handled properly, they can cause a great deal of trouble for a program.

There are two main approaches to handling invalid values: ignore them entirely, or replace them with valid values.

Below, I first create a data structure containing invalid values, and then use the pandas.isna function to confirm which values are invalid:

# process_na.py
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
   [5.0, np.nan, np.nan, 8.0],
   [9.0, np.nan, np.nan, 12.0],
   [13.0, np.nan, 15.0, 16.0]])
print("df:\n{}\n".format(df));
print("df:\n{}\n".format(pd.isna(df)));****

This code outputs the following:

df:
 0 1 2 3
0 1.0 NaN 3.0 4.0
1 5.0 NaN NaN 8.0
2 9.0 NaN NaN 12.0
3 13.0 NaN 15.0 16.0
df:
 0 1 2 3
0 False True False False
1 False True True False
2 False True True False
3 False True False False

Ignoring Invalid Values

We can discard invalid values with the pandas.DataFrame.dropna function:

# process_na.py
print("df.dropna():\n{}\n".format(df.dropna()));


Note: by default, dropna does not change the original data structure; it returns a new one instead. If you want to modify the data itself, pass inplace=True when calling the function.

Since every row of the original structure contains an invalid value, nothing is left once they are all dropped, so this line of code outputs the following:

df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
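
As a minimal sketch of the inplace variant mentioned above (my own addition, working on a copy so that the df used later stays untouched):

# additional illustration, not in the original source
df_copy = df.copy()
df_copy.dropna(inplace=True)   # modifies df_copy in place and returns None
print("df_copy:\n{}\n".format(df_copy))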


We can also choose to drop only those columns that consist entirely of invalid values:

# process_na.py
print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')));


Note: axis=1 refers to the column axis. how can be 'any' or 'all'; the default is the former.

This line of code outputs the following:

df.dropna(axis=1, how='all'):
 0 2 3
0 1.0 3.0 4.0
1 5.0 NaN 8.0
2 9.0 NaN 12.0
3 13.0 15.0 16.0

Replacing Invalid Values

We can also replace invalid values with valid ones using the fillna function, like this:

# process_na.py
print("df.fillna(1):\n{}\n".format(df.fillna(1)));

This code outputs the following:

df.fillna(1):
  0 1  2  3
0 1.0 1.0 3.0 4.0
1 5.0 1.0 1.0 8.0
2 9.0 1.0 1.0 12.0
3 13.0 1.0 15.0 16.0


Replacing every invalid value with the same data may not make much sense, so we can specify different values for different columns. To make this easier, before filling we can first rename the rows and columns with the rename method:

# process_na.py

df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
   columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
   inplace=True)
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df))

This code outputs the following:

df:
  col1 col2 col3 col4
index1 1.0 2.0 3.0 4.0
index2 5.0 2.0 7.0 8.0
index3 9.0 2.0 7.0 12.0
index4 13.0 2.0 15.0 16.0


Processing Strings

Data often involves string processing, so next let's look at pandas' string operations.

The str attribute of a Series contains a series of functions for handling strings, and these functions handle invalid values automatically.
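
As a minimal sketch of that behaviour (my own addition, not from the original source code), an invalid value in a Series simply remains NaN after a str operation instead of raising an error:

# additional illustration, not in the original source
import pandas as pd
import numpy as np

s0 = pd.Series(['a', np.nan, 'c'])
print("s0.str.upper():\n{}\n".format(s0.str.upper()))   # NaN is passed through unchanged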

Here are some examples. In the first set of data, we deliberately included strings containing spaces:

# process_string.py
import pandas as pd
s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5'])
print("s1.str.rstrip():\n{}\n".format(s1.str.rstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))

In this example we see how strings are stripped and how to check whether a string consists of digits. This code outputs the following:

s1.str.rstrip():
0  1
1 2 
2 3 
3  4
4  5
dtype: object
s1.str.strip():
0 1
1 2
2 3
3 4
4 5
dtype: object
s1.str.isdigit():
0 False
1 False
2 False
3  True
4  True
dtype: bool


Here are some more examples, showing how to convert strings to upper and lower case and how to get string lengths:

# process_string.py
s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
     'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))


This code outputs the following:

s2.str.lower():
0   stairway to heaven
1     eruption
2     freebird
3   comfortably numb
4 all along the watchtower
dtype: object

s2.str.upper():
0   STAIRWAY TO HEAVEN
1     ERUPTION
2     FREEBIRD
3   COMFORTABLY NUMB
4 ALL ALONG THE WATCHTOWER
dtype: object

s2.str.len():
0 18
1  8
2  8
3 16
4 24
dtype: int64
