pandas basics
Jun 23, 2017, 03:54 PM

pandas is a data analysis package built on top of NumPy that provides higher-level data structures and tools.

Just as NumPy is built around ndarray, pandas revolves around two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table respectively. The conventional way to import pandas is:

from pandas import Series,DataFrame
import pandas as pd

Series


Series can be regarded as a fixed-length ordered dictionary. Basically any one-dimensional data can be used to construct Series objects:

>>> s = Series([1,2,3.0,'abc'])
>>> s
0      1
1      2
2      3
3    abc
dtype: object

Although dtype: object can hold a mix of basic data types, it tends to hurt performance because the values are stored as generic Python objects. It is best to keep the dtype simple.

A Series object has two main attributes, index and values, which are the left and right columns in the example above. Because a list was passed to the constructor, the index is a sequence of integers counting up from 0. If a dictionary-like key-value structure is passed in, the keys become the index and the values become the data. Alternatively, an index can be specified explicitly with a keyword argument at construction time:

>>> s = Series(data=[1,3,5,7],index = ['a','b','x','y'])
>>> s
a    1
b    3
x    5
y    7
dtype: int64
>>> s.index
Index(['a', 'b', 'x', 'y'], dtype='object')
>>> s.values
array([1, 3, 5, 7], dtype=int64)

The elements of a Series are built strictly according to the given index. This means that if the data argument contains key-value pairs, only the keys that appear in index are used, and if a key in index is missing from data, it is still added with a NaN value.
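The rule above can be sketched with a small example. The dictionary contents and the label 'z' are made up for illustration; only the Series constructor itself comes from the text.

```python
import pandas as pd

# Hypothetical data: three key-value pairs
data = {'a': 1, 'b': 3, 'c': 5}

# Only the keys listed in `index` are used: 'c' is dropped,
# and 'z' (absent from `data`) is added with a NaN value.
s = pd.Series(data, index=['a', 'b', 'z'])
print(s)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 even though the input values were integers.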

Note that although each index entry corresponds to an element of values, this is not the same as a dictionary mapping. index and values are still separate array-like objects, so the performance of a Series is not a concern.

The biggest advantage of this key-value design is that indexes are aligned automatically when arithmetic is performed between Series objects.

In addition, the Series object and its index both contain a name attribute:

>>> s.name = 'a_series'
>>> s.index.name = 'the_index'
>>> s
the_index
a            1
b            3
x            5
y            7
Name: a_series, dtype: int64

DataFrame


A DataFrame is a tabular data structure that contains an ordered set of columns (the column labels play a role similar to index), and each column can have a different value type (unlike an ndarray, which has a single dtype). Basically, you can think of a DataFrame as a collection of Series that share the same index.

The construction method of DataFrame is similar to Series, except that it can accept multiple one-dimensional data sources at the same time, and each one will become a separate column:

>>> data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> df = DataFrame(data)
>>> df
   pop   state  year
0  1.5   Ohino  2000
1  1.7   Ohino  2001
2  3.6   Ohino  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

[5 rows x 3 columns]

Although the data argument looks like a dictionary, its keys do not become the index of the DataFrame. Instead, each key becomes the name of a column (the "name" attribute of the corresponding Series). The index generated here is still the integers 0 to 4.

A more complete constructor signature is DataFrame(data=None, index=None, columns=None), where columns gives the column names:

>>> df = DataFrame(data,index=['one','two','three','four','five'],
               columns=['year','state','pop','debt'])
>>> df
       year   state  pop debt
one    2000   Ohino  1.5  NaN
two    2001   Ohino  1.7  NaN
three  2002   Ohino  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

[5 rows x 4 columns]

Similarly, missing values are filled with NaN. Take a look at the index, the columns, and the type of a single column:

>>> df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
>>> df.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
>>> type(df['debt'])
<class 'pandas.core.series.Series'>

Row-oriented and column-oriented operations on a DataFrame are largely symmetric, and any single column extracted from it is a Series.
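As a minimal sketch of this symmetry, the example below pulls one column and one row out of a small frame. The two-row data set is invented for illustration, and selecting a row with .loc is one common way to do it, not something the text above prescribes.

```python
import pandas as pd

# Hypothetical two-row frame reusing column names from the example above
df = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]},
                  index=['one', 'two'])

col = df['pop']      # a column is a Series that shares the frame's index
row = df.loc['one']  # a row is a Series indexed by the column labels

print(type(col).__name__, col.name)
print(type(row).__name__, row.name)
```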

Object properties


Reindex

Series objects are reindexed with their .reindex(index=None, **kwargs) method. Two commonly used keyword arguments are method=None and fill_value=np.nan:

>>> ser = Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> a = ['a','b','c','d','e']
>>> ser.reindex(a)
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> ser.reindex(a,fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
>>> ser.sort_index().reindex(a,method='ffill')  # sort first: filling needs a monotonic index
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64
>>> ser.sort_index().reindex(a,fill_value=0,method='ffill')  # sort first: filling needs a monotonic index
a   -5.3
b    7.2
c    3.6
d    4.5
e    4.5
dtype: float64

.reindex() returns a new object whose index strictly follows the given argument. The method parameter, one of {'backfill', 'bfill', 'pad', 'ffill', None}, specifies how missing entries are filled. When it is not given, missing entries are filled with fill_value, which defaults to NaN. (ffill is the same as pad and bfill the same as backfill; they take the value from the preceding or the following entry respectively.)

The reindexing method of a DataFrame is .reindex(index=None, columns=None, **kwargs). Compared with Series it has one extra optional parameter, columns, which reindexes the columns. Usage is similar to the example above, except that in the examples below the method parameter fills only along rows, that is, axis 0.

>>> state = ['Texas','Utha','California']
>>> df.reindex(columns=state,method='ffill')
    Texas  Utha  California
a      1   NaN           2
c      4   NaN           5  
d      7   NaN           8

[3 rows x 3 columns]
>>> df.reindex(index=['a','b','c','d'],columns=state,method='ffill')
   Texas  Utha  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

[4 rows x 3 columns]

fill_value, however, still works on columns. You may already have wondered whether filling along columns can be done through df.T.reindex(index, method='**').T. The answer is yes. Also note that filling with reindex(index, method='**') requires a monotonic index, otherwise a ValueError is raised (Must be monotonic for forward fill). In the pandas version used for these examples, the last call above fails if index=['a','b','d','c'] is used. In current pandas releases the check applies to the object's existing index, so sort it first (for example with sort_index()) when it is not already ordered.
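The transpose trick can be sketched as follows. The frame is a made-up variant of the one in this section with its columns in alphabetical order, so that the transposed index is monotonic, and 'Nevada' is a hypothetical new column label.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; columns are sorted so the transposed index is monotonic
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['California', 'Ohio', 'Texas'])

new_cols = ['California', 'Nevada', 'Ohio', 'Texas']

# Transpose so columns become rows, forward-fill along them, transpose back
filled = df.T.reindex(new_cols, method='ffill').T
print(filled)

# A non-monotonic existing index cannot be forward-filled
ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
try:
    ser.reindex(['a', 'b', 'c', 'd', 'e'], method='ffill')
except ValueError as exc:
    print('reindex failed:', exc)
```

Here the new 'Nevada' column takes its values from 'California', the label that precedes it in sorted order.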

Deleting items on the specified axis

Deleting an element of a Series, or a row or column of a DataFrame, is done through the object's .drop(labels, axis=0) method:

>>> ser
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> ser.drop('c')
d    4.5
b    7.2
a   -5.3
dtype: float64
>>> df.drop('a')
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.drop(['Ohio','Texas'],axis=1)
   California
a           2
c           5
d           8

[3 rows x 1 columns]

.drop() returns a new object; the original object is not changed.
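A quick way to confirm this is to compare the original and the result after the call. This is a minimal sketch that reuses the ser values from the example above; the name trimmed is just an illustrative variable.

```python
import pandas as pd

ser = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

trimmed = ser.drop('c')   # new Series without the label 'c'

print(len(ser), len(trimmed))          # the original keeps all four elements
print('c' in ser.index, 'c' in trimmed.index)
```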

Indexing and slicing

Like Numpy, pandas also supports indexing and slicing through obj[::], as well as filtering through boolean arrays.

Note, however, that because the index of a pandas object is not limited to integers, a slice that uses non-integer labels includes its end point.

>>> foo
a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64
>>> bar
0    4.5
1    7.2
2   -5.3
3    3.6
dtype: float64
>>> foo[:2]
a    4.5
b    7.2
dtype: float64
>>> bar[:2]
0    4.5
1    7.2
dtype: float64
>>> foo[:'c']
a    4.5
b    7.2
c   -5.3
dtype: float64

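Here foo and bar differ only in their index: the index of bar is an integer sequence. As you can see, slicing with integers behaves the same as a Python list or NumPy by default, while slicing with a string label such as 'c' includes that boundary element in the result.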
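The two behaviours can be made explicit with the .iloc (position-based) and .loc (label-based) accessors. This is a minimal sketch using the foo values from above; the accessors are standard pandas but are not used in the original text.

```python
import pandas as pd

foo = pd.Series([4.5, 7.2, -5.3, 3.6], index=list('abcd'))

by_position = foo.iloc[:2]   # positions 0 and 1; the end point is excluded
by_label = foo.loc[:'c']     # labels 'a' through 'c'; the end point is included

print(list(by_position.index))
print(list(by_label.index))
```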

Another special case is how DataFrame objects are indexed, because they have two axes (double indexing).

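One way to think about it: the standard slicing syntax for a DataFrame is .ix[::,::]. The ix object accepts two sets of slices, one for the row direction (axis=0) and one for the column direction (axis=1). Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, where .loc (label-based) and .iloc (position-based) take its place: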

>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.ix[:2,:2]
   Ohio  Texas
a     0      1
c     3      4

[2 rows x 2 columns]
>>> df.ix['a','Ohio']
0
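
For readers on pandas 1.0 or later, where .ix no longer exists, the same two selections can be sketched with .iloc and .loc. The frame is rebuilt here so the example is self-contained; the variable names top_left and cell are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

top_left = df.iloc[:2, :2]    # positional: first two rows, first two columns
cell = df.loc['a', 'Ohio']    # label-based: a single value

print(top_left)
print(cell)
```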

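Slicing directly, without ix, is a special case:

  • Indexing selects columns

  • Slicing selects rows

This looks a bit illogical, but the author explains that "this syntax arose out of practicality", so we will take his word for it.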

>>> df['Ohio']
a    0
c    3
d    6
Name: Ohio, dtype: int32
>>> df[:'c']
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]
>>> df[:2]
   Ohio  Texas  California
a     0      1           2
c     3      4           5

[2 rows x 3 columns]

When using boolean arrays, note the different syntax for rows and columns (the : cannot be omitted when selecting columns):

>>> df['Texas']>=4
a    False
c     True
d     True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
   Ohio  Texas  California
c     3      4           5
d     6      7           8

[2 rows x 3 columns]
>>> df.ix[:,df.ix['c']>=4]
   Texas  California
a      1           2
c      4           5
d      7           8

[3 rows x 2 columns]
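
The same boolean selections can be written with .loc, which works on pandas versions where .ix has been removed. This is a sketch under that assumption; rows and cols are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# Rows where the 'Texas' column is at least 4
rows = df[df['Texas'] >= 4]

# Columns where row 'c' is at least 4; the leading ':' keeps all rows
cols = df.loc[:, df.loc['c'] >= 4]

print(rows)
print(cols)
```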

 

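Arithmetic and data alignment

One of the most important features of pandas is that it can perform arithmetic between objects with different indexes. When objects are added, the index of the result is the union of the two indexes. Automatic data alignment introduces missing values, NaN by default, wherever the indexes do not overlap.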

>>> foo = Series({'a':1,'b':2})
>>> foo
a    1
b    2
dtype: int64
>>> bar = Series({'b':3,'d':4})
>>> bar
b    3
d    4
dtype: int64
>>> foo + bar
a   NaN
b     5
d   NaN
dtype: float64

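Alignment on a DataFrame happens on both rows and columns.

When you do not want NA values in the result, you can use the fill_value parameter mentioned earlier for reindex. To pass it, however, you have to call the object's method rather than use the operator: df1.add(df2,fill_value=0). The other arithmetic methods are sub(), div(), and mul().

Arithmetic between a Series and a DataFrame involves broadcasting, which is not covered for now.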
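
To illustrate the fill_value argument, the sketch below repeats the foo + bar example with the add() method. The variable names plain and filled are illustrative only.

```python
import pandas as pd

foo = pd.Series({'a': 1, 'b': 2})
bar = pd.Series({'b': 3, 'd': 4})

plain = foo + bar                    # NaN where only one side has the label
filled = foo.add(bar, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```

With fill_value=0, label 'a' keeps the value from foo and label 'd' keeps the value from bar instead of both becoming NaN.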

Function application and mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects.

To apply a function to each row or column of a DataFrame, use the .apply(func, axis=0, args=(), **kwds) method.

>>> f = lambda x:x.max()-x.min()
>>> df
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.apply(f)
Ohio          6
Texas         6
California    6
dtype: int64
>>> df.apply(f,axis=1)
a    2
c    2
d    2
dtype: int64

 

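Sorting and ranking

The sort_index(ascending=True) method of a Series sorts by index. The ascending parameter controls ascending or descending order and defaults to ascending.

To sort a Series by value, use the .order() method; missing values are placed at the end of the Series by default. (.order() was later removed from pandas in favor of .sort_values().)

On a DataFrame, .sort_index(axis=0, by=None, ascending=True) adds an axis parameter and a by parameter. The by parameter sorts according to one or more columns (it cannot be used on rows). In later pandas versions this role is taken by .sort_values(by=...):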

>>> df.sort_index(by='Ohio')
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(by=['California','Texas'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

[3 rows x 3 columns]
>>> df.sort_index(axis=1)
   California  Ohio  Texas
a           2     0      1
c           5     3      4
d           8     6      7

[3 rows x 3 columns]
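
On current pandas versions, sorting by value is done with sort_values rather than order() or sort_index(by=...). The sketch below assumes such a version; the data is deliberately unsorted and invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose rows are not yet ordered by 'Ohio'
df = pd.DataFrame({'Ohio': [6, 0, 3], 'Texas': [7, 1, 4]},
                  index=['d', 'a', 'c'])
by_col = df.sort_values(by='Ohio')   # sort rows by the values of one column

# Sorting a Series by value; NaN goes to the end by default
ser = pd.Series([3.0, np.nan, 1.0])
ordered = ser.sort_values()

print(by_col)
print(ordered)
```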

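Ranking (Series.rank(method='average', ascending=True)) differs from sorting in that it replaces the values of the object with their ranks (from 1 to n). The only question is how to handle ties, which is what the method parameter is for. Four of its options are shown here: average, min, max, and first.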

>>> ser=Series([3,2,0,3],index=list('abcd'))
>>> ser
a    3
b    2
c    0
d    3
dtype: int64
>>> ser.rank()
a    3.5
b    2.0
c    1.0
d    3.5
dtype: float64
>>> ser.rank(method='min')
a    3
b    2
c    1
d    3
dtype: float64
>>> ser.rank(method='max')
a    4
b    2
c    1
d    4
dtype: float64
>>> ser.rank(method='first')
a    3
b    2
c    1
d    4
dtype: float64

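Note the different ranks that each method assigns to the tied pair ser[0] and ser[3].

The .rank(axis=0, method='average', ascending=True) method of a DataFrame adds an axis parameter, so ranking can be done along rows or along columns. There does not seem to be a built-in way to rank across all elements at present.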
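
One possible workaround for ranking across all elements is to flatten the values, rank them as a single Series, and reshape the result. This is a sketch of that idea, not a built-in pandas feature; flat_ranks and overall are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# Flatten all values into one Series and rank them together
flat_ranks = pd.Series(df.to_numpy().ravel()).rank()

# Put the ranks back into the original shape and labels
overall = pd.DataFrame(flat_ranks.to_numpy().reshape(df.shape),
                       index=df.index, columns=df.columns)
print(overall)
```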

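Statistical methods

pandas objects come with a set of statistical methods. Most of them are reductions or summary statistics, which extract a single value from a Series, or a Series from the rows or columns of a DataFrame.

Take DataFrame.mean(axis=0,skipna=True) as an example: when the data contains NA values, they are simply skipped, unless the entire slice (row or column) is NA. If you do not want this, pass skipna=False to disable it: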

>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

[4 rows x 2 columns]
>>> df.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> df.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> df.mean(axis=1,skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Other commonly used statistical methods include:

Method            Description
count             Number of non-NA values
describe          Compute summary statistics for the columns of a Series or DataFrame
min, max          Minimum and maximum values
argmin, argmax    Index positions (integers) of the minimum and maximum values
idxmin, idxmax    Index labels of the minimum and maximum values
quantile          Sample quantile (0 to 1)
sum               Sum
mean              Mean
median            Median
mad               Mean absolute deviation from the mean
var               Variance
std               Standard deviation
skew              Skewness of the sample values (third moment)
kurt              Kurtosis of the sample values (fourth moment)
cumsum            Cumulative sum of the sample values
cummin, cummax    Cumulative minimum and cumulative maximum of the sample values
cumprod           Cumulative product of the sample values
diff              First difference (useful for time series)
pct_change        Percent change
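
A few of these methods can be tried on the frame from the mean() example. This is a minimal sketch; the variable names are illustrative and only four of the listed methods are shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.40, 7.10, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

counts = df.count()      # non-NA values per column
labels = df.idxmax()     # index label of each column's maximum
running = df.cumsum()    # cumulative sum; NA positions stay NA
summary = df.describe()  # count, mean, std, min, quartiles, max per column

print(counts)
print(labels)
print(running)
print(summary)
```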


Handling missing data

The main representation of NA in pandas is np.nan. Python's built-in None is also treated as NA.

There are four methods for handling NA: dropna, fillna, isnull, and notnull.

is(not)null

This pair of methods is applied element-wise to the object and returns a Boolean array, which is typically used for Boolean indexing.

dropna

For a Series, dropna returns a Series containing only the non-null data and their index values.

DataFrame is trickier, because dropping removes at least one whole row (or column). The approach is similar, but a few extra parameters are available: dropna(axis=0, how='any', thresh=None). The how parameter accepts any or all; with all, a row (column) is discarded only if all of its elements are NA. Another interesting parameter is thresh, an integer: with thresh=3, for example, a row is kept only if it has at least 3 non-NA values.

fillna

In fillna(value=None, method=None, axis=0), the value parameter accepts a dictionary in addition to a scalar, which makes it possible to fill different columns with different values. The method parameter works the same way as in .reindex(), so I won't go into details here.
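
The parameters described above can be sketched on the small frame from the statistics section. The fill values 0 and -1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.40, 7.10, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

mask = df['two'].isnull()           # Boolean Series, usable for Boolean indexing
any_na = df.dropna()                # how='any': drop rows with at least one NA
all_na = df.dropna(how='all')       # drop only rows where every value is NA
enough = df.dropna(thresh=2)        # keep rows with at least 2 non-NA values
filled = df.fillna({'one': 0, 'two': -1})  # a different fill value per column

print(df[mask])
print(any_na)
print(all_na)
print(enough)
print(filled)
```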

inplace parameter

There is one point that I haven't mentioned so far, but after writing all the examples I realized it is quite important. Among the methods of Series and DataFrame objects, those that modify the data and return a new object often take an optional inplace=False parameter. If it is set to True, the original object is modified instead.
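
As a minimal sketch of the difference, the example below calls fillna with and without inplace=True on a made-up Series. fillna is used only as one example of a method that accepts this parameter.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

filled = ser.fillna(0)                 # default: returns a new Series
still_missing = int(ser.isna().sum())  # the original still has its NaN

ser.fillna(0, inplace=True)            # modifies ser itself

print(still_missing)
print(ser)
```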
