The techniques collected in this article are different from the common Pandas tricks we compiled before. You may not use them often, but when you run into particularly tricky problems, they can help you solve uncommon issues quickly.
1. Categorical type
By default, columns with a limited number of distinct values are assigned the object dtype, which is not an efficient choice in terms of memory. We can instead index these columns and store references to the values rather than the values themselves. Pandas provides a dtype called Categorical for exactly this purpose.
For example, consider a large dataset of image paths in which each row has three columns: anchor, positive, and negative.
If you use Categorical for categorical columns, you can significantly reduce memory usage.
# raw data
+----------+------------------------+
| class    | filename               |
+----------+------------------------+
| Bathroom | Bathroombath_1.jpg     |
| Bathroom | Bathroombath_100.jpg   |
| Bathroom | Bathroombath_1003.jpg  |
| Bathroom | Bathroombath_1004.jpg  |
| Bathroom | Bathroombath_1005.jpg  |
+----------+------------------------+

# target
+------------------------+------------------------+----------------------------+
| anchor                 | positive               | negative                   |
+------------------------+------------------------+----------------------------+
| Bathroombath_1.jpg     | Bathroombath_100.jpg   | Dinningdin_540.jpg         |
| Bathroombath_100.jpg   | Bathroombath_1003.jpg  | Dinningdin_1593.jpg        |
| Bathroombath_1003.jpg  | Bathroombath_1004.jpg  | Bedroombed_329.jpg         |
| Bathroombath_1004.jpg  | Bathroombath_1005.jpg  | Livingroomliving_1030.jpg  |
| Bathroombath_1005.jpg  | Bathroombath_1007.jpg  | Bedroombed_1240.jpg        |
+------------------------+------------------------+----------------------------+
The value of the filename column will be copied frequently. Therefore, memory usage can be greatly reduced by using Categorical.
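The conversion itself is one line per column. A minimal sketch, assuming the triplet table has already been loaded into an object-dtype DataFrame called triplets_raw:

# Assumes triplets_raw is the object-dtype DataFrame shown above.
triplets = triplets_raw.copy()
for col in ["anchor", "positive", "negative"]:
    # Each distinct filename is stored once; rows keep only small integer codes.
    triplets[col] = triplets[col].astype("category")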
Let's read the target data set and see the difference in memory:
triplets.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  category
#  1   positive  525000 non-null  category
#  2   negative  525000 non-null  category
# dtypes: category(3)
# memory usage: 4.6 MB

# without categories
triplets_raw.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  object
#  1   positive  525000 non-null  object
#  2   negative  525000 non-null  object
# dtypes: object(3)
# memory usage: 118.1 MB
The difference is very large, and the difference grows non-linearly as the number of repetitions increases.
2. Row-column conversion
Row-to-column (wide-to-long) conversion is a problem you often run into in SQL, and Pandas sometimes needs it too. Let's look at a dataset from a Kaggle competition, the census_start.csv file:
As you can see, the values are stored with one column per year. It would be much better to have a year column and a pct_bb column, with one value per row.
cols = sorted([col for col in original_df.columns if col.startswith("pct_bb")])
df = original_df[(["cfips"] + cols)]
df = df.melt(id_vars="cfips",
             value_vars=cols,
             var_name="year",
             value_name="feature").sort_values(by=["cfips", "year"])
Look at the result, isn't this much better?
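Since the competition file itself is not reproduced here, a tiny made-up example illustrates what melt does (the numbers are invented for illustration):

import pandas as pd

# Hypothetical miniature version of the census table, just to show the reshape.
wide = pd.DataFrame({
    "cfips": [1001, 1003],
    "pct_bb_2019": [0.62, 0.71],
    "pct_bb_2020": [0.65, 0.74],
})

long = wide.melt(id_vars="cfips",
                 value_vars=["pct_bb_2019", "pct_bb_2020"],
                 var_name="year",
                 value_name="feature").sort_values(by=["cfips", "year"])
print(long)
#    cfips         year  feature
# 0   1001  pct_bb_2019     0.62
# 2   1001  pct_bb_2020     0.65
# 1   1003  pct_bb_2019     0.71
# 3   1003  pct_bb_2020     0.74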
3. Speeding up apply()
As we mentioned before, it is best not to use apply() because it iterates over each row and calls the specified function. But if we have no other choice, is there any way to speed it up? You can use packages such as swifter or pandarallel to parallelize the process. A usage sketch follows the two snippets below.
Swifter:
import pandas as pd
import swifter
def target_function(row):
return row * 10
def traditional_way(data):
data['out'] = data['in'].apply(target_function)
def swifter_way(data):
data['out'] = data['in'].swifter.apply(target_function)
Pandarallel:
import pandas as pd
from pandarallel import pandarallel
def target_function(row):
return row * 10
def traditional_way(data):
data['out'] = data['in'].apply(target_function)
def pandarallel_way(data):
pandarallel.initialize()
data['out'] = data['in'].parallel_apply(target_function)
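Neither snippet above actually calls the functions. A hedged usage sketch (the column name "in" comes from the snippets; the data itself is made up):

import pandas as pd
import numpy as np

# Hypothetical data: a single numeric "in" column, matching the snippets above.
data = pd.DataFrame({"in": np.random.rand(1_000_000)})

traditional_way(data)   # plain .apply(), runs on a single core
swifter_way(data)       # swifter decides whether vectorization/parallelism pays off
pandarallel_way(data)   # pandarallel splits the Series across worker processes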
Through parallelization, the calculation can be sped up considerably. Of course, if you have a cluster, it is better to use dask or pyspark.

4. Null values, int and Int64

The standard int data type does not support null values, so columns containing them are automatically converted to floating point. If your data requires null values in integer fields, consider the Int64 data type, which uses pandas.NA to represent nulls.

5. CSV, compression or parquet?

Choose parquet whenever possible. Parquet preserves data types, so there is no need to specify dtypes when reading the data back. Parquet files are compressed with snappy by default, so they take up little disk space. Below you can see a few comparisons:

+-------------------------+---------+
| file                    | size    |
+-------------------------+---------+
| triplets_525k.csv       | 38.4 MB |
| triplets_525k.csv.gzip  | 4.3 MB  |
| triplets_525k.csv.zip   | 4.5 MB  |
| triplets_525k.parquet   | 1.9 MB  |
+-------------------------+---------+

Reading parquet requires an additional package such as pyarrow or fastparquet. ChatGPT says pyarrow is faster than fastparquet, but when I tested on a small dataset fastparquet was faster. Still, pyarrow is recommended here, because pandas 2.0 uses it by default.

6. value_counts()

Computing relative frequencies by hand, getting the counts and dividing by the total, is tedious. With value_counts this task can be accomplished much more easily, and the method offers an option to include or exclude null values.
df = pd.DataFrame({"a": [1, 2, None], "b": [4., 5.1, 14.02]}) df["a"] = df["a"].astype("Int64") print(df.info()) print(df["a"].value_counts(normalize=True, dropna=False), df["a"].value_counts(normalize=True, dropna=True), sep="nn")
7. Modin

When the dataset is large, Modin can be used as a drop-in replacement for pandas that parallelizes operations across all available CPU cores; only the import changes.

!pip install modin[all]

import modin.pandas as pd
df = pd.read_csv("my_dataset.csv")

The following is the architecture diagram from Modin's official website, in case you are interested in studying it:
8. Extracting data from text columns

If you often encounter complex semi-structured data and need to split it into individual columns, you can use this method:
import pandas as pd

regex = (r"(?P<title>[A-Za-z\s']+),"
         r"(?P<author>[A-Za-z\s']+),"
         r"(?P<isbn>[\d-]+),"
         r"(?P<year>\d{4}),"
         r"(?P<publisher>.+)")

addr = pd.Series([
    "The Lost City of Amara,Olivia Garcia,978-1-234567-89-0,2023,HarperCollins",
    "The Alchemist's Daughter,Maxwell Greene,978-0-987654-32-1,2022,Penguin Random House",
    "The Last Voyage of the HMS Endeavour,Jessica Kim,978-5-432109-87-6,2021,Simon & Schuster",
    "The Ghosts of Summer House,Isabella Lee,978-3-456789-12-3,2000,Macmillan Publishers",
    "The Secret of the Blackthorn Manor,Emma Chen,978-9-876543-21-0,2023,Random House Children's Books"
])
addr.str.extract(regex)
9. Reading and writing the clipboard
Some people may never use this trick, but others might need it, for example when your analysis involves a table inside a PDF file. The usual approach is to copy the data, paste it into Excel, export it to a csv file, and then import it into Pandas. But there is a simpler solution: pd.read_clipboard(). All we need to do is copy the required data and call one method.
Where there is reading there is also writing, so you can export to the clipboard with the to_clipboard() method.
But keep in mind that the clipboard here is the clipboard of the host where python/jupyter is running; it cannot paste across hosts, so don't mix them up.
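A minimal sketch of the round trip (what you actually get depends entirely on what is on the clipboard at that moment):

import pandas as pd

# Copy a table (e.g. from a PDF viewer or a web page), then:
df = pd.read_clipboard()       # parses the clipboard text into a DataFrame

# ... clean the data ...

df.to_clipboard(index=False)   # put the result back, ready to paste into Excel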
10. Splitting an array column into multiple columns
Suppose we have a dataset like this, which is a fairly typical situation:
import pandas as pd df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "category": [["foo", "bar"], ["foo"], ["qux"]]}) # let's increase the number of rows in a dataframe df = pd.concat([df]*10000, ignore_index=True)
We want to split category into multiple columns, for example like this:
Let's first look at the slowest approach, apply:
def dummies_series_apply(df):
    return df.join(df['category'].apply(pd.Series)
                    .stack()
                    .str.get_dummies()
                    .groupby(level=0)
                    .sum()) \
             .drop("category", axis=1)

%timeit dummies_series_apply(df.copy())
#5.96 s ± 66.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

def sklearn_mlb(df):
    mlb = MultiLabelBinarizer()
    return df.join(pd.DataFrame(mlb.fit_transform(df['category']), columns=mlb.classes_)) \
             .drop("category", axis=1)

%timeit sklearn_mlb(df.copy())
#35.1 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Isn't that much faster? We can also use ordinary vectorized operations to sum it up:
def dummies_vectorized(df):
    return pd.get_dummies(df.explode("category"), prefix="cat") \
             .groupby(["a", "b"]) \
             .sum() \
             .reset_index()

%timeit dummies_vectorized(df.copy())
#29.3 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using the first method (which is very common in StackOverflow answers) gives a very slow result, while the other two optimized methods are very fast.
Summary
I hope everyone learned something new from these tricks. The important thing to remember is to use vectorized operations instead of apply() whenever possible. Besides csv, there are other interesting ways to store datasets. And don't forget the categorical data type, it can save a lot of memory. Thanks for reading!