
Two great tips to improve the efficiency of your Pandas code


If you have ever used Pandas with tabular data, you may be familiar with the process of importing the data, cleaning and transforming it, and then using it as input to the model. However, when you need to scale and put your code into production, your Pandas pipeline will most likely start to crash and run slowly. In this article, I will share 2 tips to help you speed up Pandas code execution, improve data processing efficiency, and avoid common pitfalls.


Tip 1: Vectorized operations

In Pandas, vectorized operations are an efficient tool: they let you process entire columns of a DataFrame in a more concise way, without looping row by row.

How does it work?

Broadcasting is a key element of vectorized operations: it allows you to operate intuitively on objects of different shapes.

Example 1: an array a with 3 elements is multiplied by a scalar b, resulting in an array with the same shape as a.


Example 2: when adding an array a with shape (4, 1) to an array b with shape (3,), the result is an array with shape (4, 3).

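The original figures for these two examples are not reproduced here. As a minimal sketch (assuming NumPy, which backs Pandas columns), the same broadcasting behavior looks like this:

import numpy as np

# Example 1: multiplying by a scalar keeps the array's shape.
a = np.array([1, 2, 3])
b = 2
print(a * b)            # [2 4 6], shape (3,)

# Example 2: adding shapes (4, 1) and (3,) broadcasts to shape (4, 3).
a = np.arange(4).reshape(4, 1)
b = np.array([10, 20, 30])
print((a + b).shape)    # (4, 3)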

Broadcasting has been discussed in many articles, especially in the context of deep learning, where large-scale matrix multiplications are common. Here we will look at two brief examples.

First, let's say you want to count the number of times a given integer appears in a column. Here are 2 possible methods.

"""计算DataFrame X 中 "column_1" 列中等于目标值 target 的元素个数。参数:X: DataFrame,包含要计算的列 "column_1"。target: int,目标值。返回值:int,等于目标值 target 的元素个数。"""# 使用循环计数def count_loop(X, target: int) -> int:return sum(x == target for x in X["column_1"])# 使用矢量化操作计数def count_vectorized(X, target: int) -> int:return (X["column_1"] == target).sum()

Now suppose you have a DataFrame with a date column and want to offset it by a given number of days. The loop-based and vectorized calculations are as follows:

def offset_loop(X, days: int) -> pd.DataFrame:
    d = pd.Timedelta(days=days)
    X["column_const"] = [x + d for x in X["column_10"]]
    return X

def offset_vectorized(X, days: int) -> pd.DataFrame:
    X["column_const"] = X["column_10"] + pd.Timedelta(days=days)
    return X
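A hypothetical usage sketch, assuming "column_10" holds datetime values (the sample dates are made up):

import pandas as pd

X = pd.DataFrame({"column_10": pd.to_datetime(["2024-01-01", "2024-01-15"])})
print(offset_vectorized(X, days=7)["column_const"])
# 2024-01-08 and 2024-01-22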

Tip 2: Iteration
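All of the examples in this section call a helper function remove_words(text, words_to_remove), which strips a given set of words from a string. Its implementation is not shown in this article, so the following is only a hypothetical minimal sketch of what such a helper might look like (the signature and behavior are assumptions for illustration):

def remove_words(text: str, words_to_remove) -> str:
    # Hypothetical helper: drop every whitespace-separated token
    # that appears in words_to_remove.
    removal_set = set(words_to_remove)
    return " ".join(w for w in text.split() if w not in removal_set)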

「for loop」

The first and most intuitive way to iterate is to use the Python for loop.

def loop(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    res = []
    i_remove_col = df.columns.get_loc(remove_col)
    i_words_to_remove_col = df.columns.get_loc(words_to_remove_col)
    for i_row in range(df.shape[0]):
        res.append(
            remove_words(df.iat[i_row, i_remove_col], df.iat[i_row, i_words_to_remove_col])
        )
    return res

「apply」

def apply(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return df.apply(
        func=lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()

On each iteration, df.apply passes the provided callable a Series whose index is df.columns and whose values are those of the current row. This means pandas has to build that Series for every row, which is expensive. To reduce the cost, it is best to call apply on the subset of df that you actually use, like this:

def apply_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return df[[remove_col, words_to_remove_col]].apply(
        func=lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()

「List comprehension + itertuples」

Iterating with itertuples inside a list comprehension generally works better: itertuples generates (named) tuples containing the row data.

def itertuples_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [
        remove_words(x[0], x[1])
        for x in df[[remove_col, words_to_remove_col]].itertuples(index=False, name=None)
    ]

「List comprehension + zip」

zip accepts several iterables and generates tuples, where the i-th tuple contains the i-th element of each of the given iterables, in order.
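A quick illustration in plain Python:

print(list(zip([1, 2, 3], ["a", "b", "c"])))
# [(1, 'a'), (2, 'b'), (3, 'c')]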

def zip_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [remove_words(x, y) for x, y in zip(df[remove_col], df[words_to_remove_col])]

「List comprehension + to_dict」

def to_dict_only_used_columns(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [
        remove_words(row[remove_col], row[words_to_remove_col])
        for row in df[[remove_col, words_to_remove_col]].to_dict(orient="records")
    ]

「Caching」

In addition to the iteration techniques discussed above, two other approaches can help improve the performance of the code: caching and parallelization. Caching is particularly useful if you call a pandas function multiple times with the same arguments. For example, if remove_words is applied to a dataset with many duplicate values, you can use functools.lru_cache to store the results of the function and avoid recomputing them each time. To use lru_cache, simply add the @lru_cache decorator to the declaration of remove_words, then apply the function to your dataset using your preferred iteration method. This can significantly improve the speed and efficiency of your code. Take the following code as an example:

from functools import lru_cache

@lru_cache
def remove_words(...):
    ...  # Same implementation as before

def zip_only_used_cols_cached(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [remove_words(x, y) for x, y in zip(df[remove_col], df[words_to_remove_col])]

Adding this decorator produces a function that "remembers" the output for previously seen inputs, eliminating the need to run the full computation again for those inputs.
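One caveat, based on the assumption that remove_words receives a collection of words as its second argument: lru_cache requires all arguments to be hashable, so the words to remove would need to be passed as, for example, a tuple or frozenset rather than a list. A small hypothetical sketch, including a check of how often the cache is hit:

from functools import lru_cache

@lru_cache
def remove_words(text: str, words_to_remove: frozenset) -> str:
    # Hashable-argument variant of the hypothetical helper.
    return " ".join(w for w in text.split() if w not in words_to_remove)

print(remove_words("a b c", frozenset({"b"})))  # "a c"
print(remove_words.cache_info())                # hits/misses statistics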

「Parallelization」

The final trump card is to use pandarallel to parallelize our function calls across multiple independent chunks of the DataFrame. The tool is easy to use: you just import and initialize it, then change every .apply to .parallel_apply.

import os

from pandarallel import pandarallel

pandarallel.initialize(nb_workers=min(os.cpu_count(), 12))

def parapply_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return df[[remove_col, words_to_remove_col]].parallel_apply(
        lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()
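As a final hypothetical end-to-end sketch (the DataFrame, column names, and values are made up, and remove_words is assumed to be defined as sketched at the start of this section), every variant above takes the same three arguments:

import pandas as pd

df = pd.DataFrame({
    "text": ["the quick brown fox", "hello world hello"],
    "stopwords": [["the"], ["hello"]],
})

# For example, with the zip-based variant:
print(zip_only_used_cols(df, remove_col="text", words_to_remove_col="stopwords"))
# ['quick brown fox', 'world']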
