
Two great tips to improve the efficiency of your Pandas code


If you have ever used Pandas with tabular data, you may be familiar with the process of importing the data, cleaning and transforming it, and then using it as input to the model. However, when you need to scale and put your code into production, your Pandas pipeline will most likely start to crash and run slowly. In this article, I will share 2 tips to help you speed up Pandas code execution, improve data processing efficiency, and avoid common pitfalls.


Tip 1: Vectorized operations

In Pandas, vectorized operations are an efficient tool: they let you process entire columns of a DataFrame in a more concise way, without looping row by row.

How does it work?

Broadcasting is a key element of vectorized operations: it allows you to operate intuitively on objects of different shapes.

Example 1: an array a with 3 elements is multiplied by a scalar b, resulting in an array with the same shape as a.


Example 2: when adding an array a with shape (4, 1) to an array b with shape (3,), the result is an array with shape (4, 3).

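The original figures for these two examples are not reproduced here. As a minimal sketch (assuming NumPy, which backs Pandas columns), the same broadcasting behavior looks like this:

import numpy as np

# Example 1: multiplying by a scalar keeps the array's shape.
a = np.array([1, 2, 3])
b = 2
print(a * b)            # [2 4 6], shape (3,)

# Example 2: adding shapes (4, 1) and (3,) broadcasts to shape (4, 3).
a = np.arange(4).reshape(4, 1)
b = np.array([10, 20, 30])
print((a + b).shape)    # (4, 3)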

Broadcasting has been discussed in many articles, especially in the context of deep learning, where large-scale matrix multiplications are common. Here we will look at two brief examples.

First, let's say you want to count the number of times a given integer appears in a column. Here are 2 possible methods.

"""计算DataFrame X 中 "column_1" 列中等于目标值 target 的元素个数。参数:X: DataFrame,包含要计算的列 "column_1"。target: int,目标值。返回值:int,等于目标值 target 的元素个数。"""# 使用循环计数def count_loop(X, target: int) -> int:return sum(x == target for x in X["column_1"])# 使用矢量化操作计数def count_vectorized(X, target: int) -> int:return (X["column_1"] == target).sum()

Now suppose you have a DataFrame with a date column and want to offset it by a given number of days. The loop-based and vectorized calculations are as follows:

def offset_loop(X, days: int) -> pd.DataFrame:
    d = pd.Timedelta(days=days)
    X["column_const"] = [x + d for x in X["column_10"]]
    return X

def offset_vectorized(X, days: int) -> pd.DataFrame:
    X["column_const"] = X["column_10"] + pd.Timedelta(days=days)
    return X
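A hypothetical usage sketch, assuming "column_10" holds datetime values (the sample dates are made up):

import pandas as pd

X = pd.DataFrame({"column_10": pd.to_datetime(["2024-01-01", "2024-01-15"])})
print(offset_vectorized(X, days=7)["column_const"])
# 2024-01-08 and 2024-01-22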

Tip 2: Iteration
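All of the examples in this section call a helper function remove_words(text, words_to_remove), which strips a given set of words from a string. Its implementation is not shown in this article, so the following is only a hypothetical minimal sketch of what such a helper might look like (the signature and behavior are assumptions for illustration):

def remove_words(text: str, words_to_remove) -> str:
    # Hypothetical helper: drop every whitespace-separated token
    # that appears in words_to_remove.
    removal_set = set(words_to_remove)
    return " ".join(w for w in text.split() if w not in removal_set)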

「for loop」

The first and most intuitive way to iterate is to use the Python for loop.

def loop(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    res = []
    i_remove_col = df.columns.get_loc(remove_col)
    i_words_to_remove_col = df.columns.get_loc(words_to_remove_col)
    for i_row in range(df.shape[0]):
        res.append(
            remove_words(df.iat[i_row, i_remove_col], df.iat[i_row, i_words_to_remove_col])
        )
    return res

「apply」

def apply(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return df.apply(
        func=lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()

On each iteration, df.apply passes the provided callable a Series whose index is df.columns and whose values are those of the current row. This means pandas has to build that Series for every row, which is expensive. To reduce the cost, it is best to call apply on the subset of df that you actually use, like this:

def apply_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return df[[remove_col, words_to_remove_col]].apply(
        func=lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()

「List comprehension + itertuples」

Iterating with itertuples inside a list comprehension generally works better: itertuples generates (named) tuples containing the row data.

def itertuples_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [
        remove_words(x[0], x[1])
        for x in df[[remove_col, words_to_remove_col]].itertuples(index=False, name=None)
    ]

「List comprehension + zip」

zip accepts several iterables and generates tuples, where the i-th tuple contains the i-th element of each of the given iterables, in order.
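A quick illustration in plain Python:

print(list(zip([1, 2, 3], ["a", "b", "c"])))
# [(1, 'a'), (2, 'b'), (3, 'c')]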

def zip_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [remove_words(x, y) for x, y in zip(df[remove_col], df[words_to_remove_col])]

「List comprehension + to_dict」

def to_dict_only_used_columns(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [
        remove_words(row[remove_col], row[words_to_remove_col])
        for row in df[[remove_col, words_to_remove_col]].to_dict(orient="records")
    ]

「Caching」

In addition to the iteration techniques discussed above, two other approaches can help improve the performance of the code: caching and parallelization. Caching is particularly useful if you call a pandas function multiple times with the same arguments. For example, if remove_words is applied to a dataset with many duplicate values, you can use functools.lru_cache to store the results of the function and avoid recomputing them each time. To use lru_cache, simply add the @lru_cache decorator to the declaration of remove_words, then apply the function to your dataset using your preferred iteration method. This can significantly improve the speed and efficiency of your code. Take the following code as an example:

from functools import lru_cache

@lru_cache
def remove_words(...):
    ...  # Same implementation as before

def zip_only_used_cols_cached(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return [remove_words(x, y) for x, y in zip(df[remove_col], df[words_to_remove_col])]

Adding this decorator produces a function that "remembers" the output for previously seen inputs, eliminating the need to run the full computation again for those inputs.
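One caveat, based on the assumption that remove_words receives a collection of words as its second argument: lru_cache requires all arguments to be hashable, so the words to remove would need to be passed as, for example, a tuple or frozenset rather than a list. A small hypothetical sketch, including a check of how often the cache is hit:

from functools import lru_cache

@lru_cache
def remove_words(text: str, words_to_remove: frozenset) -> str:
    # Hashable-argument variant of the hypothetical helper.
    return " ".join(w for w in text.split() if w not in words_to_remove)

print(remove_words("a b c", frozenset({"b"})))  # "a c"
print(remove_words.cache_info())                # hits/misses statistics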

「Parallelization」

The final trump card is to use pandarallel to parallelize our function calls across multiple independent chunks of the DataFrame. The tool is easy to use: you just import and initialize it, then change every .apply to .parallel_apply.

import os

from pandarallel import pandarallel

pandarallel.initialize(nb_workers=min(os.cpu_count(), 12))

def parapply_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    return df[[remove_col, words_to_remove_col]].parallel_apply(
        lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()
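As a final hypothetical end-to-end sketch (the DataFrame, column names, and values are made up, and remove_words is assumed to be defined as sketched at the start of this section), every variant above takes the same three arguments:

import pandas as pd

df = pd.DataFrame({
    "text": ["the quick brown fox", "hello world hello"],
    "stopwords": [["the"], ["hello"]],
})

# For example, with the zip-based variant:
print(zip_only_used_cols(df, remove_col="text", words_to_remove_col="stopwords"))
# ['quick brown fox', 'world']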
