Why is np.vectorize() Faster than df.apply() for Pandas Column Creation?

Susan Sarandon
2024-10-27 04:34:30

Performance Comparison of Pandas apply vs np.vectorize

np.vectorize() is often significantly faster than row-wise df.apply() when creating a new column from existing columns in a Pandas DataFrame. The difference stems from how each method passes data to the user-defined function, not from any compilation of that function.

df.apply() vs Python-Level Loops

df.apply() with axis=1 is essentially a Python-level loop over the rows of the DataFrame, with the added cost of constructing a Pandas Series for each row. Python-level loops of any kind, including list comprehensions and map, are relatively slow compared to true vectorised calculations, which run in compiled code over whole arrays.
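
As a rough illustration of the row-wise pattern described above, the sketch below builds a new column with df.apply() and then repeats the same work as a list comprehension. The data, the column names A and B, and the helper safe_divide are assumptions made for this example, not taken from any benchmark.

```python
import pandas as pd

# Hypothetical example data; the column names A and B are illustrative only.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 0, 4, 8]})


def safe_divide(a, b):
    """Hypothetical helper: return a / b, or 0.0 when b is zero."""
    return a / b if b != 0 else 0.0


# axis=1 calls the lambda once per row; each row arrives as a Series.
df["result"] = df.apply(lambda row: safe_divide(row["A"], row["B"]), axis=1)

# The same work written as an explicit Python-level loop over two columns.
loop_result = [safe_divide(a, b) for a, b in zip(df["A"], df["B"])]
```

Both versions call safe_divide once per row from Python, which is why neither scales as well as a vectorised expression.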

np.vectorize() vs df.apply()

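np.vectorize() wraps a user-defined function so that it accepts NumPy arrays and broadcasts over them in the manner of a universal function (ufunc). It does not compile the function: the NumPy documentation states that vectorize is provided primarily for convenience, not for performance, and that the implementation is essentially a for loop. It is nonetheless faster than row-wise df.apply() because it iterates over the underlying NumPy arrays and passes plain scalar values to the function, whereas df.apply() builds a Pandas Series for each row and incurs that overhead on every iteration.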
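
A minimal sketch of the np.vectorize() approach follows, reusing the same hypothetical data and safe_divide helper (both are assumptions for illustration). Passing otypes is optional; it is included here so the output dtype is fixed up front.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names A and B are illustrative only.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 0, 4, 8]})


def safe_divide(a, b):
    """Hypothetical helper: return a / b, or 0.0 when b is zero."""
    return a / b if b != 0 else 0.0


# Wrap the scalar function so it can be called with whole arrays.
# The function is still invoked once per element from Python.
vec_divide = np.vectorize(safe_divide, otypes=[float])

# Pass the underlying NumPy arrays rather than rows of the DataFrame.
df["result"] = vec_divide(df["A"].to_numpy(), df["B"].to_numpy())
```

The function body is unchanged from the df.apply() version; only the way the values reach it differs.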

True Vectorisation: Optimal Performance

For truly efficient column creation, vectorised calculations within NumPy are highly recommended. Operations like numpy.where and direct element-wise division with df["A"] / df["B"] process whole arrays in compiled code, so they are extremely fast and avoid the per-element overhead of Python loops.
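
The same hypothetical calculation can be sketched with numpy.where and element-wise division. The data and column names are again assumptions; np.errstate is used only to silence the divide-by-zero warning for the rows that numpy.where then overwrites.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names A and B are illustrative only.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 0, 4, 8]})

a = df["A"].to_numpy()
b = df["B"].to_numpy()

# Divide every element at once; rows where b == 0 temporarily hold inf or nan.
with np.errstate(divide="ignore", invalid="ignore"):
    quotient = a / b

# Keep the quotient where the divisor is non-zero, otherwise use 0.0.
df["result"] = np.where(b != 0, quotient, 0.0)
```

No Python function is called per element here: the division and the selection each run as a single array operation.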

Numba Optimisation

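For even greater efficiency, explicit loops can be optimised with Numba, a just-in-time compiler that translates Python functions into optimised machine code using LLVM. For simple numeric loops over NumPy arrays, Numba can reduce execution time to microseconds, significantly outperforming both df.apply() and np.vectorize(). The first call includes compilation time, so the benefit shows on subsequent calls or on large inputs.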
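
A possible Numba version of the same hypothetical calculation is sketched below. The function name safe_divide_loop and the data are assumptions for illustration, and the ImportError fallback is added only so the sketch still runs (without any speed-up) where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python without Numba.
    def njit(func):
        return func


@njit
def safe_divide_loop(a, b):
    """Hypothetical helper: element-wise a / b, with 0.0 where b is zero."""
    out = np.zeros(a.shape[0], dtype=np.float64)
    for i in range(a.shape[0]):
        if b[i] != 0:
            out[i] = a[i] / b[i]
    return out


# Hypothetical example data; the column names A and B are illustrative only.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 0, 4, 8]})

# Numba works on NumPy arrays, so pass the arrays rather than the Series.
df["result"] = safe_divide_loop(df["A"].to_numpy(), df["B"].to_numpy())
```

The loop looks like ordinary Python, but when Numba is available it is compiled on the first call and runs as machine code afterwards.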

Conclusion

While np.vectorize() may offer some improvement over df.apply(), it is still a Python-level loop and not a true substitute for vectorised calculations in NumPy. To achieve maximum performance when creating new columns in Pandas DataFrames, use direct vectorised operations within NumPy or a Numba-compiled loop.
