How Can Regular Expressions Improve Pandas Series Substring Filtering Performance?

DDD
2024-11-27 00:14:10

Improving Performance for Multiple Substring Filtering in Pandas Series

A common way to keep only the rows whose string column contains at least one substring from a given list is to call str.contains() once per substring and combine the resulting masks with np.logical_or.reduce(). On large datasets this is slow, because the whole column is scanned once per substring. This article describes an alternative that uses a single regular expression to improve performance.
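For reference, the conventional approach can be sketched roughly as follows. The DataFrame, the column name "text", and the substring list are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column name "text" is an assumption.
df = pd.DataFrame(
    {"text": ["apple pie", "banana split", "cherry tart", "plain toast"]}
)
substrings = ["apple", "cherry"]

# One literal (regex=False) scan of the whole column per substring,
# then an element-wise OR across all of the resulting boolean masks.
masks = [df["text"].str.contains(s, regex=False) for s in substrings]
baseline_mask = np.logical_or.reduce(masks)

filtered = df[baseline_mask]
print(filtered)
```

With 100 substrings this means 100 separate passes over the column, which is the cost the regex approach tries to avoid.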

Proposed Solution

Rather than calling str.contains() with regex=False once per substring, escape each substring with re.escape() so that any regex metacharacters in it are matched literally instead of being interpreted as regex syntax. Then join the escaped substrings with the regex alternation operator (|) to form a single pattern.
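A minimal sketch of the pattern-building step is shown here. The substrings are made up and deliberately contain regex metacharacters to show why escaping matters.

```python
import re

# Hypothetical substrings containing characters that are special in a regex.
substrings = ["a.b", "c+d", "x|y"]

# Escape each substring so it is matched literally, then join with "|".
pattern = "|".join(re.escape(s) for s in substrings)

print(pattern)
print(bool(re.search(pattern, "value a.b here")))  # literal "a.b" is present
print(bool(re.search(pattern, "value aXb here")))  # "." must not act as a wildcard
```

One caveat worth keeping in mind: if the substring list is empty, the joined pattern is an empty string, which matches every non-null value, so that case may need separate handling.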

Masking Process

The mask is then built with a single vectorized str.contains() call that tests each string in the column against the combined pattern; case=False makes the match case-insensitive:

df[col].str.contains(pattern, case=False)
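Put together, the filtering step might look like the sketch below. The data and the column name "text" are hypothetical, and na=False is an addition not present in the line above: it makes missing values count as non-matches so the mask stays boolean.

```python
import re

import pandas as pd

# Hypothetical example data, including a missing value.
df = pd.DataFrame({"text": ["Apple pie", "banana split", "C++ notes", None]})
substrings = ["apple", "c++"]

# Single escaped pattern covering every substring.
pattern = "|".join(re.escape(s) for s in substrings)

# One pass over the column; na=False (an assumption added here)
# treats missing values as "no match".
mask = df["text"].str.contains(pattern, case=False, na=False)

filtered = df[mask]
print(filtered)
```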

Performance Comparison

Using a sample dataset with 100 substrings of length 5 and 50,000 strings of length 20, the proposed method took approximately 1 second. The original method took around 5 seconds for the same data.
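A comparison along these lines could be reproduced with a sketch like the following. The random data generator, the seed, and the variable names are assumptions; only the sizes come from the text above. Absolute timings depend on hardware and on the pandas and Python versions, so the numbers may differ from those quoted.

```python
import random
import re
import string
import time

import numpy as np
import pandas as pd

random.seed(0)  # arbitrary seed, used only for repeatability


def random_string(length):
    """Return a random lowercase ASCII string of the given length."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


# Sizes taken from the article: 100 substrings of length 5,
# 50,000 strings of length 20. Very few rows are expected to match.
substrings = [random_string(5) for _ in range(100)]
series = pd.Series([random_string(20) for _ in range(50_000)])

# Original method: one literal scan per substring, OR-ed together.
start = time.perf_counter()
baseline = np.logical_or.reduce(
    [series.str.contains(s, regex=False) for s in substrings]
)
baseline_seconds = time.perf_counter() - start

# Proposed method: one scan with a single escaped alternation pattern.
pattern = "|".join(re.escape(s) for s in substrings)
start = time.perf_counter()
combined = series.str.contains(pattern, case=False)
regex_seconds = time.perf_counter() - start

print(f"one scan per substring: {baseline_seconds:.2f} s")
print(f"single escaped pattern: {regex_seconds:.2f} s")
```

Because all of the generated data is lowercase, both methods should produce the same mask here even though only the regex call is case-insensitive.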

Note

These timings represent a "worst-case" scenario in which there are no substring matches, so every substring has to be checked against every row. When matches do exist, the timing will improve. More generally, this approach is more efficient than the initial method because it makes a single pass over the series with one combined pattern, rather than one pass per substring.