Efficient Pandas Filtering for Multiple Substrings in a Series
Checking whether each element of a Series contains any of several substrings is a common task in data analysis. Combining individual str.contains calls with a logical OR is a straightforward solution, but it can be inefficient for long substring lists and large DataFrames.
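For reference, the logical OR approach mentioned above might look like the following minimal sketch. The sample data, the variable names, and the use of np.logical_or.reduce are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
s = pd.Series(["apple pie", "banana split", "cherry tart", "plain toast"])
substrings = ["apple", "cherry"]

# One pass over the Series per substring; regex=False treats each as a literal.
masks = [s.str.contains(sub, regex=False).to_numpy(dtype=bool) for sub in substrings]

# Element-wise OR across all of the per-substring masks.
mask = np.logical_or.reduce(masks)
filtered = s[mask]
print(filtered.tolist())
```

Each substring triggers a separate scan of the Series, which is why the cost grows with the length of the substring list.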
To optimize this task, consider adopting a regular expression (regex) approach. By wrapping the substrings in a regex pattern, we can leverage pandas' efficient string matching functions. Specifically, after escaping any special characters in the substrings, we can construct a regex pattern by joining the substrings using the pipe character (|):
import re
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
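To see why the escaping step matters, here is a small sketch using substrings that contain regex metacharacters. The example substrings are hypothetical.

```python
import re

# Hypothetical substrings containing regex metacharacters
lst = ["a.b", "c+d", "(x)"]
esc_lst = [re.escape(s) for s in lst]
pattern = "|".join(esc_lst)

# Unescaped, "." matches any character, so "a.b" also matches "aXb".
unescaped_hit = re.search("a.b", "aXb") is not None

# Escaped, the pattern only matches the literal text "a.b".
escaped_hit = re.search(pattern, "aXb") is not None
literal_hit = re.search(pattern, "has a.b inside") is not None

print(pattern)
print(unescaped_hit, escaped_hit, literal_hit)
```

Without re.escape, a substring such as "(x" would not even compile as a regex, so escaping is needed for correctness as well as for literal matching.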
With this pattern, str.contains with case-insensitive matching returns a boolean mask that can be used to filter the series:
df[col].str.contains(pattern, case=False)
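As a minimal end-to-end sketch, the mask can be used to select the matching rows. The DataFrame, the column name "text", and the na=False argument are assumptions added for illustration; na=False makes missing values count as non-matches so the mask is purely boolean.

```python
import re

import pandas as pd

# Hypothetical data, including a missing value
df = pd.DataFrame({"text": ["Apple pie", "banana split", "CHERRY tart", None]})
lst = ["apple", "cherry"]
pattern = "|".join(re.escape(s) for s in lst)

# Boolean mask: True where any substring occurs, ignoring case.
mask = df["text"].str.contains(pattern, case=False, na=False)

# Index the DataFrame with the mask to keep only the matching rows.
filtered = df[mask]
print(filtered["text"].tolist())
```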
This approach offers improved performance, especially for large dataframes. Consider the following example:
import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]
col = pd.Series(strings)

esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

# The filtering operation being timed
col.str.contains(pattern, case=False)
Using this optimized approach, the filtering operation takes approximately 1 second for 50,000 rows and 100 substrings, significantly faster than combining individual str.contains calls with a logical OR. The performance difference becomes even more pronounced for larger DataFrames and longer substring lists.
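To check the difference on your own machine, a rough timing harness might look like the sketch below. It deliberately uses a smaller data set and a small ASCII alphabet (both assumptions, chosen so that the run is quick and some rows actually match), and it uses time.perf_counter for a single-run measurement. Timings depend on hardware and on the pandas version, so no expected numbers are given.

```python
import re
import time
from random import choices, seed

import numpy as np
import pandas as pd

seed(321)
alphabet = "abcdef"  # assumption: small alphabet so that some rows match

# 20 substrings of 4 characters, 5000 strings of 20 characters
lst = ["".join(choices(alphabet, k=4)) for _ in range(20)]
col = pd.Series(["".join(choices(alphabet, k=20)) for _ in range(5000)])

# Approach 1: a single regex alternation over escaped substrings
start = time.perf_counter()
pattern = "|".join(re.escape(s) for s in lst)
regex_mask = col.str.contains(pattern, case=False).to_numpy(dtype=bool)
regex_seconds = time.perf_counter() - start

# Approach 2: one literal str.contains pass per substring, OR-ed together
start = time.perf_counter()
or_mask = np.logical_or.reduce(
    [col.str.contains(s, case=False, regex=False).to_numpy(dtype=bool) for s in lst]
)
or_seconds = time.perf_counter() - start

print(f"regex: {regex_seconds:.4f}s, logical OR: {or_seconds:.4f}s")
```

Both approaches should produce the same mask, so comparing regex_mask and or_mask is a useful sanity check before trusting either timing.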