Counting Unique Values in Groups with Pandas
When working with datasets containing multiple variables grouped into categories, it often becomes necessary to determine the number of unique values associated with each group. Pandas, a widely used Python library for data manipulation, offers several methods to count unique values within groups.
One common need is to count the number of unique identifiers within each domain. Given a data frame with columns for ID and domain, we seek to obtain a result that displays the count of unique IDs for each domain.
Specifically, considering the data:
    ID        domain
0  123        vk.com
1  123        vk.com
2  123   twitter.com
3  456        vk.com
4  456  facebook.com
5  456        vk.com
6  456    google.com
7  789   twitter.com
8  789        vk.com
We aim to achieve the following output:
domain        count
vk.com            3
twitter.com       2
facebook.com      1
google.com        1
To achieve this, we can use the nunique() method within a Pandas groupby operation. Grouping the data frame by the domain column and then applying nunique() to the ID column yields the count of unique IDs for each domain. The result is a Series indexed by domain:
df = df.groupby('domain')['ID'].nunique()
print(df)
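As a self-contained illustration of the grouping step described above, the following sketch rebuilds the sample data and counts the distinct IDs per domain. The variable name unique_ids is an illustrative choice, not something the article prescribes.

```python
import pandas as pd

# Sample data reproduced from the table in this article.
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Group by domain only, then count the distinct IDs inside each group.
# The result is a Series whose index holds the domain names.
unique_ids = df.groupby("domain")["ID"].nunique()

print(unique_ids)
```

Note that groupby sorts the group keys by default, so the domains appear in alphabetical order rather than by count; calling sort_values(ascending=False) on the result is one way to list the most shared domains first.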
However, in some datasets the domain names may be wrapped in stray characters such as single quotes, so that a quoted and an unquoted spelling of the same domain would be counted as separate groups. To handle such cases, we can use str.strip("'") to remove leading and trailing single quotes before grouping and counting. This can be implemented as:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print(df)
Alternatively, the stripped domain Series can be passed directly to the data frame's groupby as the grouping key, with the ID column selected afterwards:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
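To see why the cleanup matters, the sketch below compares the counts with and without stripping. The data frame raw and its quoted values are hypothetical, made up for illustration only; they are not part of the article's sample data.

```python
import pandas as pd

# Hypothetical messy input: some domains are wrapped in single quotes.
raw = pd.DataFrame({
    "ID": [123, 123, 456, 456, 789],
    "domain": ["'vk.com'", "vk.com", "'vk.com'", "'google.com'", "vk.com"],
})

# Without cleaning, "'vk.com'" and "vk.com" end up in different groups.
uncleaned = raw.groupby("domain")["ID"].nunique()

# Strip leading and trailing single quotes, then use the cleaned Series
# as the grouping key. The original column is left unchanged.
cleaned_key = raw["domain"].str.strip("'")
cleaned = raw.groupby(cleaned_key)["ID"].nunique()

print(uncleaned)
print(cleaned)
```

Keep in mind that str.strip only removes characters at the start and end of each string; a quote in the middle of a value would need a different approach, such as str.replace.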
To keep domain as a regular column in the result rather than as the index, we can pass as_index=False to groupby() and then aggregate with agg():
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
This returns a data frame with a domain column and an ID column, where ID now holds the number of unique IDs associated with each domain. To label that column count, as in the target output, rename it, for example with .rename(columns={'ID': 'count'}).
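One way to get the count label directly is named aggregation, a variation on the dict-based agg() call shown above and not the form used in this article. The sketch below assumes pandas 0.25 or later, where the keyword=(column, function) syntax is available; the name result is illustrative.

```python
import pandas as pd

# Sample data reproduced from the table in this article.
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# as_index=False keeps "domain" as an ordinary column.
# Named aggregation: the keyword ("count") becomes the output column name,
# and the tuple gives the source column and the aggregation to apply.
result = (
    df.groupby("domain", as_index=False)
      .agg(count=("ID", "nunique"))
)

print(result)
```

Because the output column is called count, access it with result["count"]; attribute access (result.count) would return the DataFrame.count method instead.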