How to Count Unique Values in Groups with Pandas?

Patricia Arquette
2024-10-18 15:52:03

Counting Unique Values in Groups with Pandas

When working with datasets containing multiple variables grouped into categories, it often becomes necessary to determine the number of unique values associated with each group. Pandas, a widely used Python library for data manipulation, offers several methods to count unique values within groups.

One common need is to count the number of unique identifiers within each domain. Given a data frame with columns for ID and domain, we seek to obtain a result that displays the count of unique IDs for each domain.

Specifically, considering the data:

    ID        domain
0  123        vk.com
1  123        vk.com
2  123   twitter.com
3  456        vk.com
4  456  facebook.com
5  456        vk.com
6  456    google.com
7  789   twitter.com
8  789        vk.com

We aim to achieve the following output:

domain        count
vk.com            3
twitter.com       2
facebook.com      1
google.com        1

To achieve this, group the data frame by the domain column, select the ID column, and call the nunique() method on it. nunique() counts the distinct values in each group, ignoring missing values by default. Group by domain only: adding ID to the grouping keys would turn every (domain, ID) pair into its own group instead of counting the IDs per domain. The result is a Series indexed by domain:

counts = df.groupby('domain')['ID'].nunique()

print(counts)
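As a self-contained sketch, the sample data above can be built and counted as follows. The variable name counts and the final sort by count are illustrative choices rather than anything required by pandas.

```python
import pandas as pd

# Sample data from the table above.
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Group by domain only, then count the distinct IDs in each group.
counts = df.groupby("domain")["ID"].nunique()

# groupby orders the groups by key; sort by count to list the largest first.
counts = counts.sort_values(ascending=False)
print(counts)
```

Keeping the result in a separate variable leaves df untouched, so the later examples can still group the original rows.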

In some datasets the domain names are wrapped in stray single quotes, so that 'vk.com' and vk.com would be treated as different groups. To handle such cases, call the str.strip("'") method on the domain column to remove leading and trailing single quotes, and group by the cleaned values:

counts = df.ID.groupby([df.domain.str.strip("'")]).nunique()

print(counts)

Equivalently, we can group the data frame itself by the stripped domain values and then select the ID column:

df.groupby(df.domain.str.strip("'"))['ID'].nunique()
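The effect of stripping the quotes is easiest to see side by side. The following sketch uses a small hypothetical data frame (quoted, not part of the original data) in which some domains carry stray single quotes.

```python
import pandas as pd

# Hypothetical variant of the sample data with stray single quotes.
quoted = pd.DataFrame({
    "ID": [123, 123, 456, 789],
    "domain": ["'vk.com'", "vk.com", "'vk.com'", "'twitter.com'"],
})

# Without cleaning, "'vk.com'" and "vk.com" end up in separate groups.
raw_counts = quoted.groupby("domain")["ID"].nunique()

# Stripping leading/trailing single quotes merges them into one group.
clean_counts = quoted.groupby(quoted["domain"].str.strip("'"))["ID"].nunique()

print(raw_counts)
print(clean_counts)
```

Note that str.strip("'") only removes quotes at the start and end of each value; a quote in the middle of a domain name would be left in place.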

To keep domain as a regular column instead of the index, pass as_index=False to groupby() and aggregate the ID column with agg():

result = df.groupby(by='domain', as_index=False).agg({'ID': 'nunique'})

print(result)

This returns a data frame with two columns: domain, and ID, which now holds the number of unique IDs associated with each domain. The count column keeps the name ID unless it is renamed, for example with result.rename(columns={'ID': 'count'}).
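To produce a column literally named count, as in the target output, one option is named aggregation, which is available in pandas 0.25 and later. The sketch below rebuilds the sample data so that it runs on its own; the names result and the sort order are illustrative choices.

```python
import pandas as pd

# Sample data from the table at the start of the article.
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
})

# Named aggregation: output column "count" = number of unique values of "ID".
result = (
    df.groupby("domain", as_index=False)
      .agg(count=("ID", "nunique"))
      .sort_values("count", ascending=False)
      .reset_index(drop=True)
)
print(result)
```

A similar result can be obtained from the Series version with df.groupby("domain")["ID"].nunique().reset_index(name="count").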
