
Pandas 'count(distinct)' equivalent

python
dataframe
groupby
aggregation
by Anton Shumikhin · Mar 2, 2025
TLDR

Similar to SQL's COUNT(DISTINCT) function, chaining .groupby() with .nunique() in Pandas returns distinct counts per group.

unique_counts = df.groupby('key')['value'].nunique()

This line generates a Series that counts unique values per 'key' from the 'value' column.
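A minimal, runnable sketch of the one-liner above, using a toy DataFrame invented for illustration:

```python
import pandas as pd

# Toy data: two keys, with duplicate values within each group
df = pd.DataFrame({
    'key':   ['a', 'a', 'a', 'b', 'b'],
    'value': [1,   1,   2,   3,   3],
})

# Distinct count of 'value' per 'key' -- SQL: COUNT(DISTINCT value) ... GROUP BY key
unique_counts = df.groupby('key')['value'].nunique()
print(unique_counts)
# key
# a    2
# b    1
```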

For a DataFrame output with the flexibility for added aggregation functions, use .agg():

unique_counts_df = df.groupby('key').agg({'value': 'nunique'}) # DataFrame? more like DataFriend!

Count it different ways: Groupby and Aggregation

Using 'nunique' to count uniqueness

Want to perform distinct counts on several columns at once? The agg() method is your friend:

# The more, the merrier!
result = df.groupby('YEARMONTH').agg({
    'CLIENTCODE': 'nunique',       # unique client codes
    'TRANSACTIONTYPE': 'nunique',  # unique transaction types
    'TOTALSALES': 'sum'            # sums the total sales
})

Preserving your index

groupby() turns your grouping keys into the index of the resulting DataFrame:

result.reset_index(inplace=True) # "I want my index back, index back, index back..."

Resetting the index retains your DataFrame's structure and usability for future operations.

Going flat after aggregation

After performing aggregation with multi-level indexing, it can be helpful to flatten your DataFrame:

# Killin' it "flat" out!
result.columns = ['_'.join(col).strip() for col in result.columns.values]

Joining each column tuple with an underscore turns MultiIndex columns like ('sales', 'nunique') into flat, user-friendly names like 'sales_nunique'.
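Multi-level columns typically appear when you pass a list of functions to agg(). A small sketch, with hypothetical region/sales data:

```python
import pandas as pd

# Hypothetical sales data for illustration
df = pd.DataFrame({
    'region': ['east', 'east', 'west'],
    'sales':  [10, 20, 30],
})

# A list of functions yields MultiIndex columns: ('sales', 'nunique'), ('sales', 'sum')
result = df.groupby('region').agg({'sales': ['nunique', 'sum']})

# Flatten ('sales', 'nunique') -> 'sales_nunique'
result.columns = ['_'.join(col).strip() for col in result.columns.values]
print(result.columns.tolist())
# ['sales_nunique', 'sales_sum']
```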

Deep dive: Pandas transform method

Retaining DataFrame shape with 'transform'

Need the original DataFrame's shape and a new column with the count of unique values? transform has got your back:

# Transformer: DataFrames in disguise!
df['unique_count'] = df.groupby('key')['value'].transform('nunique')

Using transform computes the distinct count within each group and broadcasts it back to every row, returning a Series with the same length as the original DataFrame.
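To see the broadcasting in action, here is a small self-contained sketch (column names invented for illustration):

```python
import pandas as pd

# Toy frame: group 'a' has two distinct values, group 'b' has one
df = pd.DataFrame({
    'key':   ['a', 'a', 'b'],
    'value': [1,   2,   1],
})

# transform broadcasts each group's distinct count back to every row
df['unique_count'] = df.groupby('key')['value'].transform('nunique')

# Row count is unchanged; 'a' rows get 2, the 'b' row gets 1
print(df['unique_count'].tolist())
# [2, 2, 1]
```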

Aggregating with transform

Pandas transform lets you utilize various aggregate functions:

# Max sells! Oops, I meant 'sales'!
df['max_sales'] = df.groupby('region')['sales'].transform('max')

This broadcasts each region's maximum sales value across its corresponding rows.

Alternatives and special cases

Outside of groupby: Counting unique values

For performing distinct count operations across an entire column:

# 'CLIENTCODE' goes solo!
unique_client_count = df['CLIENTCODE'].nunique()

This returns the total count of unique clients in the dataset.

More than just a pretty face: 'value_counts()'

For quick checks, value_counts() returns a frequency of unique values:

# Counting on you, 'CLIENTCODE'!
frequencies = df['CLIENTCODE'].value_counts()

Note that value_counts() returns the frequency of each value, not the number of distinct values — for the latter, stick with nunique().
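The difference in a quick sketch, with made-up client codes:

```python
import pandas as pd

# Hypothetical client codes
s = pd.Series(['C1', 'C1', 'C2'], name='CLIENTCODE')

frequencies = s.value_counts()  # how often each code appears: C1 -> 2, C2 -> 1
distinct = s.nunique()          # how many different codes exist: 2
```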

Handling complexity: Beyond the basics

When SQL is not available

If an actual database is inaccessible for deploying SQL queries, Pandas provides a rich set of capabilities for complex operations. Emulating a COUNT(DISTINCT) operation is just a taste of the powerful tools Pandas offers.

Dealing with large datasets

With large datasets, distinct counts can become a performance bottleneck. Consider chunking your data, or running an initial .drop_duplicates() before counting.
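One way to apply the drop_duplicates() idea: deduplicate the key/value pairs first, after which a plain size() per group equals the distinct count. A sketch with hypothetical transaction data:

```python
import pandas as pd

# Hypothetical transactions; the pair (YEARMONTH, CLIENTCODE) may repeat
df = pd.DataFrame({
    'YEARMONTH':  [202401, 202401, 202401, 202402],
    'CLIENTCODE': ['C1',   'C1',   'C2',   'C1'],
})

# Deduplicate the pairs, then size() per group == distinct client count
unique_clients = (
    df.drop_duplicates(['YEARMONTH', 'CLIENTCODE'])
      .groupby('YEARMONTH')
      .size()
)
# 202401 -> 2 distinct clients, 202402 -> 1
```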

Null and missing values

By default, nunique() ignores null values (dropna=True), mirroring SQL's COUNT(DISTINCT), which skips NULLs. To include nulls in the count, pass dropna=False.
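A quick demonstration of the dropna flag, on a tiny made-up Series:

```python
import pandas as pd
import numpy as np

s = pd.Series(['C1', 'C2', np.nan])

s.nunique()              # 2 -- NaN ignored, like SQL COUNT(DISTINCT)
s.nunique(dropna=False)  # 3 -- NaN counted as its own value
```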