Pandas 'count(distinct)' equivalent
Similar to the COUNT(DISTINCT)
function in SQL, .groupby()
and .nunique()
in Pandas return distinct counts.
This line generates a Series that counts unique values per 'key' from the 'value' column.
For a DataFrame output with the flexibility for added aggregation functions, use .agg()
:
Count it different ways: Groupby and Aggregation
Using 'nunique' to count uniqueness
Might want to perform multiple distinct count operations on several columns? The agg() method is your friend:
Preserving your index
groupby()
turns your keys into indexes in the new DataFrame:
Resetting the index retains your DataFrame's structure and usability for future operations.
Going flat after aggregation
After performing aggregation with multi-level indexing, it can be helpful to flatten your DataFrame:
This results in more user-friendly column naming for complex operations.
Deep dive: Pandas transform method
Retaining DataFrame shape with 'transform'
Need the original DataFrame's shape and a new column with the count of unique values? transform
has got your back:
Using transform applies the distinct count within each group, resulting in a Series that matches the DataFrame's shape.
Aggregating with transform
Pandas transform
lets you utilize various aggregate functions:
This creates the maximum sales value for each region, repeated across the corresponding rows.
Alternatives and special cases
Outside of groupby: Counting unique values
For performing distinct count operations across an entire column:
This returns the total count of unique clients in the dataset.
More than just a pretty face: 'value_counts()'
For quick checks, value_counts()
returns a frequency of unique values:
Note, value_counts()
counts frequency rather than categories.
Handling complexity: Beyond the basics
When SQL is not available
If an actual database is inaccessible for deploying SQL queries, Pandas provides a rich set of capabilities for complex operations. Emulating a COUNT(DISTINCT)
operation is just a taste of the powerful tools Pandas offer.
Dealing with large datasets
With large datasets, performance can become an issue with distinct counts, consider to chunk your data, or do an initial .drop_duplicates()
before counting.
Null and missing values
Pandas counts null values as non-unique, in line with SQL's COUNT(DISTINCT)
. To count nulls as unique, replace nulls with a unique identifier.
Was this article helpful?