
Counting unique values in a column in pandas dataframe like in Qlik?

python
dataframe
pandas
data-analysis
by Anton Shumikhin · Oct 21, 2024
TLDR

Accurately quantify distinct elements in a pandas dataframe column using nunique() for the distinct count or value_counts() for the frequency distribution:

import pandas as pd

# Simulating your real-world data
df = pd.DataFrame({'col': ['X', 'Y', 'X', 'Z', 'Y', 'X']})

# Getting our hands dirty to count distinct elements
distinct_count = df['col'].nunique()

# Rolling up our sleeves to get frequency of each unique value
frequency = df['col'].value_counts()

print(distinct_count)  # You'll see: 3
print(frequency)

Output:

col
X    3
Y    2
Z    1
Name: count, dtype: int64
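Heads up: value_counts() quietly drops NaN by default. If missing values deserve their own bucket, pass dropna=False. A quick sketch on a small column with a None thrown in:

s = pd.Series(['X', 'Y', None, 'X'])
print(s.value_counts())              # NaN silently dropped
print(s.value_counts(dropna=False))  # NaN gets its own row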

Tailored techniques

To master the art of data analysis, you'll need to be dexterous with the queries used to calculate different counts. Let's delve into a few handy techniques:

No nulls, no problem!

On occasion, you'll want to exclude null values when counting. That's when count() knocks on the door:

# Null values? Not on my watch!
non_null_count = df['col'].count()

Embracing nulls

But hey, sometimes nulls can't be discarded and should be included in the counting spree. Reach for the size attribute in such cases:

# All aboard, including nulls!
total_count = df['col'].size
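A minimal sketch contrasting the two on a column with one missing value:

import numpy as np

s = pd.Series(['X', 'Y', np.nan])
print(s.count())  # 2, NaN excluded
print(s.size)     # 3, NaN included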

Conditions, conditions, conditions

Boolean indexing is your stealth weapon to land conditional counts:

# Counting X's but excluding imposters (nulls)
x_count = df[df['col'] == 'X']['col'].count()
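Since True is just 1 in disguise, summing the boolean mask gives the same conditional count in one pass (comparisons against NaN come out False, so nulls still stay out):

# Equivalent one-liner on the TLDR dataframe
x_count = (df['col'] == 'X').sum()  # 3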

A toolbox, not a tool

The agg() function is like a potluck dinner: every function can join the feast:

# One tool, multiple uses: story of `agg()`
count_summary = df.agg({'col': ['count', 'nunique', 'size']})
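For the TLDR dataframe, that call lines up one row per aggregation, roughly this:

print(df.agg({'col': ['count', 'nunique', 'size']}))
#          col
# count      6
# nunique    3
# size       6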

Filter 'n' Count

Filter your data with query() before counting to zero in on exactly the rows you care about:

# Counting after filtering using `query()`: because clean data is happy data
filtered_count = df.query('col == "X"')['col'].count()

Counting marathon across the dataframe

To run a counting marathon across multiple columns, flex the nunique() muscle without specifying any particular column:

# Count of unique values across the dataframe: because variety is the spice of data!
all_unique_counts = df.nunique()
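With a second column bolted on (the 'num' column here is made up purely for illustration), nunique() reports one distinct count per column:

df2 = pd.DataFrame({'col': ['X', 'Y', 'X'], 'num': [1, 1, 2]})
print(df2.nunique())
# col    2
# num    2
# dtype: int64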

Count within groupby()

When clubbing similar data points with groupby(), use count() to tally the rows falling under each umbrella:

# Counting group-wise: segregate, count, repeat!
# (assumes your dataframe has a 'value' column to count)
grouped_counts = df.groupby('col')['value'].count()
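And if what you're really after is Qlik's count(distinct ...) per dimension, swap count() for nunique() inside the groupby. A sketch with hypothetical 'group' and 'value' columns:

df3 = pd.DataFrame({'group': ['A', 'A', 'B'], 'value': [1, 1, 2]})
distinct_per_group = df3.groupby('group')['value'].nunique()
# group
# A    1
# B    1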

Remember the golden rule: always check your dataframe for duplicate entries with df.drop_duplicates(), or your counts may come out inflated!

Counting... in style!

Pandas reserves quite a few arrows in its quiver to tackle complex data scenarios. Let's learn to shoot 'em straight!

Stacking for crisp viewing

groupby() can be paired with unstack() to reshape the counts into a crisp cross-table:

# Paint a better canvas with stacked counts
# (assumes hypothetical 'col1' and 'col2' columns; unstack() needs a MultiIndex)
stacked_counts = df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
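pd.crosstab() reaches the same table in a single call. A runnable sketch with two made-up columns:

df4 = pd.DataFrame({'col1': ['A', 'A', 'B'], 'col2': ['X', 'Y', 'X']})
print(pd.crosstab(df4['col1'], df4['col2']))
# col2  X  Y
# col1
# A     1  1
# B     1  0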

Count uniques via pivot_table

If summarizing data and counting unique values are your goals, then pivot_table is the basket to keep all your eggs in:

# Create a pivot and count on the fly!
# (assumes hypothetical 'col1' and 'col2' columns, plus at least one value column)
pivot_count = df.pivot_table(index='col1', columns='col2', aggfunc='nunique')
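A runnable sketch, with a made-up 'val' column playing the role of the thing being counted:

df5 = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'col2': ['X', 'Y', 'X'],
                    'val':  [1, 2, 1]})
print(df5.pivot_table(index='col1', columns='col2',
                      values='val', aggfunc='nunique'))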

Efficient handling of large data

When dealing with large datasets, use chunksize to navigate without blowing your memory budget:

# Handle large data with `chunksize`. Because size matters!
# This returns an iterator of dataframe chunks, not the counts themselves
chunks = pd.read_csv('large_dataset.csv', chunksize=10000)
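The iterator alone doesn't count anything yet: you aggregate per chunk and merge the results. A sketch assuming large_dataset.csv has the same 'col' column used throughout:

unique_values = set()
for chunk in pd.read_csv('large_dataset.csv', chunksize=10000):
    unique_values.update(chunk['col'].dropna().unique())
print(len(unique_values))  # distinct count, without loading the whole file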