Select distinct values from a PySpark DataFrame column like so:
df.select("column").distinct().show()
This line gets the unique values in "column" and shows them.
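If you want something you can paste into a shell and run end to end, here is a minimal sketch; the app name, sample rows, and the literal column name "column" are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-demo").getOrCreate()

# Toy DataFrame with a deliberately repeated value
df = spark.createDataFrame([("rose",), ("tulip",), ("rose",)], ["column"])

# Shows "rose" and "tulip" exactly once each
df.select("column").distinct().show()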
Efficient usage: Prevent Titanic-like disaster
When dealing with large datasets, being cautious about resource usage is crucial. You want to avoid filling your memory and sinking your program (like the Titanic).
# Preview before you leap (like checking your parachute before jumping)
df.select("column").distinct().show()
# Instead of collect()'ing all the data, limit() your greed - PySpark isn't unlimited breadsticks at Olive Garden
df.select("column").distinct().limit(20).collect()
# When you need to use the distinct list often, assign it to a variable
distinct_df = df.select("column").distinct()
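If that distinct_df really does get reused across several actions, caching it spares Spark from recomputing the shuffle every time - a small sketch under that assumption:
# Cache the distinct values when more than one action needs them
distinct_df = df.select("column").distinct().cache()
distinct_df.count()    # first action materializes the cache
distinct_df.show(5)    # later actions reuse the cached result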
Distinct value hacks for large datasets
Handling large datasets requires special tactics. Watch out for those sneaky out-of-memory errors!
# Don't go about smacking duplicates without discrimination. Only do it for the columns you need.
df.dropDuplicates(['col'])
# If you need to RDD-ify your data for further operations after getting unique values
df.select('column').distinct().rdd.map(lambda r: r[0]).collect()
# List comprehension sounds fancy, but it's just another way to write loops. Here's how you get a list of distinct values
unique_values_list = [i[0] for i in df.select('column').distinct().collect()]
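As a quick illustration of where that list tends to end up, here is a hypothetical follow-up that feeds the collected values back into a filter; the column name and values are placeholders:
from pyspark.sql import functions as F

unique_values_list = [row[0] for row in df.select('column').distinct().collect()]
print(unique_values_list)    # e.g. ['rose', 'tulip']

# Reuse the list, for instance to keep only rows whose value made the cut
filtered_df = df.filter(F.col('column').isin(unique_values_list))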
Boot the duplicates without the aggregation headache
Simplify things by using .dropDuplicates() and skipping groupby() or countDistinct() entirely - the result is more readable code.
# Are you just finding distinct values? Forgo the pains of groupby() or countDistinct()
df.dropDuplicates(['col'])
# Unsure if you're in Inception? To get unique combinations over several columns
df.dropDuplicates(['col1', 'col2'])
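To make the difference concrete, here is a small sketch on a made-up two-column DataFrame (spark is the session from earlier); note that dropDuplicates(['col1']) keeps whole rows (one arbitrary survivor per value of col1), while select('col1').distinct() keeps only that column:
df2 = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 1)],
    ["col1", "col2"],
)

df2.dropDuplicates(["col1"]).show()           # one full row per distinct col1
df2.dropDuplicates(["col1", "col2"]).show()   # unique (col1, col2) combinations
df2.select("col1").distinct().show()          # just the distinct col1 values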
Jedi strategies for distinct values
Various methods suitable for your distinct needs - pick the one that fits your style:
# Are you a set lover? Here's how to collect values as a set, but remember, sets don't like duplicates
from pyspark.sql import functions as F
df.select("column").agg(F.collect_set("column"))
# Raise a Pandas in PySpark. Convert your DataFrame to Pandas and apply unique(). Just don't do kung fu with it
df.toPandas()["column"].unique()
# Preview before you commit to a collect(). Remember, commitment is serious business
df.select("column").distinct().show()
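If you are curious what the collect_set() route actually hands back, here is a small sketch; it pulls the aggregated array onto the driver, so the usual memory caveats apply:
from pyspark.sql import functions as F

# collect_set() folds the distinct values into a single array column
row = df.select(F.collect_set("column").alias("uniques")).first()
print(row["uniques"])    # e.g. ['rose', 'tulip'] - order is not guaranteed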
Need more power than .distinct()?
Here are some possible scenarios where .distinct() might not be enough:
# Do you have complex column types like arrays or maps? Consider using explode() together with distinct()
from pyspark.sql.functions import explode
df.select(explode('column').alias('value')).distinct()
# Null values can be a pain - they sneak in when you least expect. Decide how to handle them
df.filter(df['column'].isNotNull()).select('column').distinct()
# Need custom deduplication rules? Bring out the big guns - use User-Defined Functions (UDFs)
from pyspark.sql.functions import udf
custom_dedup = udf(lambda x: ...)  # your brilliant logic here
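If you do reach for a UDF, a common pattern is normalizing values before deduplicating. The lower-and-strip rule below is purely illustrative - swap in whatever your brilliant logic actually is:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical rule: two values count as the same once lower-cased and stripped
normalize = udf(lambda x: x.strip().lower() if x is not None else None, StringType())

df.select(normalize("column").alias("normalized")).distinct().show()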
SQL-inclined strategies
For those who speak fluent SQL, PySpark lets you run SQL expressions directly:
df.createOrReplaceTempView("flowers")
unique_flowers_df = spark.sql("SELECT DISTINCT `Flower species` FROM flowers")
# Justice has been served - if you find SQL charges you less
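For completeness, a small end-to-end sketch of the SQL route; the view name and the second column (`Color`) are placeholders, not something your DataFrame necessarily has:
df.createOrReplaceTempView("flowers")

# DISTINCT works across one or several columns, just like dropDuplicates()
unique_pairs_df = spark.sql("SELECT DISTINCT `Flower species`, `Color` FROM flowers")
unique_pairs_df.show()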