
Show distinct column values in a PySpark DataFrame

python
pandas
dataframe
distinct-values
by Alex Kataev · Mar 6, 2025
TLDR
Select distinct values from a PySpark DataFrame column like so:

```python
df.select("column").distinct().show()
```

This line gets the unique values in "column" and shows them.

Efficient usage: Prevent Titanic-like disaster

When dealing with large datasets, being cautious about resource usage is crucial. You want to avoid filling your memory and sinking your program (like the Titanic).

```python
# Preview before you leap (like checking your parachute before jumping)
df.select("column").distinct().show()

# Instead of collect()'ing all the data, limit() your greed - PySpark isn't unlimited breadsticks at Olive Garden
df.select("column").distinct().limit(20).collect()

# When you need to use the distinct list often, assign it to a variable
distinct_df = df.select("column").distinct()
```
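If that distinct DataFrame gets reused across several actions, caching it spares Spark from recomputing the shuffle each time. A minimal sketch, assuming the same `df` and column name as above:

```python
# Cache the distinct values so repeated actions reuse the computed result
distinct_df = df.select("column").distinct().cache()

distinct_df.count()   # first action materializes the cache
distinct_df.show(20)  # later actions read from the cached data

distinct_df.unpersist()  # release the memory when you're done
```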

Distinct value hacks for large datasets

Handling large datasets requires special tactics. Watch out for those sneaky out of memory errors!

```python
# Don't go about smacking duplicates without discrimination. Only do it for the columns you need.
df.dropDuplicates(['col'])

# If you need to RDD-ify your data for further operations after getting unique values
df.select('column').distinct().rdd.map(lambda r: r[0]).collect()

# List comprehension sounds fancy, but it's just another way to write loops. Here's how you get a list of distinct values
unique_values_list = [i[0] for i in df.select('column').distinct().collect()]
```
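If even the distinct values are too many to collect() comfortably, you can stream them back to the driver one partition at a time instead. A sketch, assuming the same `df` and column name:

```python
# Iterate over distinct values without pulling the whole result into driver memory at once
for row in df.select('column').distinct().toLocalIterator():
    print(row['column'])  # replace with whatever per-value processing you need
```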

Boot the duplicates without the aggregation headache

Simplify your code: when all you need is unique rows, .dropDuplicates() is more readable than reaching for groupby() or countDistinct().

```python
# Are you just finding distinct values? Forgo the pains of groupby() or countDistinct()
df.dropDuplicates(['col'])

# Unsure if you're in Inception? To get unique combinations over several columns
df.dropDuplicates(['col1', 'col2'])
```
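To see the difference between distinct() and dropDuplicates() in action, here's a small self-contained sketch; the column names (`name`, `city`) are made up purely for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame(
    [("Alice", "Paris"), ("Alice", "Paris"), ("Alice", "Lyon"), ("Bob", "Paris")],
    ["name", "city"],
)

people.distinct().show()                # drops rows duplicated across *all* columns
people.dropDuplicates(["name"]).show()  # keeps one (arbitrary) row per distinct name
```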

Jedi strategies for distinct values

Various methods suitable for your distinct needs - pick the one that fits your style:

```python
from pyspark.sql import functions as F

# Are you a set lover? Here's how to collect values as a set, but remember, sets don't like duplicates
df.select("column").agg(F.collect_set("column"))

# Raise a Pandas in PySpark. Convert your DataFrame to Pandas and apply unique(). Just don't do kung fu with it
df.toPandas().column.unique()

# Preview before you commit to a collect(). Remember, commitment is serious business
df.select("column").distinct().show()
```
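Note that the collect_set route hands you back a one-row DataFrame, so if what you actually want is the Python list, you still need to pull it out of that row. A small sketch, same assumed `df` and column name:

```python
from pyspark.sql import functions as F

# agg() returns a one-row DataFrame; first()[0] extracts the deduplicated values themselves
unique_values = df.agg(F.collect_set("column")).first()[0]
print(unique_values)
```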

Need more power than .distinct()?

Here are some possible scenarios where .distinct() might not be enough:

```python
# Do you have complex column types like arrays or maps? Consider using explode() together with distinct()
df.select('column').distinct()

# Null values can be a pain - they sneak in when you least expect. Decide how to handle them
df.filter(df.column.isNotNull()).select('column').distinct()

# Need custom deduplication rules? Bring out the big guns - use User-Defined Functions (UDFs)
from pyspark.sql.functions import udf
custom_dedup = udf(lambda x: ...)  # your brilliant logic here
```
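The first comment above mentions explode() without showing it, so here is a hedged sketch of flattening an array column before deduplicating its elements; the column name `tags` is made up for illustration:

```python
from pyspark.sql import functions as F

# One row per array element, then deduplicate the elements themselves
df.select(F.explode("tags").alias("tag")).distinct().show()
```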

SQL-inclined strategies

For those who speak SQL, PySpark also lets you use SQL expressions directly:

```python
df.createOrReplaceTempView("flowers")
unique_flowers_df = spark.sql("SELECT DISTINCT `Flower species` FROM flowers")
# Justice has been served - if you find SQL charges you less
```