Select distinct values from a PySpark DataFrame column like so:
df.select("column").distinct().show()
This line gets the unique values in "column" and shows them.
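If you want something you can paste into a shell and run end to end, here is a minimal sketch; the app name, sample rows, and the literal column name "column" are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-demo").getOrCreate()

# Toy DataFrame with a deliberately repeated value
df = spark.createDataFrame([("rose",), ("tulip",), ("rose",)], ["column"])

# Shows "rose" and "tulip" exactly once each
df.select("column").distinct().show()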
Efficient usage: Prevent Titanic-like disaster
When dealing with large datasets, being cautious about resource usage is crucial. You want to avoid filling your memory and sinking your program (like the Titanic).
# Preview before you leap (like checking your parachute before jumping)
df.select("column").distinct().show()
# Instead of collect()'ing all the data, limit() your greed - PySpark isn't unlimited breadsticks at Olive Garden
df.select("column").distinct().limit(20).collect()
# When you need to use the distinct list often, assign it to a variable
distinct_df = df.select("column").distinct()
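If that distinct_df really does get reused across several actions, caching it spares Spark from recomputing the shuffle every time - a small sketch under that assumption:
# Cache the distinct values when more than one action needs them
distinct_df = df.select("column").distinct().cache()
distinct_df.count()    # first action materializes the cache
distinct_df.show(5)    # later actions reuse the cached result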
Distinct value hacks for large datasets
Handling large datasets requires special tactics. Watch out for those sneaky out-of-memory errors!
# Don't go about smacking duplicates without discrimination. Only do it for the columns you need.
df.dropDuplicates(['col'])
# If you need to RDD-ify your data for further operations after getting unique values
df.select('column').distinct().rdd.map(lambda r: r[0]).collect()
# List comprehension sounds fancy, but it's just another way to write loops. Here's how you get a list of distinct values
unique_values_list = [i[0] for i in df.select('column').distinct().collect()]
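As a quick illustration of where that list tends to end up, here is a hypothetical follow-up that feeds the collected values back into a filter; the column name and values are placeholders:
from pyspark.sql import functions as F

unique_values_list = [row[0] for row in df.select('column').distinct().collect()]
print(unique_values_list)    # e.g. ['rose', 'tulip']

# Reuse the list, for instance to keep only rows whose value made the cut
filtered_df = df.filter(F.col('column').isin(unique_values_list))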
Boot the duplicates without the aggregation headache
Simplify things by using .dropDuplicates() and skipping groupby() or countDistinct() entirely - the result is more readable code.
# Are you just finding distinct values? Forgo the pains of groupby() or countDistinct()
df.dropDuplicates(['col'])
# Unsure if you're in Inception? To get unique combinations over several columns
df.dropDuplicates(['col1', 'col2'])
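To make the difference concrete, here is a small sketch on a made-up two-column DataFrame (spark is the session from earlier); note that dropDuplicates(['col1']) keeps whole rows (one arbitrary survivor per value of col1), while select('col1').distinct() keeps only that column:
df2 = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 1)],
    ["col1", "col2"],
)

df2.dropDuplicates(["col1"]).show()           # one full row per distinct col1
df2.dropDuplicates(["col1", "col2"]).show()   # unique (col1, col2) combinations
df2.select("col1").distinct().show()          # just the distinct col1 values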
Jedi strategies for distinct values
Various methods suitable for your distinct needs - pick the one that fits your style:
# Are you a set lover? Here's how to collect values as a set, but remember, sets don't like duplicates
from pyspark.sql import functions as F
df.select("column").agg(F.collect_set("column"))
# Raise a Pandas in PySpark. Convert your DataFrame to Pandas and apply unique(). Just don't do kung fu with it
df.toPandas()["column"].unique()
# Preview before you commit to a collect(). Remember, commitment is serious business
df.select("column").distinct().show()
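If you are curious what the collect_set() route actually hands back, here is a small sketch; it pulls the aggregated array onto the driver, so the usual memory caveats apply:
from pyspark.sql import functions as F

# collect_set() folds the distinct values into a single array column
row = df.select(F.collect_set("column").alias("uniques")).first()
print(row["uniques"])    # e.g. ['rose', 'tulip'] - order is not guaranteed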
Need more power than .distinct()?
Here are some possible scenarios where .distinct() might not be enough:
# Do you have complex column types like arrays or maps? Consider using explode() together with distinct()
from pyspark.sql.functions import explode
df.select(explode('column').alias('value')).distinct()
# Null values can be a pain - they sneak in when you least expect. Decide how to handle them
df.filter(df['column'].isNotNull()).select('column').distinct()
# Need custom deduplication rules? Bring out the big guns - use User-Defined Functions (UDFs)
from pyspark.sql.functions import udf
custom_dedup = udf(lambda x: ...)  # your brilliant logic here
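If you do reach for a UDF, a common pattern is normalizing values before deduplicating. The lower-and-strip rule below is purely illustrative - swap in whatever your brilliant logic actually is:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical rule: two values count as the same once lower-cased and stripped
normalize = udf(lambda x: x.strip().lower() if x is not None else None, StringType())

df.select(normalize("column").alias("normalized")).distinct().show()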
SQL-inclined strategies
For those who speak fluent SQL, PySpark lets you run SQL expressions directly:
df.createOrReplaceTempView("flowers")
unique_flowers_df = spark.sql("SELECT DISTINCT `Flower species` FROM flowers")
# Justice has been served - if you find SQL charges you less
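For completeness, a small end-to-end sketch of the SQL route; the view name and the second column (`Color`) are placeholders, not something your DataFrame necessarily has:
df.createOrReplaceTempView("flowers")

# DISTINCT works across one or several columns, just like dropDuplicates()
unique_pairs_df = spark.sql("SELECT DISTINCT `Flower species`, `Color` FROM flowers")
unique_pairs_df.show()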