
Concatenate columns in Apache Spark DataFrame

python
dataframe
pandas
concatenate
by Alex Kataev · Feb 2, 2025
TLDR

Quickly concatenate columns in a Spark DataFrame using the concat function from pyspark.sql.functions. Here's a basic setup to get started:

from pyspark.sql.functions import concat, col

# Assuming a DataFrame 'df' with columns 'FirstName' and 'LastName'
df.select(concat(col("FirstName"), col("LastName")).alias("FullName")).show()

This creates a "FullName" column by joining "FirstName" and "LastName". Think of it as the power-couple of our DataFrame!

Handling separators

When the columns need a little space, use the concat_ws function to include a separator. It stands for "concatenate with separator":

from pyspark.sql.functions import concat_ws, col

# Concatenating with a space as separator
df.select(concat_ws(" ", col("FirstName"), col("LastName")).alias("FullName")).show()

Spaces, bringing columns closer since ASCII Character 32!

Tackling NULLs: Don't let missing values ruin your day

Data is imperfect and sometimes contains null values, and concat returns NULL as soon as any of its inputs is NULL. You can't just ignore your problems, right? To keep the concatenation smooth, use coalesce (below) or when and otherwise (see the sketch after this example):

from pyspark.sql.functions import coalesce, concat, lit, col

# Swap NULLs for empty strings before concatenating
df.withColumn(
    "FullName",
    concat(coalesce(col("FirstName"), lit("")), coalesce(col("LastName"), lit("")))
)

Null values getting an identity crisis resolved with empty strings!
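If you prefer the when/otherwise route mentioned above, here is a minimal sketch (assuming the same df):

from pyspark.sql.functions import when, concat, col, lit

# Sketch: swap NULLs for empty strings with when/otherwise, then concatenate
df.withColumn(
    "FullName",
    concat(
        when(col("FirstName").isNull(), lit("")).otherwise(col("FirstName")),
        when(col("LastName").isNull(), lit("")).otherwise(col("LastName"))
    )
)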

User Defined Function: For when built-in isn't enough

Are the standard functions not cutting it? Time for a User Defined Function (UDF):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def custom_concat(a, b):
    # Define your custom logic here (guarding against NULLs)
    if a is None or b is None:
        return None
    return a.strip() + " " + b.strip()

concat_udf = udf(custom_concat, StringType())
df.withColumn("FullName", concat_udf(col("FirstName"), col("LastName")))

Congratulations! You've taken destiny into your own hands (functions)!
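Keep in mind that Python UDFs are opaque to Spark's optimizer, so reach for them only when the built-ins truly fall short. If you also want the custom logic available inside selectExpr or spark.sql, registering the UDF by name is one option; a small sketch (the name custom_concat_sql is just an assumption for illustration):

from pyspark.sql.types import StringType

# Sketch: register the UDF under a name so SQL expressions can call it
spark.udf.register("custom_concat_sql", custom_concat, StringType())
df.selectExpr("custom_concat_sql(FirstName, LastName) as FullName").show()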

Deep Dive into Spark Concatenation

Leveraging selectExpr for concise code

This compact approach lets SQL expressions do the heavy lifting with less syntax:

df.selectExpr("concat(FirstName, LastName) as FullName").show()

Less code, shorter Stack Overflow questions!
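Need the separator back? A string literal inside the expression does the trick; a quick sketch with the same assumed df:

# Sketch: slip a literal space between the columns inside the SQL expression
df.selectExpr("concat(FirstName, ' ', LastName) as FullName").show()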

Direct SQL Queries: The Sequel to SQL

Who says you can't teach an old SQL new tricks?

df.createOrReplaceTempView("people")
spark.sql("SELECT CONCAT(FirstName, LastName) AS FullName FROM people").show()

You can now write SQL in the middle of your Python code, perhaps we've invented PySQL.
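Spark SQL also understands CONCAT_WS, so the separator trick carries over; a sketch against the same people view:

# Sketch: CONCAT_WS adds the separator for us in plain SQL
spark.sql("SELECT CONCAT_WS(' ', FirstName, LastName) AS FullName FROM people").show()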

String concatenation with ||

A go-to shortcut for concatenating string columns:

df.selectExpr("FirstName || LastName as FullName").show()

Who knew pipes (||) were the duct tape of SQL!
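To get a space between the pipes, chain a string literal; a quick sketch:

# Sketch: chain || with a literal space between the columns
df.selectExpr("FirstName || ' ' || LastName as FullName").show()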