
Querying Spark SQL DataFrame with complex types

python
dataframe
higher-order-functions
data-structures
by Nikita Barsukov · Dec 22, 2024
TLDR

Spark DataFrames can contain complex types, nested or otherwise. Use the explode() function along with dot notation to access their elements:

from pyspark.sql.functions import explode

# Ha! Putting off that workout payoff
flattened_df = df.select(explode("arrayCol").alias("elements"))

Reach into nested structures directly using dot syntax:

# It's like an adventure into the Inception movie!
df.select("structCol.field").show()

Access and manipulate complex types in Spark DataFrames effortlessly with these techniques.
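
For an end-to-end taste, here is a minimal, self-contained sketch (the column and field names id, events, kind, and ts are made up for illustration) that combines explode() with dot notation on an array of structs:

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: each row holds an array of structs
df = spark.createDataFrame([
    Row(id=1, events=[Row(kind="click", ts=10), Row(kind="view", ts=12)]),
    Row(id=2, events=[Row(kind="click", ts=99)]),
])

# explode() flattens the array; dot notation reaches into each struct
(df.select("id", explode("events").alias("event"))
   .select("id", "event.kind", "event.ts")
   .show())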

Bring out the big guns: higher-order functions

Working with arrays or maps? Fret not. Spark provides higher-order functions like transform, filter, and aggregate. They are handy tools for drilling into the data:

from pyspark.sql.functions import expr

# TRANSFORM! Autobots, roll out!
df.withColumn("transformedArray", expr("transform(arrayCol, x -> x + 1)"))

# Swipe left, swipe right; it's like Tinder for arrays!
df.withColumn("filteredArray", expr("filter(arrayCol, x -> x > 10)"))

# It's not hoarding if it's data!
df.withColumn("aggregatedValue", expr("aggregate(arrayCol, 0, (acc, x) -> acc + x)"))

These are high-precision tools for element-level operations within array fields.
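
On Spark 3.1+ you can skip the SQL expression strings and call the same higher-order functions straight from the Python API. A sketch, assuming the same integer array column arrayCol as above:

from pyspark.sql import functions as F

# Same operations, but with Python lambdas instead of SQL lambda syntax
df2 = (df
    .withColumn("transformedArray", F.transform("arrayCol", lambda x: x + 1))
    .withColumn("filteredArray", F.filter("arrayCol", lambda x: x > 10))
    .withColumn("aggregatedValue", F.aggregate("arrayCol", F.lit(0), lambda acc, x: acc + x)))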

Handling JSON columns and map fields

Behold, Spark SQL provides functions like get_json_object and from_json to efficiently extract and manipulate data embedded in JSON strings and map fields:

from pyspark.sql.functions import get_json_object, from_json

# Unleash the power of JSON!
df.select(get_json_object(df.jsonColumn, '$.key').alias("value"))

# Unveiling the map inside the JSON
df.withColumn("parsedMap", from_json(df.jsonStringColumn, schema))
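
The from_json call above expects a schema object. A minimal sketch of one possible shape, assuming jsonStringColumn holds a flat JSON object of string values:

from pyspark.sql.types import MapType, StringType

# Hypothetical schema: treat the JSON object as a string -> string map
schema = MapType(StringType(), StringType())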

Plus, you can pull a value out of a map with dot syntax, and if that value is a struct, expand its fields with the wildcard (*):

# Who's the map's favourite singer? Alicia Keys!
df.select("mapColumn.key.*")
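
When you need every entry of a map rather than a single key, exploding it into key/value rows is often handier. A sketch, again assuming a map column named mapColumn:

from pyspark.sql.functions import explode, map_keys, map_values

# One row per (key, value) pair in the map
df.select(explode("mapColumn").alias("key", "value")).show()

# Or keep the row shape and just pull the keys / values out as arrays
df.select(map_keys("mapColumn"), map_values("mapColumn")).show()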

Converting RDDs with complex types to DataFrames

Have an RDD with complex types? Switch to DataFrames and own your queries like a boss:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# SQL's got back. With a Struct.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("data", StringType(), True)  # complex? Bring it on!
])

# Non-stick RDD to DataFrame!
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(rdd, schema)

Register that bad boy as a temporary view, and now it's your SQL playground:

df.createOrReplaceTempView("my_temp_view")  # Name it anything, just don't call it late for dinner.
spark.sql("SELECT * FROM my_temp_view").show()
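
Once the view exists, ordinary Spark SQL handles complex types too. A sketch, assuming the view exposes an id column and an array-of-structs column named events (as in the TLDR example):

# LATERAL VIEW explode() flattens the array; dot notation digs into each struct
spark.sql("""
    SELECT id, e.kind, e.ts
    FROM my_temp_view
    LATERAL VIEW explode(events) ev AS e
""").show()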

Taming the nested fields beast with case classes

Have nested fields lurking in your data? In Scala, create a case class representing the data structure, then convert an RDD to a DataFrame using toDF:

case class MyStructure(nestedField: AnotherStructure)

val myRDD = sc.parallelize(Seq(MyStructure(...)))
val myDataFrame = myRDD.toDF() // Can't touch this!

This lets you maintain type safety and manage complex nested data effectively.