How to get input file name as column in AWS Athena external tables
To extract the file name directly from an AWS Athena query, you can employ the $path
pseudo column. Here's how you incorporate it in your query. Be sure to adjust your_table
to your actual table name:
Now, for each row in your query result, you have the S3 file path as a new column.
Dealing with large datasets
For those handling an enormous catalogue of files in S3 (imagine 1000+), Athena can be your saving grace. Here's how you can use $path
for efficient file name retrieval across large datasets:
Get the file path and record count for each file. It's like taking a roll call for your data.
Decoding the file path
Keep in mind, $path
gives you an S3 URI to the file, not just the slim and trim file name. This might be more than you bargained for, so you can use substring functions to extract just the file name:
In the substring function, we are locating the last "/" in the string and extracting the rest. It's like asking someone for their name and they give you their whole life story, but you politely interrupt and ask again: "No, just the name please."
Why choose $path
?
Ditch the INPUT__FILE__NAME
from the Hive days and embrace the $path
in Athena for efficient file location retrieval. It's giving you the same result, only more Athena-friendly.
Mind your performance
Two things to look out for:
$path
can slow down your query execution- Athena charges you based on scanned data
Optimize by partitioning your data in S3 if you're querying large datasets. Here's an example:
Here, we've partitioned by year and month. It's like organizing your fridge - everything is easier to find when sorted.
Keep it clean
Don't let $path
create a mess in your results. Use aliases and select only the columns that you need:
It's like Marie Kondo for your query data – spark joy and keep only what you need!
Was this article helpful?