Explain Codes LogoExplain Codes Logo

How to get input file name as column in AWS Athena external tables

sql
aws
data-lake
query-optimization
Alex KataevbyAlex Kataev·Aug 18, 2024
TLDR

To extract the file name directly from an AWS Athena query, you can employ the $path pseudo column. Here's how you incorporate it in your query. Be sure to adjust your_table to your actual table name:

SELECT "$path" as source_file, * FROM your_table;

Now, for each row in your query result, you have the S3 file path as a new column.

Dealing with large datasets

For those handling an enormous catalogue of files in S3 (imagine 1000+), Athena can be your saving grace. Here's how you can use $path for efficient file name retrieval across large datasets:

SELECT "$path" as file_source, COUNT(*) as records FROM data_lake GROUP BY "$path";

Get the file path and record count for each file. It's like taking a roll call for your data.

Decoding the file path

Keep in mind, $path gives you an S3 URI to the file, not just the slim and trim file name. This might be more than you bargained for, so you can use substring functions to extract just the file name:

SELECT SUBSTRING("$path", CHAR_LENGTH('$path')-POSITION('/' IN REVERSE('$path')) + 2) as file_name, * FROM your_table;

In the substring function, we are locating the last "/" in the string and extracting the rest. It's like asking someone for their name and they give you their whole life story, but you politely interrupt and ask again: "No, just the name please."

Why choose $path?

Ditch the INPUT__FILE__NAME from the Hive days and embrace the $path in Athena for efficient file location retrieval. It's giving you the same result, only more Athena-friendly.

Mind your performance

Two things to look out for:

  1. $path can slow down your query execution
  2. Athena charges you based on scanned data

Optimize by partitioning your data in S3 if you're querying large datasets. Here's an example:

SELECT "$path", count(*) FROM partitioned_table WHERE year = '2022' AND month = '01' GROUP BY "$path";

Here, we've partitioned by year and month. It's like organizing your fridge - everything is easier to find when sorted.

Keep it clean

Don't let $path create a mess in your results. Use aliases and select only the columns that you need:

SELECT SUBSTRING("$path", CHAR_LENGTH('$path')-POSITION('/' IN REVERSE('$path')) + 2) as file_name, column1, column2 FROM table_name;

It's like Marie Kondo for your query data – spark joy and keep only what you need!