How to get input file name as column in AWS Athena external tables

sql

aws

data-lake

query-optimization

byAlex Kataev·Aug 18, 2024

To extract the file name directly from an AWS Athena query, you can employ the $path pseudo column. Here's how you incorporate it in your query. Be sure to adjust your_table to your actual table name:

SELECT "$path" as source_file, *
FROM your_table;

Now, for each row in your query result, you have the S3 file path as a new column.

Dealing with large datasets

For those handling an enormous catalogue of files in S3 (imagine 1000+), Athena can be your saving grace. Here's how you can use $path for efficient file name retrieval across large datasets:

SELECT "$path" as file_source, COUNT(*) as records
FROM data_lake
GROUP BY "$path";

Get the file path and record count for each file. It's like taking a roll call for your data.

Decoding the file path

Keep in mind, $path gives you an S3 URI to the file, not just the slim and trim file name. This might be more than you bargained for, so you can use substring functions to extract just the file name:

SELECT SUBSTRING("$path", CHAR_LENGTH('$path')-POSITION('/' IN REVERSE('$path')) + 2) as file_name, 
       *
FROM your_table;

In the substring function, we are locating the last "/" in the string and extracting the rest. It's like asking someone for their name and they give you their whole life story, but you politely interrupt and ask again: "No, just the name please."

Why choose `$path`?

Ditch the INPUT__FILE__NAME from the Hive days and embrace the $path in Athena for efficient file location retrieval. It's giving you the same result, only more Athena-friendly.

Mind your performance

Two things to look out for:

$path can slow down your query execution
Athena charges you based on scanned data

Optimize by partitioning your data in S3 if you're querying large datasets. Here's an example:

SELECT "$path", count(*)
FROM partitioned_table
WHERE year = '2022' AND month = '01'
GROUP BY "$path";

Here, we've partitioned by year and month. It's like organizing your fridge - everything is easier to find when sorted.

Keep it clean

Don't let $path create a mess in your results. Use aliases and select only the columns that you need:

SELECT
  SUBSTRING("$path", CHAR_LENGTH('$path')-POSITION('/' IN REVERSE('$path')) + 2) as file_name, 
  column1, 
  column2
FROM table_name;

It's like Marie Kondo for your query data – spark joy and keep only what you need!

explain-codes / Sql / How to get input file name as column in AWS Athena external tables

Linked

Save Dataframe to csv directly to s3 Python



Just get column names from a Hive table



Athena Greater Than Condition in Date Column



Get everything after and before certain character in SQL Server



How to write a file or data to an S3 object using boto3



Get table column names in MySQL?



Check if a key exists in a bucket in S3 using boto3



Dealing with large datasets Decoding the file path Why choose ?Mind your performance Keep it clean