Run Crawler, Run!
- Joshua Botha
- Oct 13 (updated Oct 14)
The reality is that data pipelines often aren’t built with the latest and greatest technologies. More often than not, the brief is to stay within certain limitations while still saving costs wherever possible.

Let’s set the stage, then: imagine multiple sources being ingested daily, where each ingestion is a full snapshot of the source at the time of ingestion, and the schema doesn’t change often. Typically, the data is partitioned by year, month and day (e.g. …/year=2025/month=10/day=14/), and it needs to be queryable in Athena. But simply writing an S3 object to the correct location isn’t enough, because the Glue Catalog metadata needs to be made aware of the new data. So what to do?
AWS Glue Crawlers
The simplest solution is to use Crawlers that run daily. The new data partition will be discovered and added to the metadata of the Glue Catalog table. The dirty secret about Crawlers, though, is this: each run is billed for a minimum of 10 minutes’ worth of DPU time. Do this for a few hundred tables on a daily basis and you’re in for a surprisingly large bill. Luckily, there are alternatives.

Glue Table Writes
One solution is to update the Glue Catalog metadata at the same moment as the write to S3. Luckily, there’s a way to do that from a Glue job: instead of sinking the data to S3 directly, the sink operation is run against the Glue table itself.
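A minimal sketch of what that looks like in a PySpark Glue job; the database, table, and bucket names here are placeholders, and dynamic_frame stands in for whatever DynamicFrame the job produced earlier:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# dynamic_frame is assumed to have been built earlier in the job.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/my-table/",        # hypothetical target location
    enableUpdateCatalog=True,               # register new partitions on write
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="my_table")
sink.writeFrame(dynamic_frame)
```

Because enableUpdateCatalog is set, the new partition is added to the catalog as part of the write itself, so the data is queryable the moment the job finishes.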

See this AWS Blog documentation for more info.
AWS API
Another alternative is to make use of the AWS API to update the Glue Catalog directly. Using the Python boto3 SDK, this can be done with the create_partition function.
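A minimal sketch; the database, table, and bucket names are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_partition(
    DatabaseName="my_db",
    TableName="my_table",
    PartitionInput={
        "Values": ["2025", "10", "14"],  # year, month, day
        "StorageDescriptor": {
            # Deceptively simple -- see below for what actually belongs here.
            "Location": "s3://my-bucket/my-table/year=2025/month=10/day=14/",
        },
    },
)
```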

The convenience of this approach hides a challenge, however: PartitionInput requires far more than just the new S3 location. For example, it also contains metadata about the data schema. This can be solved by using the get_table or get_partitions function to retrieve the existing metadata and populate the needed fields.
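For example, the partition can inherit the table’s own StorageDescriptor, overriding only the location (again a sketch with placeholder names):

```python
import copy

import boto3

glue = boto3.client("glue")

# Reuse the table's StorageDescriptor (schema, SerDe, format) for the partition.
table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
storage_descriptor = copy.deepcopy(table["StorageDescriptor"])
storage_descriptor["Location"] = "s3://my-bucket/my-table/year=2025/month=10/day=14/"

glue.create_partition(
    DatabaseName="my_db",
    TableName="my_table",
    PartitionInput={
        "Values": ["2025", "10", "14"],
        "StorageDescriptor": storage_descriptor,
    },
)
```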
Partition Projection
“Surely AWS must cater for this use case?”, I hear you ask. Why yes, yes they do. To some extent. Instead of the partition-metadata mechanism, a Glue Catalog table can be set to use partition projection. Athena then ignores the partition metadata and uses the projection table properties to work out which S3 locations it should try to read from:
projection.enabled = true
projection.year.type = enum
projection.year.values = 2023,2024,2025
storage.location.template = …/year=${year}/…
When Athena runs a query, it will look for data in the specified locations for the three listed years. The same can then be done for the month and day partitions.
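Applying these properties with the update_table function might look like the following sketch, again with placeholder names. Note that get_table returns read-only fields (CreateTime, CreatedBy, and so on) that update_table rejects, so only the keys TableInput accepts are copied across:

```python
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]

# TableInput accepts only a subset of what get_table returns.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
table_input = {k: v for k, v in table.items() if k in allowed}

table_input.setdefault("Parameters", {}).update({
    "projection.enabled": "true",
    "projection.year.type": "enum",
    "projection.year.values": "2023,2024,2025",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    # Hypothetical bucket/path; the ${...} placeholders are filled in by Athena.
    "storage.location.template":
        "s3://my-bucket/my-table/year=${year}/month=${month}/day=${day}/",
})

glue.update_table(DatabaseName="my_db", TableInput=table_input)
```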

Each of these options comes with its own set of pros and cons that need to be weighed carefully. In one use case, we took the API approach, running a Lambda daily that dynamically updates all the table partitions before the day's ETLs kick off. This reduced the enterprise's Glue DPU hours by 94%, in addition to cutting S3 GetObject call costs. An added benefit was that there was no delay between the write to S3 and the data being queryable in Athena.
Special thanks to Jean-Pierre Pienaar and Paul Zietsman at cloudandthings.io for their contributions.