
Run Crawler, Run!

The reality is that data pipelines often aren't built with the latest and greatest technologies. More often, the brief is to stay within certain limitations while still keeping costs as low as possible.

Let's set the stage, then: imagine multiple sources being ingested daily. Each ingestion is a full snapshot of the data source at the time of ingestion, and the schema of the data doesn't change often. Typically, the data will be partitioned by year, month and day, and it needs to be queryable in Athena. But simply writing an S3 object to the correct location isn't enough: the Glue Catalog metadata needs to be aware of the new data. So what to do?

AWS Glue Crawlers

The simplest solution is to use Crawlers that run daily. Each run finds the new data partition and adds it to the metadata of the Glue Catalog table. The dirty secret about Crawlers, though, is this: each run is billed for a minimum of 10 minutes' worth of DPUs. Do this for a few hundred tables on a daily basis and you're in for a surprisingly large bill. Luckily, there are alternatives.
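To get a feel for how the 10-minute minimum adds up, here is a rough back-of-the-envelope sketch. The table count and the DPUs per run are made-up inputs, and it assumes one Crawler run per table per day; plug in your own numbers and the current DPU-hour price from the AWS pricing page.

```python
def minimum_billed_dpu_hours(tables, dpus_per_run, runs_per_day=1, minimum_minutes=10):
    """Lower bound on daily Crawler DPU-hours if every run only hits the billing minimum.

    All arguments are illustrative inputs, not figures from AWS or from this post.
    """
    return tables * runs_per_day * dpus_per_run * minimum_minutes / 60


# Hypothetical estate: 300 tables, one Crawler run each per day, 2 DPUs per run.
daily = minimum_billed_dpu_hours(tables=300, dpus_per_run=2)
print(f"{daily:.1f} DPU-hours per day, {daily * 30:.0f} per 30-day month")
```

Even if every Crawler finishes in seconds, the billed time never drops below this floor.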

Glue table writes

One solution is to update the Glue Catalog metadata at the same moment as the write to S3. There's a way to do that from a Glue job: instead of sinking the data to S3 directly, the sink operation is run against the Glue table, which registers any new partitions as part of the write. See the AWS Glue documentation on updating the Data Catalog from ETL jobs for more info.
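As a minimal sketch of what that can look like inside a Glue job, the helper below uses the getSink options that AWS documents for catalog updates (enableUpdateCatalog, updateBehavior, partitionKeys). The function name, its parameters and the default format are illustrative assumptions; glue_context is expected to be the job's GlueContext and frame a DynamicFrame, so this only runs for real inside the Glue runtime.

```python
def write_with_catalog_update(glue_context, frame, s3_path, database, table,
                              partition_keys, data_format="json"):
    """Write a DynamicFrame to S3 and register new partitions in the same step.

    Hypothetical helper: glue_context is an awsglue GlueContext created by the job.
    """
    sink = glue_context.getSink(
        connection_type="s3",
        path=s3_path,
        enableUpdateCatalog=True,              # update the Glue Catalog on write
        updateBehavior="UPDATE_IN_DATABASE",   # apply changes to the existing table
        partitionKeys=partition_keys,          # e.g. ["year", "month", "day"]
    )
    sink.setFormat(data_format)
    sink.setCatalogInfo(catalogDatabase=database, catalogTableName=table)
    sink.writeFrame(frame)
    return sink
```

Parquet output has extra requirements around the Glue Parquet writer when catalog updates are enabled, so check the AWS documentation before changing data_format.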

AWS API

Another alternative is to use the AWS API to update the Glue Catalog directly. Using the Python boto3 SDK this can be done with the create_partition function. The convenience hides a challenge, however: the PartitionInput requires far more than just the new S3 location. It also contains metadata about the data schema. This can be solved by calling get_table or get_partitions to retrieve the existing metadata and using it to populate the required fields.
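Here is a minimal sketch of that idea, assuming a Hive-style key=value folder layout under the table's own S3 location. The helper names are made up for illustration; glue_client is expected to be a boto3 Glue client (boto3.client("glue")), and error handling, such as a partition that already exists, is left out.

```python
import copy


def build_partition_input(table, values):
    """Build a PartitionInput by reusing the table's own StorageDescriptor.

    `table` is the "Table" dict returned by get_table; `values` are the partition
    values in the same order as the table's PartitionKeys, e.g. ["2024", "01", "15"].
    """
    keys = [key["Name"] for key in table["PartitionKeys"]]
    if len(keys) != len(values):
        raise ValueError(f"expected {len(keys)} partition values, got {len(values)}")

    # Assumed layout: <table location>/year=2024/month=01/day=15/
    base = table["StorageDescriptor"]["Location"].rstrip("/")
    suffix = "/".join(f"{key}={value}" for key, value in zip(keys, values))

    # Copy the schema metadata (columns, formats, SerDe) and only swap the location.
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    descriptor["Location"] = f"{base}/{suffix}/"
    return {"Values": [str(value) for value in values], "StorageDescriptor": descriptor}


def add_partition(glue_client, database, table_name, values):
    """Register one new partition without running a Crawler."""
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    partition_input = build_partition_input(table, values)
    glue_client.create_partition(
        DatabaseName=database, TableName=table_name, PartitionInput=partition_input
    )
    return partition_input


# Usage, with hypothetical database and table names:
#   import boto3
#   add_partition(boto3.client("glue"), "analytics", "sales", ["2024", "01", "15"])
```

Copying the StorageDescriptor from the table keeps the partition's columns and SerDe settings in line with the table, which works well when the schema rarely changes.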

Partition projection

"Surely AWS must cater for this use case?", I hear you ask. Why yes, yes they do, to some extent. Instead of relying on partition metadata, a Glue Catalog table can be set to use partition projection. Athena then ignores the partition metadata and uses the projection table properties to work out the S3 locations it should try to read from. Projection is configured by adding table properties such as the following:

projection.enabled            = true
projection.year.type          = enum
projection.year.values        = 2023,2024,2025
storage.location.template     = .../year=${year}/...

When Athena runs a query, it will look for data in the specified location for each of the three listed years. The month and day partitions can be configured the same way, and the properties can be applied to the table with the update_table function.
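A minimal sketch of setting this up through the API follows. It assumes a year=/month=/day= folder layout with zero-padded months and days, and the helper names are made up for illustration. Because get_table returns read-only fields that update_table rejects, the table is filtered down to a set of TableInput keys; treat that set as a starting point and check it against the boto3 documentation.

```python
# Keys copied from get_table into TableInput (assumed subset, verify for your tables).
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
    "ViewOriginalText", "ViewExpandedText", "TargetTable",
}


def projection_parameters(base_location, years):
    """Build partition projection properties for year/month/day partitions."""
    base = base_location.rstrip("/")
    return {
        "projection.enabled": "true",
        "projection.year.type": "enum",
        "projection.year.values": ",".join(str(year) for year in years),
        "projection.month.type": "integer",
        "projection.month.range": "1,12",
        "projection.month.digits": "2",   # zero-pad to 01..12
        "projection.day.type": "integer",
        "projection.day.range": "1,31",
        "projection.day.digits": "2",     # zero-pad to 01..31
        "storage.location.template": base + "/year=${year}/month=${month}/day=${day}/",
    }


def enable_projection(glue_client, database, table_name, years):
    """Merge the projection properties into an existing table's parameters."""
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    table_input = {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}
    parameters = dict(table_input.get("Parameters", {}))
    parameters.update(projection_parameters(table["StorageDescriptor"]["Location"], years))
    table_input["Parameters"] = parameters
    glue_client.update_table(DatabaseName=database, TableInput=table_input)
    return table_input
```

With an enum for the year, someone still has to add each new year to the list; an integer range is an alternative if that becomes a chore.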

Each of these options comes with its own set of pros and cons that need to be considered carefully. In one use case, we used the API approach by running a Lambda daily that dynamically updates all the table partitions before the day's ETLs kick off. This reduced the enterprise's Glue DPU hours by 94% and also cut the S3 get-object call costs. An additional benefit is that there is no delay between the write to S3 and the data being queryable in Athena.

Special thanks to Jean-Pierre Pienaar and Paul Zietsman at cloudandthings.io for their contributions.
