If you've been managing AWS environments long enough, you'll know that achieving 95% cost optimization has become almost impossible for most organizations. Between Reserved Instances, Savings Plans, right-sizing recommendations, and unused resources, cloud cost management has become far more complex than it used to be. That's exactly the challenge Foundation Zero set out to solve when they approached us to build their Cloud Financial Operations platform.
The requirement was clear: create a scalable, serverless architecture capable of processing thousands of cost optimization events every night while providing real-time insights and actionable recommendations. The platform needed to handle multi-account AWS environments, generate optimization opportunities, and enable team-based accountability — all while maintaining the performance and reliability expected from a production FinOps solution.
The architectural challenge
Foundation Zero's vision required processing vast amounts of AWS billing data, Cost and Usage Reports (CUR), and resource metadata across multiple customer accounts. The platform needed to:
- Process billing data at scale: analyze thousands of cost optimization opportunities across multiple AWS accounts nightly.
- Real-time insights: generate and surface cost optimization recommendations within seconds of data changes.
- Multi-tenant isolation: ensure complete data separation between different customer organizations.
- Event-driven processing: react to changes in customer environments and billing data automatically.
- Scalable assignment system: support assigning any number of recommendations to team members, with full audit trails.
Traditional architectures would require significant infrastructure management overhead and wouldn't scale cost-effectively with variable workloads. The spiky nature of nightly processing jobs and real-time query demands called for a serverless approach.
Serverless architecture: the foundation
We designed Foundation Zero's architecture around three core serverless components that work together to deliver the platform's capabilities.
Scheduling Service
The Scheduling Service orchestrates the nightly processing of cost optimization events. Built entirely on Amazon EventBridge and AWS Lambda, this service manages complex workflows without requiring dedicated compute resources. EventBridge rules trigger Lambda functions at predetermined intervals, initiating the cost analysis pipeline for each customer account.
The service intelligently manages processing windows to avoid overwhelming downstream services. By leveraging EventBridge's native scheduling capabilities, we eliminated the need for traditional job schedulers while gaining built-in retry logic and error handling. Each scheduled event triggers a Lambda function that validates customer configurations, checks processing prerequisites, and initiates the crawler service for that specific account.
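The validation step described above can be sketched roughly as follows. This is a minimal illustration rather than Foundation Zero's actual code: the `CustomerConfig` fields, the specific checks, and the shape of the crawl request are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CustomerConfig:
    # Hypothetical per-customer settings; the field names are illustrative.
    account_id: str
    role_arn: Optional[str]
    cur_bucket: Optional[str]
    enabled: bool = True


def check_prerequisites(config: CustomerConfig) -> List[str]:
    """Return human-readable problems; an empty list means ready to crawl."""
    problems = []
    if not config.enabled:
        problems.append("customer is disabled")
    if not (config.account_id.isdigit() and len(config.account_id) == 12):
        problems.append("account_id must be a 12-digit AWS account ID")
    if not config.role_arn:
        problems.append("missing cross-account role ARN")
    if not config.cur_bucket:
        problems.append("missing Cost and Usage Report bucket")
    return problems


def build_crawl_request(config: CustomerConfig, run_date: str) -> Dict[str, str]:
    """Build the message handed to the crawler for one account (assumed shape)."""
    problems = check_prerequisites(config)
    if problems:
        raise ValueError(f"{config.account_id}: " + "; ".join(problems))
    return {
        "account_id": config.account_id,
        "role_arn": config.role_arn,
        "cur_bucket": config.cur_bucket,
        "run_date": run_date,
    }
```

In a scheduled Lambda function, a handler would load each customer's configuration, call `build_crawl_request`, and pass the result on to the crawler, logging and skipping any account that fails validation so that one misconfigured customer does not block the rest of the nightly run.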
Crawler Service
The heart of Foundation Zero's data processing capability lies in the Crawler Service. This component processes thousands of cost optimization opportunities by analyzing AWS Cost and Usage Reports, resource configurations, and usage patterns. Lambda functions form the processing backbone, with each function optimized for specific analysis tasks — Reserved Instance recommendations, idle resource detection, or right-sizing calculations. These functions scale automatically based on the volume of data being processed, from a single account to hundreds simultaneously.
SQS queues manage the flow of processing tasks, ensuring reliable message delivery and providing natural backpressure when processing large datasets. Dead letter queues capture any failed processing attempts for investigation without blocking the main pipeline. S3 buckets store intermediate processing results and serve as the data lake for historical analysis, with lifecycle policies automatically archiving old data to reduce storage costs.
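One way to keep a single bad message from blocking the pipeline is Lambda's partial batch response for SQS. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled and that the queue has a redrive policy pointing at a dead letter queue; the `analyze` function and the message shape are placeholders, not the platform's real analysis code.

```python
import json
from typing import Any, Callable, Dict


def analyze(task: Dict[str, Any]) -> None:
    """Placeholder for one analysis task (hypothetical); raises on failure."""
    if "account_id" not in task:
        raise ValueError("task is missing account_id")


def handler(
    event: Dict[str, Any],
    context: Any,
    process: Callable[[Dict[str, Any]], None] = analyze,
) -> Dict[str, Any]:
    """Process an SQS batch and report only the messages that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; the rest of the batch is deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages listed in `batchItemFailures` become visible on the queue again and are retried. Once a message exceeds the queue's `maxReceiveCount`, SQS moves it to the dead letter queue for investigation.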
The crawler implements intelligent batching strategies to optimize Lambda execution time and reduce cold starts. Each function processes batches of resources or cost records, writing results to RDS for immediate querying while also storing detailed analytics data in S3 for long-term analysis.
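As a rough illustration of the batching idea, the helper below splits any stream of records into fixed-size batches. It is a generic sketch: the batch size, and how the crawler actually groups resources or cost records, are not specified in this article.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, `batched(cost_records, 500)` would let one invocation write 500 rows per database round trip instead of one, where `cost_records` and the figure of 500 are purely illustrative.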
Aggregator Service
The Aggregator Service transforms raw cost data into actionable insights and recommendations, combining real-time and batch processing patterns. RDS PostgreSQL provides the relational database foundation, engineered to handle the massive influx of events from the crawler fleet. We optimized PostgreSQL's configuration for high-throughput write operations, implementing connection pooling and batch insert strategies that allow the database to ingest thousands of cost optimization events per minute during peak processing windows.
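A batch insert can be as simple as folding many rows into one parameterized statement. The sketch below builds such a statement using the `%s` placeholder style accepted by common PostgreSQL drivers such as psycopg2; the table and column names are hypothetical, and identifiers are assumed to come from trusted code rather than user input.

```python
from typing import Any, List, Sequence, Tuple


def build_batch_insert(
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
) -> Tuple[str, List[Any]]:
    """Return (sql, params) for a single multi-row INSERT statement."""
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have one value per column")
    # One "(%s, %s, ...)" group per row; values travel separately as params.
    row_placeholder = "(" + ", ".join(["%s"] * width) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params
```

The resulting pair can be passed to a DB-API cursor as `cursor.execute(sql, params)`, so a few hundred events cost one round trip instead of a few hundred.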
Lambda functions behind API Gateway serve real-time queries. They combine multiple efficiency factors (Reserved Instance coverage, Savings Plan utilization, resource optimization opportunities, and spending trends) to generate actionable recommendations.
Security and multi-tenancy through Amazon Cognito
Security and tenant isolation were paramount given the sensitive nature of cloud billing data. Amazon Cognito provides the authentication layer, managing user identities and enforcing access controls. Each customer organization operates within isolated data boundaries, with Cognito user pools ensuring complete separation.
Lambda functions validate the Cognito-issued JWT on every API request and extract the tenant context needed to apply the appropriate data filters. IAM roles and policies further enforce isolation at the infrastructure level, with each Lambda function assuming only the permissions it requires. Cross-account access for reading customer billing data uses temporary credentials from AWS STS, so Foundation Zero never stores long-term customer credentials.
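The two mechanisms in this paragraph can be sketched as below. Several details are assumptions for illustration: the claims are taken to be already signature-verified, the tenant is assumed to live in a custom attribute named `custom:tenant_id`, the `recommendations` table is hypothetical, and the customer role is assumed to require an external ID. The `assume_role` call itself is the standard boto3 STS client method.

```python
from typing import Any, Dict, Mapping, Tuple

TENANT_CLAIM = "custom:tenant_id"  # hypothetical custom attribute name


def tenant_from_claims(claims: Mapping[str, Any]) -> str:
    """Extract the tenant ID from already-verified JWT claims."""
    tenant_id = claims.get(TENANT_CLAIM)
    if not isinstance(tenant_id, str) or not tenant_id:
        raise PermissionError("token carries no tenant context")
    return tenant_id


def scoped_query(tenant_id: str) -> Tuple[str, Tuple[str, ...]]:
    """Every query is filtered by tenant; table and columns are illustrative."""
    sql = "SELECT id, title FROM recommendations WHERE tenant_id = %s"
    return sql, (tenant_id,)


def assume_customer_role(
    sts_client: Any,
    role_arn: str,
    external_id: str,
    session_name: str = "crawler",
) -> Dict[str, Any]:
    """Fetch short-lived credentials for a customer account via STS."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        ExternalId=external_id,
        DurationSeconds=900,  # 15 minutes, the shortest duration STS allows
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken, and Expiration.
    return response["Credentials"]
```

In production the client would be created with `boto3.client("sts")`. Passing it in as an argument keeps the function easy to test with a stub, and the returned credentials expire on their own, so nothing long-lived needs to be stored.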
Processing at scale: thousands of events nightly
During peak processing windows, the system can spawn hundreds of concurrent Lambda executions, each analyzing a different aspect of cloud spend. EventBridge coordinates these services, triggering processing workflows based on data availability and system capacity. SQS queues buffer work between services, so the Crawler Service can process data as fast as possible without overwhelming downstream aggregation. This design keeps performance predictable whether the system is processing ten accounts or ten thousand.
Engineering RDS PostgreSQL for high throughput
The real engineering challenge came in ensuring RDS PostgreSQL could handle the nightly tsunami of events from hundreds of concurrent crawler Lambda functions:
- Connection pooling with RDS Proxy: rather than each Lambda establishing direct connections, RDS Proxy manages a warm pool — reducing connection overhead and letting PostgreSQL handle thousands of invocations without exhausting connection limits.
- Partitioned tables for time-series data: partitioning by date lets PostgreSQL efficiently manage millions of accumulated events, keeping query performance consistent as data grows and enabling efficient archival.
- API Gateway caching: frequently accessed metrics and aggregations are cached with short TTLs, absorbing thousands of API requests per minute without impacting the database's ability to process incoming crawler events.
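To make the partitioning point from the list above concrete, the sketch below generates the DDL for one monthly partition using PostgreSQL's declarative partitioning. The monthly granularity and the `optimization_events` table name are assumptions; the article only states that tables are partitioned by date. The parent table is assumed to have been created with `PARTITION BY RANGE` on a date or timestamp column.

```python
from datetime import date
from typing import Tuple


def month_bounds(day: date) -> Tuple[date, date]:
    """Return the first day of `day`'s month and of the following month."""
    start = day.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def partition_ddl(parent: str, day: date) -> str:
    """Build DDL for the monthly partition that contains `day`."""
    start, end = month_bounds(day)
    name = f"{parent}_{start:%Y_%m}"
    # Range partitions include the lower bound and exclude the upper bound.
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )
```

Running this ahead of each month keeps inserts from failing for lack of a matching partition, and archiving old data becomes a matter of detaching or dropping a whole partition rather than deleting rows.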
Best practices and lessons learned
- Function composition over monoliths: focused functions that do one thing well simplified debugging, shortened cold starts, and enabled fine-grained scaling.
- Asynchronous patterns everywhere: embracing SQS and EventBridge created a resilient system that handles failures gracefully and scales naturally.
- Data partitioning strategies: organizing S3 data by customer, date, and analysis type enabled efficient querying while supporting data lifecycle management.
- Observability from day one: comprehensive CloudWatch metrics and X-Ray tracing provided visibility into the distributed system's behavior.
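The S3 partitioning practice in the list above might translate into an object key layout like the one sketched here. The `key=value` prefix style, the file naming, and the identifiers are assumptions for illustration; the article states only that data is organized by customer, date, and analysis type.

```python
from datetime import date


def result_key(customer_id: str, day: date, analysis_type: str, part: int = 0) -> str:
    """Build an S3 object key partitioned by customer, date, and analysis type."""
    for label, value in (("customer_id", customer_id), ("analysis_type", analysis_type)):
        if not value or "/" in value:
            raise ValueError(f"{label} must be non-empty and contain no '/'")
    return (
        f"customer={customer_id}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
        f"analysis={analysis_type}/part-{part:05d}.json"
    )
```

Putting the customer first keeps each tenant's data under a single prefix, which simplifies both prefix-scoped IAM policies and per-customer lifecycle rules, while the date components let queries prune to the days they need.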
Conclusion
Building Foundation Zero on AWS serverless services proved that complex, data-intensive platforms don't require traditional server-based architectures. The combination of Lambda, RDS PostgreSQL, EventBridge, and supporting services created a scalable, cost-effective solution that processes thousands of cost optimization events nightly while providing real-time insights — and exemplifies the very cloud cost efficiency the platform exists to deliver.
If you're facing similar challenges with scalable data processing platforms, or want to explore how serverless can transform your applications, we'd love to talk.