Building a serverless FinOps platform at cloudandthings.io that processes thousands of events nightly
- Paul Zietsman
- Jul 15
- 6 min read
Updated: Jul 21
If you've been managing AWS environments long enough, you'll know that achieving 95% cost optimization has become almost impossible for most organizations. Between Reserved Instances, Savings Plans, right-sizing recommendations, and unused resources, the complexity of cloud cost management has grown exponentially. That's exactly the challenge Foundation Zero set out to solve when they approached us to build their Cloud Financial Operations platform.
The requirement was clear: create a scalable, serverless architecture capable of processing thousands of cost optimization events every night while providing real-time insights and actionable recommendations. The platform needed to handle multi-account AWS environments, generate optimization opportunities, and enable team-based accountability, all while maintaining the performance and reliability expected from a production FinOps solution.

The architectural challenge
Foundation Zero's vision required processing vast amounts of AWS billing data, Cost and Usage Reports (CUR), and resource metadata across multiple customer accounts. The platform needed to:
- Process billing data at scale: analyze thousands of cost optimization opportunities across multiple AWS accounts nightly
- Deliver real-time insights: generate and surface cost optimization recommendations within seconds of data changes
- Enforce multi-tenant isolation: guarantee complete data separation between customer organizations
- React to events: respond automatically to changes in customer environments and billing data
- Scale the assignment workflow: support any number of recommendations assigned to team members, with full audit trails
Traditional architectures would require significant infrastructure management overhead and wouldn't scale cost-effectively with variable workloads. The spiky nature of nightly processing jobs and real-time query demands called for a serverless approach.
Serverless architecture: The foundation of Foundation Zero
We designed Foundation Zero's architecture around three core serverless components that work together to deliver the platform's capabilities:
Scheduling Service
The Scheduling Service orchestrates the nightly processing of cost optimization events. Built entirely on AWS EventBridge and Lambda, this service manages complex workflows without requiring dedicated compute resources. EventBridge rules trigger Lambda functions at predetermined intervals, initiating the cost analysis pipeline for each customer account.
The service intelligently manages processing windows to avoid overwhelming downstream services. By leveraging EventBridge's native scheduling capabilities, we eliminated the need for traditional job schedulers while gaining built-in retry logic and error handling. Each scheduled event triggers a Lambda function that validates customer configurations, checks processing prerequisites, and initiates the crawler service for that specific account.
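As a rough sketch of how that staggering could work (the base hour, window sizes, and function names here are illustrative assumptions, not Foundation Zero's actual configuration):

```python
from datetime import datetime, timedelta

# Hypothetical helper: stagger nightly processing start times across
# customer accounts so the crawler fleet does not hit downstream
# services all at once.
def processing_window(account_index: int,
                      base_hour: int = 1,
                      stagger_minutes: int = 5,
                      window_minutes: int = 30):
    """Return (start, end) UTC times for one account's nightly run."""
    base = datetime.utcnow().replace(hour=base_hour, minute=0,
                                     second=0, microsecond=0)
    start = base + timedelta(minutes=account_index * stagger_minutes)
    return start, start + timedelta(minutes=window_minutes)

def schedule_expression(start: datetime) -> str:
    """Build an EventBridge cron() schedule expression for a window start."""
    # EventBridge cron fields: minutes hours day-of-month month day-of-week year
    return f"cron({start.minute} {start.hour} * * ? *)"
```

Each computed start time can then back an EventBridge rule (for example via boto3's `events.put_rule(Name=..., ScheduleExpression=...)`), with the rule targeting the Lambda that kicks off that account's pipeline.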
Crawler Service
The heart of Foundation Zero's data processing capability lies in the Crawler Service. This component processes thousands of cost optimization opportunities by analyzing AWS Cost and Usage Reports, resource configurations, and usage patterns. The service architecture combines several AWS services:
Lambda functions form the processing backbone, with each function optimized for specific analysis tasks—Reserved Instance recommendations, idle resource detection, or right-sizing calculations. These functions scale automatically based on the volume of data being processed, from a single account to hundreds simultaneously.
SQS queues manage the flow of processing tasks, ensuring reliable message delivery and providing natural backpressure when processing large datasets. Dead letter queues capture any failed processing attempts for investigation without blocking the main pipeline.
S3 buckets store intermediate processing results and serve as the data lake for historical analysis. By leveraging S3's lifecycle policies, we automatically archive old data to reduce storage costs while maintaining compliance requirements.
The crawler implements intelligent batching strategies to optimize Lambda execution time and reduce cold starts. Each function processes batches of resources or cost records, writing results to RDS for immediate querying while also storing detailed analytics data in S3 for long-term analysis.
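A minimal sketch of that batching pattern (the record shapes and helper names are hypothetical):

```python
import json

def batches(items: list, size: int = 10):
    """Yield fixed-size batches; 10 matches the SQS SendMessageBatch limit."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def to_sqs_entries(resources: list) -> list:
    """Shape one batch of resource records into SendMessageBatch entries."""
    return [{"Id": str(i), "MessageBody": json.dumps(r)}
            for i, r in enumerate(resources)]

# In the real crawler, each batch would then be sent with boto3, e.g.:
#   sqs = boto3.client("sqs")
#   for batch in batches(resources):
#       sqs.send_message_batch(QueueUrl=queue_url,
#                              Entries=to_sqs_entries(batch))
```

Grouping records before invoking downstream work keeps each Lambda busy for a meaningful slice of time, which amortizes cold-start cost across many records instead of paying it per record.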
Aggregator Service
The Aggregator Service transforms raw cost data into actionable insights and recommendations. This service combines real-time and batch processing patterns to deliver immediate visibility while handling complex calculations.
RDS PostgreSQL provides the relational database foundation, engineered to handle the massive influx of events from the crawler fleet. We optimized PostgreSQL's configuration for high-throughput write operations, implementing connection pooling and batch insert strategies that allow the database to ingest thousands of cost optimization events per minute during peak processing windows. The database stores aggregated metrics, user assignments, and team hierarchies, enabling complex queries for the account catalog feature.
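To illustrate the batch-insert idea (the table and column names below are hypothetical, not the platform's real schema):

```python
# Collapsing many single-row INSERTs into one multi-row statement cuts
# network round trips dramatically during the nightly load.
def build_batch_insert(table: str, columns: list, row_count: int) -> str:
    """Build a parameterized multi-row INSERT statement."""
    placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([placeholders] * row_count)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values}"

# With psycopg2, the same effect comes ready-made from execute_values:
#   from psycopg2.extras import execute_values
#   execute_values(cur,
#       "INSERT INTO cost_events (account_id, kind, saving) VALUES %s",
#       rows)
```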
Lambda functions connected to API Gateway serve real-time queries, analyzing cost optimization opportunities by combining multiple efficiency factors. These calculations consider Reserved Instance coverage, Savings Plan utilization, resource optimization opportunities, and spending trends to generate actionable recommendations.
The aggregation pipeline processes results from the Crawler Service, updating team and application-level metrics while maintaining historical trends. This enables the platform's filtering capabilities, allowing users to analyze cost optimization opportunities across different organizational dimensions.
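To make the idea of combining efficiency factors concrete, a toy version of such a blended score might look like this (the factor names and weights are illustrative assumptions, not the platform's actual algorithm):

```python
def efficiency_score(ri_coverage: float,
                     sp_utilization: float,
                     rightsizing_ratio: float,
                     weights=(0.4, 0.4, 0.2)) -> float:
    """Blend efficiency factors (each in 0..1) into a single 0..100 score."""
    factors = (ri_coverage, sp_utilization, rightsizing_ratio)
    if not all(0.0 <= f <= 1.0 for f in factors):
        raise ValueError("factors must be in [0, 1]")
    # Weighted average scaled to a percentage-style score.
    return round(100 * sum(w * f for w, f in zip(weights, factors)), 1)
```

In practice each factor would itself be derived from CUR line items and resource metadata, and the weighting would be tuned per customer segment.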
Security and multi-tenancy through AWS Cognito
Security and tenant isolation were paramount given the sensitive nature of cloud billing data. AWS Cognito provides the authentication layer, managing user identities and enforcing access controls. Each customer organization operates within isolated data boundaries, with Cognito user pools ensuring complete separation.
Lambda functions validate JWT tokens from Cognito on every API request, extracting tenant context to apply appropriate data filters. This pattern ensures that even if multiple organizations' data is processed simultaneously, there's no possibility of data leakage between tenants.
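Conceptually, the tenant-scoping step reduces to something like the following (the claim name `custom:tenant_id` is an assumption; in production the JWT signature must first be verified against the Cognito user pool's JWKS before any claim is trusted):

```python
# Illustrative only: `claims` is assumed to be the dict of already-verified
# JWT claims produced by the token validation step.
def tenant_from_claims(claims: dict) -> str:
    tenant = claims.get("custom:tenant_id")
    if not tenant:
        raise PermissionError("token carries no tenant context")
    return tenant

def scoped_filter(claims: dict) -> dict:
    """Build the filter parameters every downstream query must include."""
    return {"tenant_id": tenant_from_claims(claims)}
```

Because every query path funnels through the same scoping helper, forgetting the tenant filter becomes a hard failure rather than a silent data leak.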
IAM roles and policies further enforce isolation at the infrastructure level. Each Lambda function assumes minimal required permissions, following the principle of least privilege. Cross-account access for reading customer billing data uses temporary credentials from AWS STS, ensuring Foundation Zero never stores long-term customer credentials.
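A hedged sketch of that cross-account pattern (the role name, session name, and function names are made up for illustration):

```python
def customer_role_arn(account_id: str,
                      role_name: str = "FoundationZeroReadBilling") -> str:
    """Build the ARN of the read-only role a customer deploys (name illustrative)."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"

def billing_credentials(account_id: str, external_id: str) -> dict:
    """Assume the customer role and return short-lived STS credentials."""
    import boto3  # imported lazily; requires AWS credentials at runtime
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=customer_role_arn(account_id),
        RoleSessionName="fz-billing-crawl",
        ExternalId=external_id,      # mitigates the confused-deputy problem
        DurationSeconds=3600,
    )
    return resp["Credentials"]
```

The returned credentials expire on their own, so nothing long-lived ever needs to be stored or rotated on the platform side.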
Processing at scale: Thousands of events nightly
The platform's ability to process thousands of cost optimization events nightly showcases the power of serverless architectures. During peak processing windows, the system can spawn hundreds of concurrent Lambda executions, each analyzing different aspects of cloud spend.
EventBridge orchestrates this complex dance of services, triggering processing workflows based on data availability and system capacity. The event-driven architecture means new cost data immediately triggers analysis, ensuring optimization recommendations reflect the latest opportunities.
SQS queues provide natural buffering between services, allowing the Crawler Service to process data as fast as possible while preventing overwhelming downstream aggregation. This design enables predictable performance regardless of whether the system is processing ten accounts or ten thousand.
The combination of Lambda's automatic scaling and RDS PostgreSQL's optimized configuration means the platform handles variable loads efficiently. During quiet periods, costs remain low, while peak processing times see automatic scale-out to meet demand.
Engineering RDS PostgreSQL and API Gateway for high-throughput event processing
The real engineering challenge came in ensuring RDS PostgreSQL could handle the nightly tsunami of events from hundreds of concurrent crawler Lambda functions. We implemented several optimization strategies to achieve the required throughput:
Connection pooling with RDS Proxy: Rather than having each Lambda function establish direct database connections, we implemented RDS Proxy to manage connection pooling. This reduced connection overhead and allowed PostgreSQL to handle thousands of Lambda invocations without exhausting connection limits. The proxy maintains a warm pool of connections, dramatically reducing the latency of database operations.
Partitioned tables for time-series data: We implemented table partitioning by date for event storage, allowing PostgreSQL to efficiently manage the millions of events accumulated over time. This strategy ensures query performance remains consistent even as data volume grows, while enabling efficient data archival.
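A simplified example of what creating such partitions might look like, assuming a hypothetical `cost_events` parent table declared with `PARTITION BY RANGE (event_date)`:

```python
# Sketch: generate the DDL for one monthly partition of the events table.
# Table and column names are illustrative, not Foundation Zero's schema.
def month_partition_ddl(year: int, month: int,
                        parent: str = "cost_events") -> str:
    nxt_y, nxt_m = (year + 1, 1) if month == 12 else (year, month + 1)
    return (
        f"CREATE TABLE {parent}_{year}_{month:02d} "
        f"PARTITION OF {parent} "
        f"FOR VALUES FROM ('{year}-{month:02d}-01') "
        f"TO ('{nxt_y}-{nxt_m:02d}-01');"
    )
```

A scheduled Lambda can run this a month ahead of time so partitions always exist before data arrives, and dropping an old partition becomes a cheap metadata operation instead of a bulk DELETE.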
API Gateway caching strategies: To handle the read load from dashboards and real-time cost analytics, we implemented intelligent caching at the API Gateway level. Frequently accessed metrics and aggregations are cached with short TTLs, reducing database load while ensuring users see near real-time data. This caching layer absorbs thousands of API requests per minute without impacting the database's ability to process incoming crawler events.
Best practices and lessons learned
Building Foundation Zero reinforced several serverless best practices:
Function composition over monoliths: Instead of large, complex Lambda functions, we created focused functions that do one thing well. This improved debugging, reduced cold start times, and enabled fine-grained scaling.
Asynchronous patterns everywhere: By embracing asynchronous processing through SQS and EventBridge, we created a resilient system that handles failures gracefully and scales naturally.
Data partitioning strategies: Organizing S3 data by customer, date, and analysis type enabled efficient querying while supporting data lifecycle management.
Observability from day one: Comprehensive CloudWatch metrics and X-Ray tracing provided visibility into the distributed system's behavior, crucial for optimizing performance and debugging issues.
Delivering true business value
The serverless architecture delivered exactly what Foundation Zero needed: a platform that scales with their customer base while maintaining predictable costs. The pay-per-use model of Lambda and optimized RDS PostgreSQL configuration means their infrastructure costs align with platform usage while handling massive event throughput.
Processing thousands of cost optimization events nightly without managing servers allows Foundation Zero's team to focus on improving their Cloud Score algorithms and customer experience rather than infrastructure maintenance. The automatic scaling capabilities ensure consistent performance whether analyzing a startup's single account or an enterprise's hundreds of accounts.
Most importantly, the architecture enables Foundation Zero to practice what they preach: optimizing cloud costs. By leveraging serverless services, they've built a FinOps platform that exemplifies cloud cost efficiency while delivering powerful capabilities to their customers.
Conclusion
Building Foundation Zero on AWS serverless services proved that complex, data-intensive platforms don't require traditional server-based architectures. The combination of Lambda, RDS PostgreSQL, EventBridge, and supporting services created a scalable, cost-effective solution that processes thousands of cost optimization events nightly while providing real-time insights to users.
The success of this architecture demonstrates how serverless patterns can handle enterprise-scale workloads while maintaining the agility and cost efficiency that modern platforms demand. For organizations looking to build similar data processing platforms, the Foundation Zero architecture provides a blueprint for leveraging AWS serverless services to deliver true business value.
If you're facing similar challenges with building scalable data processing platforms or want to explore how serverless architectures can transform your applications, we'd love to discuss your project. Contact us at connect@cloudandthings.io to start the conversation.



