Serverless with AWS Lambda: Reducing metrics reporting lag from hours to seconds at ReSci

Author: Avi Sanadhya, ReSci Platform Engineering Team

At Retention Science, we deliver personalized marketing campaigns powered by machine learning to drive a deeper level of customer engagement. Our AI engine, Cortex, is responsible for billions of predictions daily and hundreds of millions of personalized emails each month. As those numbers grow, it becomes increasingly important to report campaign metrics in a fast, efficient, and fault-tolerant way.

In a recent project, we upgraded our existing nightly metrics reporting pipeline to an efficient real-time streaming pipeline. We achieved this using AWS Lambda in conjunction with Amazon Kinesis, Amazon Aurora and Amazon S3.  

Legacy Batch Pipeline

We capture different user events (e.g. email opens and clicks) and forward them to an Amazon Kinesis stream. Kinesis workers developed using the Kinesis Client Library (KCL) pull these events and populate our persistent data stores. Previously, a nightly ETL process generated reporting metrics for all email campaigns by loading and processing this events data. The service was written in Scala and leveraged HDFS and Apache Spark to crunch the data.
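
To give a sense of the shape of that batch job, here is a minimal sketch of a nightly campaign-metrics aggregation in Spark. It is illustrative only: the HDFS paths, event schema, and column names are placeholders rather than our production job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch of a nightly campaign-metrics aggregation.
// HDFS paths and column names are placeholders.
object NightlyCampaignMetricsEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("nightly-campaign-metrics").getOrCreate()

    // Load the previous day's raw email events from HDFS.
    val events = spark.read.json("hdfs:///events/email/dt=2017-06-01")

    // Roll the events up into per-campaign counts of opens, clicks, etc.
    val metrics = events
      .filter(col("event_type").isin("open", "click", "unsubscribe"))
      .groupBy("campaign_id", "event_type")
      .agg(count("*").as("event_count"))

    // Persist the aggregates for the reporting layer to pick up.
    metrics.write.mode("overwrite").parquet("hdfs:///reports/campaign_metrics/dt=2017-06-01")
    spark.stop()
  }
}
```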

While the ETL was reasonably quick, this approach had several drawbacks. First, the batch nature of the updates meant a reporting lag of up to 24 hours, which provided a poor experience for our end users. Second, as our client base grew, the ETL costs increased significantly. Finally, the overhead of provisioning and managing resources to run the ETLs was becoming a pain point.

Magic Wand: Serverless Computing Using AWS Lambda

The shortcomings of our previous approach led us to research and brainstorm new methods and technologies. We ended up designing a fast, efficient, and reliable system powered by AWS Lambda. AWS Lambda is a serverless computing service from Amazon that lets developers run code without provisioning or managing servers. Cost is based on how often the code runs and for how long. Lambda code can be triggered by events from other AWS services such as S3, DynamoDB, and Kinesis.
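
As a quick illustration of the trigger model, attaching a Lambda function to a Kinesis stream amounts to creating an event source mapping; Lambda then polls the stream and invokes the function with batches of records. The sketch below uses the AWS SDK for Java (v1) from Scala, and the function name and stream ARN are placeholders.

```scala
import com.amazonaws.services.lambda.AWSLambdaClientBuilder
import com.amazonaws.services.lambda.model.{CreateEventSourceMappingRequest, EventSourcePosition}

// Sketch: wire a Lambda function to a Kinesis stream via an event source mapping.
// The function name and stream ARN are placeholders.
object WireKinesisTrigger {
  def main(args: Array[String]): Unit = {
    val lambda = AWSLambdaClientBuilder.defaultClient()

    val mapping = lambda.createEventSourceMapping(
      new CreateEventSourceMappingRequest()
        .withFunctionName("campaign-metrics-aggregator")
        .withEventSourceArn("arn:aws:kinesis:us-east-1:123456789012:stream/user-events")
        .withBatchSize(100)
        .withStartingPosition(EventSourcePosition.LATEST)
        .withEnabled(true))

    println(s"Created event source mapping ${mapping.getUUID}")
  }
}
```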

Kinesis Lambda Pipeline

There are two main event sources (Kinesis streams) in our real-time pipeline: email deliverability events from our email service, and user events (e.g. email opens and clicks) from our events tracker. Kinesis workers process the user events and forward metric-related events to another Kinesis stream. In our pipeline, AWS Lambda polls both the user events and deliverability events Kinesis streams. This Lambda is responsible for processing the events and generating aggregated email metrics in real time. We still run the batch pipeline to maintain backward compatibility and to have the flexibility to roll back if anything goes wrong in the real-time pipeline.
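
To illustrate the forwarding step, a KCL (Kinesis Client Library 1.x) record processor that republishes metric-related events to a second stream could look roughly like the sketch below. The destination stream name and the filtering rule are simplified placeholders, not our production logic.

```scala
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
import com.amazonaws.services.kinesis.model.Record
import java.nio.charset.StandardCharsets
import scala.jdk.CollectionConverters._

// Sketch of a KCL record processor that forwards metric-related user events
// to a second Kinesis stream. Stream name and filter are placeholders.
class MetricEventForwarder extends IRecordProcessor {
  private val kinesis = AmazonKinesisClientBuilder.defaultClient()

  override def initialize(shardId: String): Unit = ()

  override def processRecords(records: java.util.List[Record],
                              checkpointer: IRecordProcessorCheckpointer): Unit = {
    records.asScala.foreach { record =>
      val payload = new String(record.getData.array(), StandardCharsets.UTF_8)
      // Placeholder filter: forward only open and click events.
      if (payload.contains("\"open\"") || payload.contains("\"click\"")) {
        kinesis.putRecord("metric-events", record.getData, record.getPartitionKey)
      }
    }
    // Record our progress so a restarted worker resumes from here.
    checkpointer.checkpoint()
  }

  override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                        reason: ShutdownReason): Unit = {
    if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
  }
}
```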

Let me run through an example to showcase how fast and efficient AWS Lambda is:

  1. A customer clicks on an email sent by our email service.
  2. The email click is picked up by our event trackers and is forwarded to the Kinesis stream. 
  3. By continuously polling the stream, the Kinesis workers pick up the click event almost immediately, format and augment it with relevant details, and forward it along to the user events Kinesis stream.
  4. AWS Lambda polls the user events Kinesis stream every 200 milliseconds and picks up the formatted message. The Lambda aggregates similar events by campaign and stores the results in the reporting database (a sketch of this handler follows the diagram below).
  5. Our Analytics dashboard hits a service API to read these numbers from the database tables.

Workflow of Lambda updating metrics tables
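
To make steps 4 and 5 concrete, here is a minimal sketch of what an aggregating Lambda handler could look like in Scala. It is not our production handler: it assumes each Kinesis record carries a simple "campaignId|eventType" payload and that campaign_metrics is a table in a MySQL-compatible Aurora database, with the connection URL supplied through an environment variable. All of those names are placeholders.

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.KinesisEvent
import java.nio.charset.StandardCharsets
import java.sql.DriverManager
import scala.jdk.CollectionConverters._

// Sketch of an aggregating Lambda. Payload format, table name, and the
// REPORTING_DB_URL environment variable are placeholders.
class CampaignMetricsAggregator extends RequestHandler[KinesisEvent, Void] {

  override def handleRequest(event: KinesisEvent, context: Context): Void = {
    // Decode each record in the batch and tally events per (campaign, event type).
    val counts = event.getRecords.asScala
      .map { record =>
        val payload = new String(record.getKinesis.getData.array(), StandardCharsets.UTF_8)
        val Array(campaignId, eventType) = payload.split('|')
        (campaignId, eventType)
      }
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size }

    // Apply the aggregated deltas to the reporting table in one batched upsert.
    val conn = DriverManager.getConnection(sys.env("REPORTING_DB_URL"))
    try {
      val stmt = conn.prepareStatement(
        """INSERT INTO campaign_metrics (campaign_id, event_type, event_count)
          |VALUES (?, ?, ?)
          |ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count)""".stripMargin)
      counts.foreach { case ((campaignId, eventType), delta) =>
        stmt.setString(1, campaignId)
        stmt.setString(2, eventType)
        stmt.setInt(3, delta)
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally conn.close()
    null
  }
}
```

Because Lambda hands the function records in batches, aggregating in memory before writing keeps the number of database round trips small even when event volume spikes.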

Since the Kinesis workers and the Lambda are event-driven, it takes less than a couple of seconds for any given event to reach its destination. And because Lambda scales continuously, processing time remains consistent whether we receive hundreds of events or millions.

Conclusion

AWS Lambda is proving to be an efficient and scalable way to process real-time data. Since its launch in 2014, it has seen rapid adoption by major global businesses. There are definitely some caveats: statelessness, limited native language support, limited execution duration, slow "cold starts" for JVM-based Lambdas, and concurrent execution limits. We're excited to see how AWS continues to add functionality and improvements to address these needs. Adopting serverless computing has simplified our infrastructure and lowered our monthly AWS bill. Stay tuned for our next blog post, where we'll cover designing our real-time pipeline to be robust and fault-tolerant.