Automating Machine Learning Monitoring [RS Labs]

This blog post takes a short dive into one of our internal monitoring tools, which oversees our entire ETL pipeline and helps us stay on top of our machine learning models.


Imagine if what the viral “polite grandma” was thinking when she typed her search query were actually true: that there is a human operator on the other side of the screen accepting queries at their whim, who then manually searches through websites and returns a list of relevant results. It is in such a pre-automated world that you could also imagine data scientists manually monitoring a long list of machine learning models for a long list of clients. Google would need one hell of a customer service department to cater to its 40,000 searches per second, and we data scientists at Retention Science would not have enough time to refill our coffees.

One word: Scalability. Well, actually two words: Scalability and Laziness. You want to do the same thing you do for one client across 100 clients? Automate. You’re lazy and hate monotonous labor? Definitely automate! The latter turned out to be an earlier (and bigger) motivator for us at ReSci to start automating most of our daily and weekly monitoring tasks.

In ancient times, each data scientist would get assigned a diverse list of clients and the task would be to monitor all kinds of models for them once a week. We would go into our database, see textual reports for all models running for each client and probably run a script to draw a chart or two to help our curious eyes.

However, when you have a sophisticated Artificial Intelligence platform with over 50 models running across many clients, like we do at Retention Science, that adds up to a lot of work. And with the list of clients and models growing, that pile is only going to get stacked higher and higher.

Enter Sauron:


No, not the evil tyrant Sauron. Sheath your swords. This is our all-seeing, one-service-to-view-them-all Internal Machine Learning Monitoring System, which we also call Sauron. Here’s a brief background on its evolution.

Data Visualization:

Every family of machine learning models in our system comes with its own custom reporter, which publishes a list of important statistics regarding the output of the model. For example, for our recommendation models, we would want to see what the recommended-item distribution looks like, how many users the model covers, and how many of the active items are being represented, etc.
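As a rough illustration of what such a reporter might compute for a recommendation model, here is a minimal Python sketch. The function name, the input shapes, and the sample data are all assumptions made for this example, not our actual reporter code.

```python
from collections import Counter


def recommendation_report(recommendations, active_users, active_items, top_n=3):
    """Summarize recommendation output (hypothetical reporter, for illustration).

    recommendations: dict mapping user id -> list of recommended item ids.
    active_users, active_items: sets of ids considered active.
    """
    # How often each item is recommended across all users.
    item_counts = Counter(
        item for items in recommendations.values() for item in items
    )
    # Active users who received at least one recommendation.
    covered_users = {u for u, items in recommendations.items() if items} & active_users
    # Active items that appear in at least one recommendation list.
    covered_items = set(item_counts) & active_items
    return {
        "user_coverage": len(covered_users) / len(active_users) if active_users else 0.0,
        "item_coverage": len(covered_items) / len(active_items) if active_items else 0.0,
        "top_items": item_counts.most_common(top_n),
    }


# Made-up sample data: three users with output, four active users, three active items.
report = recommendation_report(
    {"u1": ["a", "b"], "u2": ["a"], "u3": []},
    active_users={"u1", "u2", "u3", "u4"},
    active_items={"a", "b", "c"},
)
print(report)
```

Here half of the active users are covered, and two of the three active items are represented.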

Being able to digest these numerical outputs with charts and colors was naturally the first priority for our team. We created an application, which we named Sauron, where we could see how our models were performing across all clients using data visualization libraries. An added advantage of this application was that we could easily compare different models from the same suite (e.g. churn) side by side, which made it easier to make A/B test decisions. Today, we have a separate A/B Framework API where we can easily set up any kind of unbiased test among different models, but more on this in another blog post.


Another big win was being able to view our model performances over time, which was previously very hard to do as it would involve manually opening up every report for historic dates. These time-series charts helped us a great deal in learning more about our models and how they perform for each client temporally. We made it a standard practice to add a monitoring UI for every new model that we added to our toolset. This proved to be a great addition to our stack.

Monitoring Dashboard:

We set up weekly tasks where each one of our data scientists would have a checklist to monitor for a list of their assigned clients. A few examples of these checks would be:

  • Does the default recommendation scheme have good user and item coverage?
  • Does the churn distribution look normal? Are the areas under the ROC and PR curves acceptable?
  • Is there a good distinction among the different cohorts of CLV and Churn?
  • Is the error for predicted site-level CLV, evaluated over 3 months, in an acceptable range?
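To make one of these checks concrete, the sketch below computes the area under the ROC curve for a churn model and compares it with a cutoff. The pairwise formulation is a standard way to compute AUC, but the 0.7 threshold and the sample labels and scores are assumptions for illustration only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example scores higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


MIN_ACCEPTABLE_AUC = 0.7  # hypothetical threshold, not a ReSci setting

# Made-up data: 1 = churned, 0 = retained, with predicted churn scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]

auc = roc_auc(labels, scores)
auc_ok = auc >= MIN_ACCEPTABLE_AUC
print(f"ROC AUC = {auc:.3f}, acceptable = {auc_ok}")
```

The pairwise loop is quadratic in the number of examples, so it suits a small illustration rather than production-sized data, where a library routine would be the usual choice.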

These painstaking but important checks took up most of our energetic Monday mornings. But it was working. We caught a few bugs here and there, tweaked model parameters and even switched models depending on our observations. This feedback loop helped our models learn even better. But since we had more models turned on than were actually being used, and each model had a lot of information, we agreed to build a single monitoring dashboard that would show only the most important and sought-after statistics, and only for the models that were actually powering emails.


Fig: A typical monitor, showing some of the most relevant metrics for our models.

The monitor would soon become our one-stop place for our weekly checks. We could see the most important distributions and time-series charts here and easily point out any anomalies. In case we needed more probing, we would go back to that specific model dashboard and learn more. The monitor also included a delta indicator next to each important metric, showing whether its value went up or down and by how much compared to the previous day’s value. A quick glance at these variations would be enough to see how things were going.
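The delta indicator described above can be sketched in a few lines of Python. The function name and the output format are assumptions for this example; the real monitor renders the change visually next to each metric.

```python
def delta_indicator(today, yesterday):
    """Describe how a metric moved compared with the previous day's value."""
    change = today - yesterday
    if change > 0:
        direction = "up"
    elif change < 0:
        direction = "down"
    else:
        direction = "flat"
    if yesterday == 0:
        # A percentage change is undefined when the previous value is zero.
        return f"{direction} {abs(change):.2f}"
    pct = change / yesterday * 100
    return f"{direction} {abs(change):.2f} ({pct:+.1f}%)"


# Made-up values: user coverage fell from 0.80 yesterday to 0.72 today.
print(delta_indicator(0.72, 0.80))
```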

Automation: Alerts

Even though we were now spending less time per client, Sauron was still not a scalable solution. The more clients were added to the list, the harder it became to give adequate attention to each one while also balancing other projects. Remember the whole motivating story about Automation above? We knew what we had to do to scale.

Our team brainstormed and came up with a few key metrics and thresholds that would define whether a particular behavior was worthy of an alert or not. We had two types of checks:

    • Absolute Checks (e.g. user coverage should not be equal to 0)
    • Relative Checks (e.g. today’s user coverage should not fall more than 20% from yesterday’s)

The idea is simple. If something doesn’t look or change the way it should, then something is wrong. So once our ETLs were finished for the day, these checks would be run for all models across all clients and qualifying alerts would then be compiled and emailed to the data science team to act upon.
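A minimal Python sketch of the two check types, using the user-coverage examples above, might look like this. The data layout (a dict keyed by client and model), the function name, and the sample values are assumptions for illustration; the actual emailing step is left out.

```python
MAX_RELATIVE_DROP = 0.20  # the 20% figure from the relative-check example


def collect_alerts(today, yesterday):
    """Run absolute and relative checks on user coverage.

    today, yesterday: dicts mapping (client, model) -> user coverage
    (a hypothetical layout chosen for this sketch).
    """
    alerts = []
    for key, value in sorted(today.items()):
        client, model = key
        # Absolute check: user coverage should never be 0.
        if value == 0:
            alerts.append(f"{client}/{model}: user coverage is 0")
            continue
        # Relative check: skipped when there is no usable previous value.
        previous = yesterday.get(key)
        if previous:
            drop = (previous - value) / previous
            if drop > MAX_RELATIVE_DROP:
                alerts.append(
                    f"{client}/{model}: user coverage fell {drop:.0%} since yesterday"
                )
    return alerts


# Made-up coverage values for three client/model pairs.
today = {("acme", "recs"): 0.0, ("acme", "churn"): 0.5, ("globex", "recs"): 0.9}
yesterday = {("acme", "recs"): 0.8, ("acme", "churn"): 0.8, ("globex", "recs"): 0.95}

alerts = collect_alerts(today, yesterday)
print("\n".join(alerts))  # this text could become the body of the alert email
```

Reporting the absolute failure first and skipping the relative check for that metric keeps one broken model from producing two alerts for the same problem.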

The benefits of establishing an alert system were twofold. First, we caught more bugs, and we caught them at the point of creation. Second, it removed human error from the task of identifying anomalies and saved valuable time for key stakeholders.


Fig: A highly summarized view of part of our stack and how Sauron fits into it

At Retention Science, we have a well-monitored end-to-end pipeline that seamlessly runs for all of our trusted customers. Our data science team has achieved enough abstraction to avoid getting into the weeds unnecessarily while also having more confidence in our supervision, allowing us to improve our productivity and scale effectively. Our mission is to ensure that we draw our gaze to the right spot at the right time and do not let any hobbits elude us while heading towards Mount Doom.