Evaluating Machine Learning Predictions: Customer Churn & CLV [RS Labs]

At Retention Science, we are committed to making machine learning and artificial intelligence more accessible and understandable. This post introduces our process for evaluating the accuracy of two crucial predictive models: Customer Churn Prediction and Customer Future Value (CFV). These two predictions provide invaluable insight into how to keep customers engaged.

The purpose of our evaluation framework is twofold. Internally, it helps us choose the best-performing model for the prediction problem at hand. Externally, it serves as a reporting tool that lets marketers examine the prediction accuracy of the models.

Churn Prediction Evaluation:

Methodology:

In an earlier blog post we described how we build and tune our churn models. At a given date, our models predict a probability of churn for each user. Since these raw probabilities are not directly actionable, we bucket users into three segments based on their probability: low, medium, and high churn groups.
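
For illustration, here is a minimal sketch of that bucketing in Python, assuming scores live in a pandas Series; the 0.33/0.66 cutoffs are hypothetical, not the thresholds we actually use:

```python
import pandas as pd

def bucket_churn_scores(scores: pd.Series) -> pd.Series:
    """Map raw churn probabilities in [0, 1] to low/medium/high segments."""
    return pd.cut(scores, bins=[0.0, 0.33, 0.66, 1.0],
                  labels=["low", "medium", "high"], include_lowest=True)

scores = pd.Series({"user_a": 0.05, "user_b": 0.41, "user_c": 0.92})
print(bucket_churn_scores(scores))  # user_a -> low, user_b -> medium, user_c -> high
```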

Once the prediction is made, we wait out a hold-out period, usually about 3 months, to evaluate how the models performed during that window. Using that hold-out window as the baseline, we can define the labels of what we consider churn. For instance, we can say that a shopper who didn’t purchase within those 3 months churned. We then proceed to evaluate different metrics.
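
A sketch of that labeling step, assuming an orders table with illustrative user_id and order_date columns:

```python
import pandas as pd

def label_churn(orders, user_ids, prediction_date, holdout_months=3):
    """Label each user 1 (churned) if they placed no order in the hold-out window."""
    window_end = prediction_date + pd.DateOffset(months=holdout_months)
    in_window = orders[(orders["order_date"] > prediction_date) &
                       (orders["order_date"] <= window_end)]
    purchasers = set(in_window["user_id"])
    return pd.Series({u: int(u not in purchasers) for u in user_ids})

orders = pd.DataFrame({"user_id": ["a", "b"],
                       "order_date": pd.to_datetime(["2017-02-10", "2017-06-01"])})
print(label_churn(orders, ["a", "b", "c"], pd.Timestamp("2017-01-01")))
# a bought inside the window -> 0; b bought after it -> 1; c never bought -> 1
```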

Churn Evaluation Metrics:

The metrics are divided into 2 categories:

  • Evaluation metrics related to churn segments
  • Binary classification performance

The first category of metrics shows how discriminative the churn groups were. For each churn group we provide the churn rate, the average number of orders per user during the hold-out period, and the average order value per user (see Fig 1.1).

This informs the marketer of how differently the churn segments actually behaved, and marketing decisions can be made based on how the segments compare to each other. For example, when deciding on incentives, it helps to know how much revenue high-churn-probability users bring in on average compared to low-probability users.
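
Computing these segment-level statistics is essentially a group-by; a minimal sketch with illustrative column names:

```python
import pandas as pd

# One row per user: predicted segment plus observed hold-out behavior.
users = pd.DataFrame({
    "churn_segment":   ["low", "low", "medium", "high", "high"],
    "churned":         [0, 0, 1, 1, 1],
    "holdout_orders":  [3, 2, 1, 0, 0],
    "holdout_revenue": [120.0, 75.0, 30.0, 0.0, 0.0],
})
segment_stats = users.groupby("churn_segment").agg(
    churn_rate=("churned", "mean"),
    avg_orders=("holdout_orders", "mean"),
    avg_revenue=("holdout_revenue", "mean"),
)
print(segment_stats)
```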

The second category of metrics evaluates the raw churn score predictions. We use the raw scores to compute evaluation metrics such as

  • F-measure
  • ROC curve
  • Precision-Recall curve
  • Reliability Curve

These robust metrics give the data scientist or analyst on your marketing team a picture of how well the churn models were able to classify churners and non-churners (for instance, how many people who were predicted to churn actually did) and make it possible to compare the models against a standard baseline. Each metric is robust to different distributions of churners and non-churners and brings its own insight into performance. The reliability curve calibrates the predicted probabilities against true churn rates, leading to better interpretability.
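
All four are readily available in scikit-learn; a minimal sketch with toy data standing in for hold-out labels and raw scores:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_curve, precision_recall_curve
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # hold-out churn labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # predicted probabilities

f1 = f1_score(y_true, y_score >= 0.5)                           # F-measure at a 0.5 cutoff
fpr, tpr, _ = roc_curve(y_true, y_score)                        # ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)  # Precision-Recall curve
frac_pos, mean_pred = calibration_curve(y_true, y_score, n_bins=4)  # Reliability curve
```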

Fig 1.1 Churn Segment Results shown on RS-Sauron

Fig 1.2 Measurements evaluating the binary classification


CFV Validation:

We define customer lifetime value (CLV) as the sum of a user's past value (the observed component) and future value (the predictive component). An important CRM prediction is how much value a user will bring in the future; at ReSci, we use a full-margin analysis.

Methodology:

As with the churn models, at a given date our models predict the expected value each user will bring in within the next ‘n’ months, and bucket users into three groups based on that value: low, medium, and high. We wait out a hold-out period of ‘n’ months, then see how the future unfolded, and how accurate our predictions were.
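
One simple way to form such groups is by value quantiles; a hypothetical sketch (our actual cutoffs may differ, e.g. fixed dollar thresholds rather than tertiles):

```python
import pandas as pd

# Predicted future value per user, split into three equal-sized tiers.
predicted_cfv = pd.Series([4.0, 12.5, 80.0, 150.0, 9.0, 310.0])
cfv_group = pd.qcut(predicted_cfv, q=3, labels=["low", "medium", "high"])
print(cfv_group)
```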

Once we have the true value a user brought in, that number can be used to evaluate the value we predicted that user would bring in. CFV evaluation metrics are divided into 2 categories:

  • CFV overall accuracy metrics
  • Group CFV discrimination metrics

The overall accuracy metrics give both user-level accuracy and site-level (companywide) accuracy. User-level accuracy shows how many predicted user values overshot the true value, how many we got exactly right, and the mean absolute error. It also provides other statistical measures such as Pearson's correlation and RMSE.
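
A minimal sketch of those user-level measures, with toy arrays standing in for predicted and realized values:

```python
import numpy as np
from scipy.stats import pearsonr

y_pred = np.array([10.0, 55.0, 0.0, 120.0])   # predicted future value per user
y_true = np.array([8.0, 60.0, 0.0, 95.0])     # value actually brought in

overshot = np.mean(y_pred > y_true)             # share of users we overshot
exact = np.mean(y_pred == y_true)               # share predicted exactly right
mae = np.mean(np.abs(y_pred - y_true))          # mean absolute error
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2)) # root mean squared error
corr, _ = pearsonr(y_pred, y_true)              # Pearson's correlation
```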

Fig 2.1 Results depicting overall future revenue evaluations

Fig 2.2 Results depicting CFV Group discriminations


The second category of metrics is the group-level discrimination metrics. As mentioned previously, marketers have at their disposal 3 cohorts of users in terms of future value: low, medium, and high-value customers. We then measure how these cohorts performed: how much value each cohort brought in on average, as well as how many purchase orders each cohort placed.
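
As with the churn segments, this reduces to per-cohort averages; a sketch with illustrative column names:

```python
import pandas as pd

users = pd.DataFrame({
    "cfv_group":       ["low", "low", "medium", "high", "high"],
    "holdout_revenue": [0.0, 12.0, 45.0, 210.0, 180.0],
    "holdout_orders":  [0, 1, 2, 5, 4],
})
cohort_stats = users.groupby("cfv_group").agg(
    avg_value=("holdout_revenue", "mean"),
    avg_orders=("holdout_orders", "mean"),
)
print(cohort_stats)
```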

Conclusion:

To reduce machine learning technical debt and automate monitoring, we've established email alerts that warn us if any of these evaluation metrics drifts even slightly. This way, we're able to keep track of our models' performance and ensure they stay running in top shape.
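
A hypothetical sketch of such an alert check, comparing the latest evaluation run against fixed tolerances (the metric names and floors are illustrative, and the print stands in for an email send):

```python
THRESHOLDS = {"auc": 0.75, "f1": 0.40}  # illustrative floor values per metric

def failing_metrics(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that fell below their allowed floor."""
    return [name for name, floor in thresholds.items()
            if metrics.get(name, 0.0) < floor]

alerts = failing_metrics({"auc": 0.72, "f1": 0.45})
if alerts:
    print(f"ALERT: metrics below threshold: {alerts}")  # would trigger an email
```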

PS: Questions, or just want to learn more? Feel free to contact us at ds@retentionscience.com for any questions related to Churn and CLV Predictions/Evaluations.