The Smart Marketer: When to Use Multi-Armed Bandit A/B Testing

What if as a marketer you could run 10 A/B tests within a week without lifting a finger instead of the standard monthly testing? You could be getting a significant increase in productivity and performance if you do it right.

A/B testing is a standard step in the marketing process. Without A/B testing, marketers wouldn’t have the necessary data points to maximize their marketing efforts and drive an effective campaign. An A/B test is mainly used when you want to see which treatment causes the results you want, or when you want to know which of many possible actions leads to the best results. In the latter case, the standard A/B test turns out not to be the best way to get the desired results.

In a simple A/B test, we sample the data and run the test over a period of time to see which behavior is optimal. This A/B test can be done once a month. But what if I want to do these tests 10 times in a week?

Running an A/B test after sampling each piece of data and then acting on the best result takes quite a bit of mental energy and time. Instead, what if you could put all the possible actions on the traffic you wanted and let the machine pick the best action for you? Multi-armed bandits are a better way to do this. But before we dive into any real-world examples of multi-armed bandits, let’s get educated.

What is a multi-armed bandit?

“Multi-armed bandit” isn’t familiar lingo in the marketing world, but it’s increasingly becoming a part of a marketer’s day-to-day work, whether one realizes it or not. When you switch to an all-purpose, automated marketing platform such as Cortex, chances are you’re dealing with multi-armed bandits. So, it’s best to familiarize yourself with the concept. Now, let’s look at each term:


A marketer might ask: Which options are useful for us? What kind of actions can we take? The possible options or actions are called arms. For email marketing, possible email subject lines or email templates in a campaign can be called arms.


A bandit is a collection of arms. We call a collection of useful options a multi-armed bandit. The multi-armed bandit is a mathematical model that provides decision paths when there are several actions present, and incomplete information about the rewards after performing each action. The problem of choosing the arm to pull is called the “multi-armed bandit problem.”

Now that we understand what multi-armed bandit means, it’s time to get a high level picture of how the multi-armed bandit works.

Suppose that we have two email templates, such as template A and B, for marketing campaigns for new sign-up users. The two email templates are the arms for our multi-armed bandit, and the direct metric to evaluate these templates is the click-through rate (CTR).

Since we’ve never tested these templates before, we’ll assume that these two templates have the same expected CTR such as 0.5.

In the first round, we send 50 emails with template A and another 50 emails with template B, since both have the same expected CTR. Afterwards, we can see which emails were clicked, and calculate the CTR for each template from clicks and impressions. The observed CTRs are used to update our initial assumptions about each template. If template A has 0.1 CTR and template B has 0.05 CTR from the first round, the CTR assumptions for the next round follow these observations.

In the second round, we randomly generate an expected CTR for template A and template B from the updated assumptions, then choose the template with the higher expected CTR. Repeated over the round, this might lead us to send, say, 80 emails with template A and 20 emails with template B.

As we keep updating the assumptions round after round, the expected CTR for each template moves toward its observed CTR.
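
The round-by-round procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform’s actual implementation: each template’s belief is stored as a (successes, failures) pair for the beta distribution introduced later in this article, the function name allocate_round is made up for this example, and template B’s counts are an assumption (2 clicks out of 50, close to the 0.05 CTR above).

```python
import random


def allocate_round(beliefs, n_emails, rng):
    """Split n_emails across templates by sampling each template's belief."""
    counts = {name: 0 for name in beliefs}
    for _ in range(n_emails):
        # Draw one plausible CTR per template from its beta belief.
        draws = {name: rng.betavariate(s, f) for name, (s, f) in beliefs.items()}
        # Send this email with the template whose draw is highest.
        winner = max(draws, key=draws.get)
        counts[winner] += 1
    return counts


# Beliefs after round 1 as (successes, failures), starting from (1, 1).
# A: 5 clicks out of 50 (from the article); B: 2 clicks out of 50 (assumed).
beliefs = {"A": (1 + 5, 1 + 45), "B": (1 + 2, 1 + 48)}

sends = allocate_round(beliefs, n_emails=100, rng=random.Random(0))
print(sends)  # number of emails per template for the next round
```

Because the draws are random, the exact split changes from run to run. Templates whose beliefs point to a higher CTR tend to receive more emails, while weaker templates still get some traffic as long as the machine is uncertain.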

There are many algorithms to implement multi-armed bandits. We use a Bayesian model. The advantage of the Bayesian model is that we can easily incorporate the observations into the assumptions, and improve the assumptions with higher confidence over time.

Initially, when we look at the two template examples we assume that these templates have the same expected CTR. Of course, it turns out this expected CTR is different from the real observed CTR. No big deal, we can simply update our assumption.

Let’s say our assumption for template A is 0.5 CTR and the observed CTR for template A is 0.1 (5 out of 50). The initial CTR assumption is called a prior in statistics. The prior is something we believe to be true before we have any evidence or observation. To model the prior using statistics, we use a beta distribution.

The beta distribution is a probability distribution over the probability of success when there are many trials that each result in either a success or a failure. It is described by two parameters. Put succinctly, one parameter refers to the number of successes, and the other parameter refers to the number of failures. When there are no observations yet, these numbers can be chosen arbitrarily. In our example, we set both the number of successes and the number of failures to 1 for each template, which corresponds to the expected CTR of 1 / (1 + 1) = 0.5 that we assumed.

Since we have observations for two templates, the observations can be modeled after binomial distributions. The binomial distribution can be thought of as the number of successes from several trials, such as sending emails.

We already know the beta distribution has two parameters: successes and failures. Updating our assumption based on observations therefore means updating these two parameters.

Once we set assumptions as beta distributions and observations as binomial distributions, then the update of the beta distribution is simple.

updated successes = prior successes + observed successes
updated failures = prior failures + observed failures    (1)

These equations are derived from the relationship between the beta and the binomial distributions. We don’t need to reveal the details regarding how the equations are derived. It’s enough to know that the beta distribution is a conjugate prior when the observation is the binomial distribution.

From the first round, we know template A has 5 successes (clicks) and 45 failures. It then follows from equation 1 that the updated success is 1 + 5 = 6 and the updated failure is 1 + 45 = 46. The expected CTR for template A is 6 / (6 + 46) = 6 / 52 ≈ 0.12, because the mean of a beta distribution is its success parameter divided by the sum of both parameters.
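
The update for template A can be checked with a short Python sketch. The function names update_beta and expected_ctr are made up for this illustration; the only facts used are the starting parameters of 1 and 1, the 5 clicks out of 50 emails, and the standard mean of a beta distribution, successes / (successes + failures).

```python
def update_beta(successes, failures, clicks, sends):
    """Add observed clicks and non-clicks to the beta parameters."""
    return successes + clicks, failures + (sends - clicks)


def expected_ctr(successes, failures):
    """Mean of a beta distribution with these parameters."""
    return successes / (successes + failures)


prior = (1, 1)  # one success and one failure before any observation
posterior_a = update_beta(*prior, clicks=5, sends=50)

print(expected_ctr(*prior))                  # 0.5
print(posterior_a)                           # (6, 46)
print(round(expected_ctr(*posterior_a), 3))  # 0.115
```

After a single round, the expected CTR has already moved from the assumed 0.5 most of the way to the observed 0.1. Every further round adds its clicks and non-clicks in the same way.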

Real World Example

Let’s look at a real-world example. One of our e-commerce clients wants to test several email templates and wishes to maximize the click-to-open rate (CTOR). The target campaign is the welcome email campaign for new signup users who are predicted to have high intent to buy. Our client has prepared four different templates, and would like to figure out which template will work best. Let’s run the multi-armed bandit.

Figure 1 gives an idea of how the multi-armed bandit chooses the best template from these four. The top graph shows cumulative CTORs for the four different templates over time. The x-axis is the date and the y-axis is cumulative CTOR. The bottom graph shows the percentage of daily sent emails for the target campaign. The x-axis is the date and the y-axis shows the percentage of daily sent emails. Every day the total percentage of daily sent emails is 100%.

On day 1, all four templates have the same beta distributions (prior beliefs) and each template has 25% of daily email sends. We can see that the areas for the four templates on day 1 are similar to each other. Once we receive feedback from users, our beliefs must change based on these observations. Over time, the winner becomes increasingly evident by looking at the CTORs of the templates. From day 2, template A is discovered to be the winner. Even though we see the winner, all the templates have around 25% of daily emails up to day 9.

Figure 1. Cumulative CTOR and percentage of sent emails

You might ask why each template receives the same share of emails even though the CTOR already points to a winner. Template A may appear to have won, but the machine isn’t certain until enough emails have been tested. On day 1, for instance, a hypothesis test can’t tell whether the top-performing template is really better than the second best.

Figure 2. CTOR difference between Template A and Template D on Day 1

In Figure 2, the CTOR of Template D (the best one on day 1) is higher than that of Template A (the second best), but the difference isn’t big enough. To check whether the two CTORs really differ, we can look at a statistical measure called a p-value. From a chi-square test, the p-value is 0.8. In general, when the p-value is less than 0.05, we can say that the two CTORs are not the same. Since the p-value here is 0.8, we can’t conclude there’s any real difference.
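
For readers who want to try such a test themselves, here is a minimal Python sketch of a chi-square test on two CTORs. It rests on several assumptions: the function name chi_square_p_value is made up, no continuity correction is applied (the article does not say whether one was used), and the click and open counts are purely hypothetical, since the raw counts behind the figures are not given. It therefore illustrates the method and does not reproduce the p-values quoted in this article.

```python
import math


def chi_square_p_value(clicks_a, opens_a, clicks_b, opens_b):
    """p-value of a 2x2 chi-square test (1 degree of freedom, no correction).

    Assumes both templates together have at least one click and one non-click.
    """
    table = [
        [clicks_a, opens_a - clicks_a],
        [clicks_b, opens_b - clicks_b],
    ]
    total = opens_a + opens_b
    row_totals = [opens_a, opens_b]
    col_totals = [clicks_a + clicks_b, total - clicks_a - clicks_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if both templates shared one CTOR.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square tail is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: a small gap and a large gap between two templates.
p_small_gap = chi_square_p_value(22, 200, 20, 200)
p_large_gap = chi_square_p_value(60, 200, 20, 200)
print(p_small_gap, p_large_gap)
```

With a small gap in CTOR the p-value stays well above 0.05, and with a large gap on the same number of opens it drops below 0.05. This is the same pattern the bandit waits for before it shifts most of the traffic to one template.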

Figure 3. CTOR difference between Template A and Template B on Day 11

Now, let’s look at what happened on day 11 in Figure 3, where Template A beats the second-best template, Template B. The p-value between these two CTORs on day 11 is 0.038, which is smaller than 0.05. Now we can name a clear winner with statistical confidence. Looking back at Figure 1, we can see that over time the majority of daily emails are sent with Template A.


We have learned what a multi-armed bandit is, how it works and what benefit it brings. Compared to traditional A/B testing, traffic spend is lower and marketers don’t need a copious amount of time to figure out which action/arm is the winner. The bandit automatically finds it. This is useful for marketers who need to run several campaigns with different actions at the same time. But what happens after? What best practices should marketers take when the winner is known? Stay tuned or shoot us a demo request and get involved in the future of marketing. In the meantime, check out some of our resources that will help you on your marketing journey.