Multi-armed bandits, and the Duolingo example

Duolingo, the company behind the famous language-learning app, faces a challenge similar to that of most companies whose existence largely depends on usage of an app or website. They want to promote “user engagement”. That’s to say it’s in their interest – and fortunately in this case also in the interest of their customers – to have people open their app regularly in order to take the language lessons they offer.

One method they use to encourage this behaviour is to send various push notifications to their users, with the underlying message of “It’s time to start a lesson!”, some of which are memorable enough that they’ve inspired a set of somewhat ominous memes.

But, given a big set of different messages all designed to encourage app engagement, how should they decide which specific notification to send a particular customer in order to maximise the chance that the customer opens the app?

A couple of years ago they were generous enough to publish a paper on the topic. In their case they were using a variation on a multi-armed bandit method.

Before we dig deeper below, I wanted to mention that I have no involvement with Duolingo other than being a customer of their services. The examples from them that I use below are simply what I gleaned from their paper, and all mistakes are no doubt my own.

What is a multi-armed bandit?

The one-armed inspiration

Nothing to do with Dick Turpin of course, but it is a metaphor for another real-life phenomenon. It’s a slang term for gambling slot machines, which reward you (or rather, rip you off) with jackpots according to an unknown probability distribution every time you pull the single arm that older machines tended to be operated by.

Conceptually, a multi-armed bandit is a set of multiple one-armed slot machines.

Imagine that these slot machines are all different, and may each have different probabilities of winning the jackpot. For example one machine might have a probability distribution such that 50% of the time you hit jackpot joy, whereas another has 70%, and another just 30%.

If you were somehow compelled to play these machines for 100 rounds, then obviously you’d want to repeatedly pull the arm of the 70% payoff machine. But here we’re imagining that you can’t tell just from looking at the machine what its rate of hitting the jackpot is.

All you can do in this scenario is try them out and see what happens. Sometimes you’ll win, sometimes you’ll lose. But if, whilst doing that, you could figure out which is the 70% jackpot machine and target most of your arm-pulls at that one, then you’ll do a lot better than if you invest in the others. The problem you would wish to solve is therefore: which of these slot machine arms should you pull, in which order, to get the maximum expected payoff?
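To make the setup concrete, here’s a tiny Python sketch of that scenario, using the made-up 50%/70%/30% payoff rates from above. The player only ever sees the outcomes of their own pulls, never the true rates.

```python
import random

# The true jackpot probabilities are hidden from the player; they only
# ever see the outcome of each individual pull.
TRUE_PAYOFF_RATES = [0.5, 0.7, 0.3]

def pull_arm(machine):
    """Pull one machine's arm: returns 1 for a jackpot, 0 otherwise."""
    return 1 if random.random() < TRUE_PAYOFF_RATES[machine] else 0

# Try each machine a handful of times and tally what we observe. With so
# few pulls, random chance can easily make the wrong machine look best,
# which is exactly the problem the player needs to solve.
pulls_per_machine = 10
for machine in range(len(TRUE_PAYOFF_RATES)):
    wins = sum(pull_arm(machine) for _ in range(pulls_per_machine))
    print(f"Machine {machine}: {wins}/{pulls_per_machine} jackpots observed")
```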

This is basically the same problem that Duolingo faces with regards to sending its notifications, except we need to replace pulling the arms of fruit machines with sending push notifications, and big-money jackpots with the “reward” of getting a user to open the app and take a lesson.

It’s likely that some push notifications are on average more effective than others – perhaps resulting in the user opening the app 70% of the time instead of the 50% of the time another notification produces.

But when trying out new notification ideas, you don’t know in advance which notification is the most effective. If you did, you’d just send that one each time. Like the metaphorical fruit machines, the only way to learn anything about a new notification’s effectiveness is to “pull its arm”: send it to some users and see what happens.

The tension between exploration and exploitation

When choosing which arm to pull in these scenarios there’s always a trade-off between “exploration” and “exploitation”.

During exploration, you’re trying out new ideas, new strategies, new combinations you never tried before, in order to try and learn something about how effective they are. This is the only route towards improving your knowledge, but it’s also inefficient in the sense that if you happened to already be choosing the most rewarding option, you’re wasting some of its potential by experimenting with inferior options.

During exploitation, you’re taking what you’ve learned from your exploration, and using that to gain the highest expected reward from what you know. If you’ve figured out that one arm seems to have a distribution centered around a 70% success rate and the other has a similar distribution but this time centered around a 50% success rate, then you will exploit this knowledge and keep pulling on the 70% arm.

This gives you the biggest rewards based on what you know so far, but it also doesn’t let you learn anything new about possible better options, or changes in the performance of options over time.

One can imagine the in-experiment phase of a basic A/B test as being a 100% explore phase. Perhaps you’re sending notification A to half your users at random and notification B to the other half. After a few days you analyse the outcomes in each case and decide which is better.

This experiment exists to let you learn something new, but you don’t exploit that knowledge until after the experiment is ended. If notification A proves immediately superior very early on in the experiment then…nothing changes. Per your experiment’s design, you should keep on selecting users to receive notification A or B at random, even though in theory you might already have shown that A is better.

After an A/B test ends, you typically enter a 100% exploitation phase. Now you know which notification works best, you send that to all eligible users. Assuming your finding holds, this maximizes the return from what you learnt. After all, why would you send a known inferior option?

But if there are even better possibilities out there – a notification C – or if the relative effectiveness of the chosen notification changes over time, then you’ll never discover them unless you run a new test.

Multi-armed bandit methods try to address the tension between exploring new options and exploiting known outcomes by continuously using some algorithm to choose when to pull the arm we know has had the best results historically, vs when to try different or new arms.

One fairly easy-to-understand algorithm used in this context is to choose the option that has historically worked best in a certain percentage of decisions, say 90%, and to select an option entirely at random for the other 10%.

In 90% of cases you’re exploiting what you already know, which as you learn more and more should provide greater and greater reward. But with the remaining 10% you give yourself a chance to learn something new. A strategy like this is known as the epsilon-greedy algorithm.
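As a rough illustration, here’s what a single epsilon-greedy decision might look like in Python. The 0.1 value of epsilon matches the 90%/10% split above; the tally-keeping is my own minimal bookkeeping rather than anything from the Duolingo paper.

```python
import random

def epsilon_greedy_choice(total_rewards, pull_counts, epsilon=0.1):
    """Pick an arm index: explore at random with probability epsilon,
    otherwise exploit the arm with the best observed average reward."""
    n_arms = len(pull_counts)
    if random.random() < epsilon:
        return random.randrange(n_arms)  # explore (the 10% of decisions)
    averages = [
        total_rewards[i] / pull_counts[i] if pull_counts[i] else 0.0
        for i in range(n_arms)
    ]
    return max(range(n_arms), key=lambda i: averages[i])  # exploit

# Running tallies per arm: total reward seen so far and number of pulls.
total_rewards = [0.0, 0.0, 0.0]
pull_counts = [0, 0, 0]

arm = epsilon_greedy_choice(total_rewards, pull_counts)
# ...pull that arm (e.g. send that notification), observe a reward r of
# 0 or 1, then update the tallies:
# total_rewards[arm] += r
# pull_counts[arm] += 1
```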

Limitations of, and workarounds for, the basic multi-armed bandit method

The Duolingo paper highlights several limitations of the standard type of multi-armed bandit algorithm outlined above which made it less suitable for their particular notification-testing use case.

The novelty effect

Past analysis had shown Duolingo that notifications that were “fresh” for a given user were intrinsically more engaging than those that weren’t. That is to say, if the user received the same notification 5 times in a row, even though the content didn’t change, its effectiveness seemed to decrease.

In terms of the explore/exploit tradeoff, they wanted both to be able to exploit this novelty effect, and to explore with the awareness that the initial results for any particular new notification idea may be boosted simply because it’s new, not necessarily because it’s better.

The paper details that they handled this by adding a “recency penalty” to any arm if it has previously been chosen for the user in question. This penalty is set to reduce over time, based on the idea that human memory can be modelled as an exponential decay function.

This approach is called a “recovering arm”.
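The paper has its own precise formulation of this penalty; the sketch below only illustrates the general shape of the idea, with the penalty size and decay rate as entirely made-up parameters.

```python
import math

def recency_penalty(days_since_last_sent, initial_penalty=1.0, decay_rate=0.3):
    """Penalty applied to an arm's score if this user has seen it recently.
    It starts at `initial_penalty` and decays exponentially as the memory
    of the notification fades; both parameters are illustrative only."""
    if days_since_last_sent is None:  # never sent to this user before
        return 0.0
    return initial_penalty * math.exp(-decay_rate * days_since_last_sent)

def recovering_arm_score(base_score, days_since_last_sent):
    """Subtract the recency penalty from the arm's learned score before
    comparing arms for this particular user."""
    return base_score - recency_penalty(days_since_last_sent)
```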

Conditional arms

A basic multi-armed bandit algorithm assumes that all options are available to all users at any given time. This isn’t actually true in Duolingo’s notifications strategy. Some notifications are only available for users who meet certain conditions.

This complicates things because a notification which is only ever sent to users who are already highly engaged (perhaps “Well done on your 1000 day streak!”) will potentially always be more successful – as measured by whether the user goes on to open the app – than one sent to very disengaged users (“We are sad you didn’t open your app for 1000 days”).

But knowing that the former notification provides a much higher “reward” than the latter isn’t very useful when the two reflect the behaviour of very different populations. Plus, in this example, you’re never trying to decide between them for any individual decision.

Their solution here is to measure the efficacy of each arm by looking at the relative difference between the rewards when a particular arm is chosen vs the occasions when the arm was eligible but in fact a different arm was chosen. As these events both imply that the user was eligible for the studied notification, the effect of the criteria that made them eligible is controlled for. A weighting is then applied to correct for the bias that any changes in notification policies over time could have when calculating a basic average.

This approach is called a “sleeping arm”.
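To sketch the idea (and only the idea: the record format and the exact form of the comparison below are my guesses, and the policy-change weighting is omitted entirely), you could imagine scoring an arm from the decisions in which it was eligible like this:

```python
def sleeping_arm_score(records):
    """Score one arm from decision records in which that arm was ELIGIBLE.
    Each record is assumed to look like {"chosen": bool, "reward": 0 or 1}.
    The score is the relative difference between the average reward when
    this arm was actually chosen and the average reward when some other
    arm was chosen instead. Both groups met the same eligibility criteria,
    so the effect of those criteria is controlled for."""
    chosen = [r["reward"] for r in records if r["chosen"]]
    not_chosen = [r["reward"] for r in records if not r["chosen"]]
    if not chosen or not not_chosen:
        return None  # not enough data to make the comparison
    mean_chosen = sum(chosen) / len(chosen)
    mean_not_chosen = sum(not_chosen) / len(not_chosen)
    if mean_not_chosen == 0:
        return None
    return (mean_chosen - mean_not_chosen) / mean_not_chosen
```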

Rewards that change over time

As with almost any intervention of this nature, it could be that the “reward” a particular arm produces changes over time. The Covid-19 pandemic has provided a perfect, awful, example of this. The public’s priorities, sensitivities and behaviour were very likely affected in many ways. In less turbulent times, fashions still change, there may be seasonal or other trends in what you’re measuring, or perhaps the underlying demographics of your user base alter over time.

Conventional A/B tests, with their 100% explore followed by 100% exploit approach, don’t have any built-in way of dealing with this as such. The answer would be to keep running new A/B tests as time goes on to validate or update previous learnings. Predictive models do usually need to be re-validated and updated regularly.

But even a “continuous” experiment such as this multi-arm bandit that relies at least somewhat on historical data may not adapt very quickly to changes if the effect of the mass of old data outweighs a relatively smaller amount of newer data.

In the 90% exploit 10% explore scenario we discussed above, the 10% of arms that are selected for exploration purposes will give us up to the minute information on how the notification is performing in today’s climate.

But if the “exploiting” 90% of notification selections are calculated based on what has worked historically, perhaps even over many years, then it will take a long time for any newer findings to make an impact on the aggregate averages of the whole dataset.

Duolingo handle this by simply deleting some of the historical data their model uses. A smaller mass of old data means that the newer data will have more impact on the decisions being made. But it’s obviously not a cost-free solution. By removing data, you reduce your sample sizes, meaning that your performance estimates are necessarily less certain. To ensure they don’t delete too much, they keep enough historical data that the 95% credible interval for an arm’s score stays within 0.15 percentage points.

How much historical data this implies should be kept varies a lot between arms, based on how many times they were eligible for selection and their performance. From a chart they show in the paper, it seems like they keep history for some arms for less than 25 days, whereas others stay in the dataset for 10 times longer than that.
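As a rough sketch of what that retention check might involve: if you treat an arm’s score as a simple success rate with a Beta posterior (a simplification of whatever Duolingo actually model), you could test whether the 95% credible interval is already narrow enough to allow older records to be dropped.

```python
from scipy.stats import beta

def credible_interval_width(successes, failures):
    """Width of the 95% credible interval for a success rate, using a
    Beta(successes + 1, failures + 1) posterior (i.e. a uniform prior)."""
    a, b = successes + 1, failures + 1
    return beta.ppf(0.975, a, b) - beta.ppf(0.025, a, b)

def history_is_sufficient(successes, failures, max_width=0.0015):
    """True if the interval is already narrower than 0.15 percentage
    points, suggesting records older than this window could be dropped."""
    return credible_interval_width(successes, failures) < max_width
```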

Arms with small sample sizes

This one isn’t particularly a bandit-related issue, other than that the high variance associated with small sample sizes may mean that some arms initially show a performance so low due to random chance that they’re very rarely chosen again in the future. If an arm is only very rarely selected then it will take a long time to procure enough data to stabilise the estimate towards its true value.

To combat some of this, Duolingo apply Bayesian techniques in order to shrink the differences towards a prior that is based on the results of larger arms.
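The simplest version of that kind of shrinkage looks something like the pseudo-count approach below; the prior strength and rates are invented for illustration, and the paper’s actual model is more involved.

```python
def shrunk_estimate(successes, trials, prior_rate, prior_strength=100):
    """Shrink a small arm's raw success rate towards a prior rate, here
    imagined as coming from the pooled performance of larger, better-
    measured arms. `prior_strength` acts like a pseudo-count: the fewer
    real observations an arm has, the more it is pulled towards the prior."""
    return (successes + prior_strength * prior_rate) / (trials + prior_strength)

# A new arm with 3 successes from 5 sends naively looks 60% effective;
# shrinking towards a pooled prior of, say, 45% gives a far more cautious
# estimate of roughly 46%.
print(shrunk_estimate(successes=3, trials=5, prior_rate=0.45))
```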

Full credit is due to Duolingo for sharing their work!

I often find reading the details of how other organisations solve common problems in a data-driven way both interesting and educational, even when it’s not a problem I’m personally immediately working on.

Although happy to be very public about being “data-driven” and the like, many companies seem fairly secretive about how they actually use the data to drive anything. I applaud Duolingo for publishing the details of the problem they were trying to solve, what their solution was, and to what extent it worked.

They even shared a 200 million row dataset of training and test data one can use to replicate their work. I hope more organisations, commercial and otherwise, will follow that path.
