Multi-Armed Bandits
A bandit problem models sequential decision-making under uncertainty: at every step an agent selects one action (an arm), receives a scalar reward, and updates its belief about that arm.
Unlike most other reinforcement learning problems, there are no state transitions and no delayed rewards. The goal is to maximize the cumulative reward over a series of trials by balancing exploration (trying out different arms to learn their rewards) and exploitation (choosing the arm that currently seems best).
$$
\max \sum_{t=1}^{T} r_t
$$
Examples of problems that bandits are well suited for include:
- Clinical trials, where different treatments (arms) are tested and the goal is to identify the most effective one while minimizing patient exposure to less effective treatments.
- Online advertising, where different ads (arms) are shown to users and the goal is to maximize click-through rates by identifying the most engaging ads.
- Recommendation systems, where different items (arms) are recommended to users and the goal is to maximize user satisfaction by identifying the most relevant items.
There are several algorithms to solve bandit problems, including:
- Epsilon-Greedy: This algorithm selects the best-known arm with probability (1-epsilon) and explores a random arm with probability epsilon. This balances exploration and exploitation.
- Upper Confidence Bound (UCB): This algorithm selects arms based on their estimated value and an uncertainty term that encourages exploration of less frequently chosen arms.
- Thompson Sampling: This algorithm maintains a probability distribution over the expected rewards of each arm and samples from these distributions to select arms, balancing exploration and exploitation in a probabilistic manner.
Now let's work through a practical example where bandits could be deployed.
Consider the scenario mentioned above: an online advertiser wants to maximize click-through rates on the content they show, but does not know the effectiveness of the different strategies beforehand.
Imagine they have three approaches they are trialing: a pop-up ad while someone is watching a video (Ad_A), a banner ad at the bottom of the video (Ad_B), and a sponsored reading embedded inside the video being watched (Ad_C).
Each time a user tries to watch a video, they will be shown one of these ad types and the goal is to simultaneously learn which ad type is most effective without losing too many potential clicks by showing less effective ads.
This is a classic exploration-exploitation trade-off scenario that bandit algorithms are designed to solve.
In this case, each ad type represents an arm of the bandit. The reward is binary: either the user clicks on the ad (reward = 1) or they do not (reward = 0).
We can set up a simulation in Python for this kind of scenario, where we try to uncover the effectiveness of the advertising strategies.
In the demo below, each arm is an advertisement.
When an ad is shown, the reward is binary: a click (1) or no click (0).
In the simulation we hard-code the true click-through rates (CTR) to generate clicks, but the bandit algorithms only observe the clicks and must learn which ad is best.
Component 1 — Environment.
Define an `Advertisement` class with a hidden CTR, and create three ads with different CTRs.
```python
import numpy as np

class Advertisement:
    def __init__(self, name, click_through_rate):
        self.name = name
        self.click_through_rate = click_through_rate

    def pull(self):
        # Simulate one impression: a click happens with probability click_through_rate
        if np.random.rand() < self.click_through_rate:
            return 1  # someone clicked
        return 0  # no click

# For the sake of our simulation, we explicitly state the click-through rates,
# but in reality these would be unknown to us.
advertisements = [
    Advertisement("Ad_A", 0.3),
    Advertisement("Ad_B", 0.5),
    Advertisement("Ad_C", 0.7),
]
```
Component 2 — Helpers.
Two small utilities:
- `print_click_through_record`: summarize impressions/clicks and print estimated vs. true CTR.
- `plot_click_through_record`: plot estimated CTR over impressions and save a PNG in `static/`.
```python
def print_click_through_record(click_through_record):
    # Across all recorded rounds, how many clicks did we get?
    total_clicks = 0
    for ad_name in [advertisement.name for advertisement in advertisements]:
        clicks = [record[1] for record in click_through_record if record[0] == ad_name]
        estimated_ctr = np.average(clicks) if clicks else 0
        actual_ctr = [ad.click_through_rate for ad in advertisements if ad.name == ad_name][0]
        print(f"Estimated CTR for {ad_name}:\n\t{estimated_ctr:.3f} based on {len(clicks)} impressions"
              f"\n\tReceived clicks: {np.sum(clicks)} across {len(clicks)} impressions")
        print(f"Actual CTR for {ad_name}:\n\t{actual_ctr:.3f}\n")
        total_clicks += np.sum(clicks)
    print(f"Total clicks across {len(click_through_record)} impressions: {total_clicks}")
```
```python
import matplotlib.pyplot as plt

def plot_click_through_record(click_through_record, name):
    win_rate = {}  # running CTR estimate per ad
    for ad_name in [advertisement.name for advertisement in advertisements]:
        clicks = [record[1] for record in click_through_record if record[0] == ad_name]
        cumulative_clicks = np.cumsum(clicks)
        impressions = np.arange(1, len(clicks) + 1)
        win_rate[ad_name] = cumulative_clicks / impressions
        plt.plot(impressions, win_rate[ad_name], label=ad_name)
    plt.xlabel("Number of Impressions")
    plt.ylabel("Estimated Click-Through Rate")
    plt.legend()
    plt.title("Estimated Click-Through Rate Over Time for Each Advertisement")
    plt.savefig(f"static/rl_1_{name}.png")
    plt.close()
```
Default Strategy: Random Sampling.
This is the “do nothing smart” baseline: pick an ad uniformly at random each round. You get good CTR estimates eventually, but you waste many rounds showing suboptimal ads.
```python
# start a record to keep track of clicks for each advertisement
click_through_record = []  # list of (ad_name, click_result) pairs
win_rate = {ad.name: [] for ad in advertisements}

# Start a simulation based on random sampling
num_rounds = 1000
for _ in range(num_rounds):
    # select an advertisement randomly
    ad = np.random.choice(advertisements)
    # simulate pulling the advertisement (showing it to a user)
    click = ad.pull()
    # record the result
    click_through_record.append((ad.name, click))

print_click_through_record(click_through_record)
plot_click_through_record(click_through_record, "random_sampling")
```
Putting this all together, we can see that random sampling produces the following results:
Multi-Armed Bandit Strategy 1 — Epsilon-greedy.
With probability \(\varepsilon\) explore (pick a random arm), otherwise exploit (pick the arm with the highest current estimated CTR):
$$
a_t =
\begin{cases}
\text{random arm} & \text{with prob. } \varepsilon \\
\arg\max_a \hat{\mu}_a & \text{with prob. } 1-\varepsilon
\end{cases}
$$
The `test` flag disables exploration (always exploit).
Theory: why epsilon-greedy works. Epsilon-greedy is the simplest explicit exploration rule: it forces exploration a fraction of the time so every arm is sampled infinitely often (if \(\varepsilon\) does not go to zero too quickly), while still exploiting the best current estimate most of the time.
A common way to evaluate a bandit strategy is *regret*, the expected loss relative to always playing the best arm \(a^*\) with mean reward \(\mu^*\):
$$
\mathcal{R}(T)=T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right].
$$
Fixed \(\varepsilon\) is easy to understand and implement, but it keeps exploring forever, which typically yields *linear regret* in \(T\) (you keep wasting roughly \(\varepsilon T\) pulls on suboptimal arms).
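For a finished run, regret can be estimated empirically from the recorded clicks. A minimal sketch, assuming the `(ad_name, click)` record format used in the simulations above; `empirical_regret` is a helper name of ours, not part of the demo:

```python
import numpy as np

def empirical_regret(click_through_record, best_ctr):
    # hypothetical helper: T * mu_star minus the total reward actually collected
    rewards = np.array([click for _, click in click_through_record])
    return len(rewards) * best_ctr - rewards.sum()

# e.g. 10 rounds against a best arm with CTR 0.7, collecting 5 clicks
record = [("Ad_A", 1), ("Ad_A", 0)] * 5
print(empirical_regret(record, 0.7))  # 10 * 0.7 - 5 = 2.0
```

Note this compares against the true best CTR, which we only know here because the simulation hard-codes it.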
To reduce wasted exploration, practitioners often use a decaying schedule such as \(\varepsilon_t = \min\{1,\, c/t\}\) or \(\varepsilon_t = c/\sqrt{t}\), which increases exploitation over time.
Practical notes. Early in learning, \(\hat{\mu}_a\) estimates are noisy, so epsilon-greedy can lock onto a suboptimal arm if \(\varepsilon\) is too small.
If \(\varepsilon\) is too large, learning is slower because you keep exploring even after the best arm is obvious.
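The decaying schedules mentioned above are easy to sketch. An illustrative snippet, not part of the demo; the schedule constant `c` is an assumed choice:

```python
def epsilon_schedule(t, c=5.0):
    # hypothetical decaying schedule: epsilon_t = min(1, c / t)
    # early rounds explore freely; exploration fades as evidence accumulates
    return min(1.0, c / t)

print([epsilon_schedule(t) for t in [1, 10, 100]])  # [1.0, 0.5, 0.05]
```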
The `test` mode in this demo is a useful way to show “pure exploitation” once training is done (or to demonstrate what happens if you stop exploring too early).
```python
class EpsilonGreedy:
    def __init__(self, arms, epsilon=0.1, test=False, seed=None):
        self.arms = arms
        self.epsilon = float(epsilon)
        self.test = bool(test)
        self.rng = np.random.default_rng(seed)
        # Start all estimates equal (uninformed)
        self.estimated_val = {i: 0.5 for i, arm in enumerate(arms)}  # equal priors
        self.n = {i: 0 for i, arm in enumerate(arms)}  # times each arm was played

    def choice(self):
        explore = (not self.test) and (self.rng.random() < self.epsilon)
        if explore:
            # this is just the same as our random-sampling approach earlier!
            return self.rng.choice(list(self.estimated_val.keys()))
        # exploit: choose the arm(s) with the highest estimated CTR; break ties randomly
        best_val = max(self.estimated_val.values())
        best_arms = [i for i, v in self.estimated_val.items() if v == best_val]
        return int(self.rng.choice(best_arms))

    def update(self, arm, click):
        """
        Update internal estimates after observing reward.
        click is expected to be 0/1.
        """
        self.n[arm] += 1
        # incremental mean update
        self.estimated_val[arm] = ((self.n[arm] - 1) * self.estimated_val[arm] + click) / self.n[arm]
```
```python
# start a record to keep track of clicks for each advertisement
click_through_record = []  # list of (ad_name, click_result) pairs
win_rate = {ad.name: [] for ad in advertisements}

# Create epsilon-greedy strategy
eg = EpsilonGreedy(advertisements, epsilon=0.1, test=False, seed=0)

# Start a simulation based on epsilon-greedy sampling
num_rounds = 1000
for _ in range(num_rounds):
    # in our example, this returns an arm index
    ad_i = eg.choice()
    ad = advertisements[ad_i]
    # simulate pulling the advertisement (showing it to a user)
    click = ad.pull()
    # record the result
    click_through_record.append((ad.name, click))
    # update epsilon-greedy estimates
    eg.update(ad_i, click)

print("Epsilon-Greedy Strategy Results:")
print_click_through_record(click_through_record)
plot_click_through_record(click_through_record, "epsilon_greedy")
```
This produces:
Multi-Armed Bandit Strategy 2 — Upper Confidence Bound (UCB1).
UCB adds an “uncertainty bonus” to the estimated mean reward so that rarely-tried arms get explored automatically:
$$
a_t = \arg\max_a \left( \hat{\mu}_a + c\sqrt{\frac{\ln t}{n_a}} \right)
$$
where \(n_a\) is the number of times arm \(a\) has been played, and \(c\) controls how optimistic the bonus is. The implementation below plays each arm once before using the bound to avoid division-by-zero.
Theory: optimism under uncertainty. UCB-style methods pick the arm with the highest *upper confidence bound* on its true mean reward. The bound comes from concentration inequalities: for bounded rewards (e.g., clicks in \([0,1]\)), the empirical mean \(\hat{\mu}_a\) concentrates around the true mean \(\mu_a\) as \(n_a\) grows.
A standard heuristic is to choose an uncertainty radius that shrinks like \(\sqrt{\ln t / n_a}\), which is large for rarely-sampled arms and small for frequently-sampled arms.
This makes exploration “automatic”: arms you have not tried much get a large bonus, but the bonus disappears once you have enough evidence they are worse.
Regret intuition. In the stochastic setting, UCB1 achieves logarithmic regret in \(T\) under mild assumptions: roughly, suboptimal arms are pulled only \(O(\ln T)\) times because their confidence bonus eventually becomes too small to compete with the optimal arm’s mean.
The parameter \(c\) in the simplified formula here scales the width of the bonus:
larger \(c\) encourages more exploration; smaller \(c\) encourages faster exploitation (but risks under-exploring).
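To see how the bonus term behaves in isolation, here is a standalone sketch (the function name `ucb_bonus` is ours, not part of the demo):

```python
import numpy as np

def ucb_bonus(t, n_a, c=1.0):
    # uncertainty bonus: grows slowly with total pulls t, shrinks as arm pulls n_a grow
    return c * np.sqrt(np.log(t) / n_a)

# a rarely-tried arm gets a much larger bonus than a well-sampled one
print(ucb_bonus(1000, 5) > ucb_bonus(1000, 500))       # True
# larger c widens the bonus, encouraging more exploration
print(ucb_bonus(1000, 5, c=2.0) > ucb_bonus(1000, 5))  # True
```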
```python
class UpperConfidenceBound:
    def __init__(self, arms, c=1.0, test=False, seed=None):
        self.arms = arms
        self.c = float(c)
        self.test = bool(test)  # kept for symmetry; UCB is already deterministic given tie-breaking
        self.rng = np.random.default_rng(seed)
        # Start all estimates equal (uninformed)
        self.estimated_val = {i: 0.5 for i, arm in enumerate(arms)}  # equal priors
        self.n = {i: 0 for i, arm in enumerate(arms)}  # times each arm was played
        self.t = 0  # total number of pulls so far

    def choice(self):
        # Pull each arm once first (avoids division by zero; standard UCB initialization)
        for i in self.n:
            if self.n[i] == 0:
                return i
        # UCB score: mean + c * sqrt(log(t) / n_i)
        ucb_scores = {}
        for i in self.n:
            bonus = self.c * np.sqrt(np.log(self.t) / self.n[i])
            ucb_scores[i] = self.estimated_val[i] + bonus
        # Pick the arm with the highest UCB score; break ties randomly
        best_score = max(ucb_scores.values())
        best_arms = [i for i, s in ucb_scores.items() if s == best_score]
        return int(self.rng.choice(best_arms))

    def update(self, arm, click):
        """
        Update internal estimates after observing reward.
        click is expected to be 0/1.
        """
        self.t += 1
        self.n[arm] += 1
        self.estimated_val[arm] = ((self.n[arm] - 1) * self.estimated_val[arm] + click) / self.n[arm]
```
```python
# start a record to keep track of clicks for each advertisement
click_through_record = []  # list of (ad_name, click_result) pairs
win_rate = {ad.name: [] for ad in advertisements}

# Create UCB strategy
ucb = UpperConfidenceBound(advertisements, c=1.0, test=False, seed=0)

# Start a simulation based on UCB sampling
num_rounds = 1000
for _ in range(num_rounds):
    ad_i = ucb.choice()
    ad = advertisements[ad_i]
    click = ad.pull()
    click_through_record.append((ad.name, click))
    ucb.update(ad_i, click)

print("Upper Confidence Bound (UCB) Strategy Results:")
print_click_through_record(click_through_record)
plot_click_through_record(click_through_record, "ucb")
```
This produces:
Multi-Armed Bandit Strategy 3 — Thompson sampling.
For binary rewards, maintain a Beta posterior for each arm’s CTR. Each round:
- Sample \(\theta_a \sim \mathrm{Beta}(\alpha_a, \beta_a)\) for each arm
- Choose \(a_t = \arg\max_a \theta_a\)
- Update the posterior with the observed click: \(\alpha_a \leftarrow \alpha_a + r\), \(\beta_a \leftarrow \beta_a + (1-r)\)
The `test` flag switches to choosing the maximum posterior mean instead of sampling.
Theory: Bayesian posterior sampling (probability matching). Thompson sampling assumes each arm has an unknown parameter (here the CTR), maintains a posterior distribution over that parameter given the data, and then samples a plausible CTR from each posterior to decide which arm to play.
This is sometimes described as *probability matching*: the probability you play arm \(a\) equals the posterior probability that arm \(a\) is optimal.
Exploration emerges naturally because uncertain arms have wider posteriors and sometimes “win” the sampling step.
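Probability matching can be checked numerically: sample from each arm's posterior many times and count how often each arm wins. An illustrative two-arm sketch with made-up posterior counts, not part of the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up posteriors: arm A has seen 3 clicks / 1 miss, arm B 1 click / 3 misses
alpha = {"A": 4.0, "B": 2.0}
beta = {"A": 2.0, "B": 4.0}

# Monte Carlo estimate of P(arm A is optimal): fraction of joint samples A wins
samples = {a: rng.beta(alpha[a], beta[a], size=100_000) for a in alpha}
p_a_optimal = np.mean(samples["A"] > samples["B"])
print(p_a_optimal)  # approximately 0.9: Thompson sampling would play A about 90% of the time
```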
Beta–Bernoulli conjugacy. With binary rewards, a Bernoulli likelihood and a Beta prior are conjugate:
starting with \(\theta_a \sim \mathrm{Beta}(\alpha_0,\beta_0)\), after observing clicks \(r\in\{0,1\}\) the posterior stays Beta with
$$
\alpha_a \leftarrow \alpha_a + r,\qquad \beta_a \leftarrow \beta_a + (1-r).
$$
The posterior mean used in `test` mode is
$$
\mathbb{E}[\theta_a] = \frac{\alpha_a}{\alpha_a+\beta_a}.
$$
Choosing \(\alpha_0=\beta_0=1\) is a uniform prior; increasing both values makes the prior more “confident” and slows early learning.
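The conjugate update is a two-line computation. A minimal sketch with a made-up click sequence (the variable names here are ours):

```python
# uniform prior Beta(1, 1), then observe five impressions with click pattern 1,0,1,1,0
alpha0, beta0 = 1.0, 1.0
clicks = [1, 0, 1, 1, 0]

alpha = alpha0 + sum(clicks)              # 1 + 3 = 4
beta = beta0 + len(clicks) - sum(clicks)  # 1 + 2 = 3
posterior_mean = alpha / (alpha + beta)
print(posterior_mean)  # 4/7 ≈ 0.571
```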
Regret intuition. In many stochastic bandit settings, Thompson sampling also achieves logarithmic regret (often matching UCB up to constants) while being simple and effective in practice.
It tends to behave sensibly in the early phase: if an arm has not been tried, its posterior is broad, so it has a meaningful chance to be selected without requiring an explicit exploration rate.
```python
class ThompsonSampling:
    def __init__(self, arms, test=False, seed=None, alpha0=1.0, beta0=1.0):
        self.arms = arms
        self.test = bool(test)
        self.rng = np.random.default_rng(seed)
        # Beta prior per arm: Beta(alpha, beta)
        self.alpha0 = float(alpha0)
        self.beta0 = float(beta0)
        # Start all estimates equal (uninformed)
        self.estimated_val = {i: 0.5 for i, arm in enumerate(arms)}  # for plotting/inspection
        self.n = {i: 0 for i, arm in enumerate(arms)}  # times each arm was played
        # Posterior parameters (initialized to prior)
        self.alpha = {i: self.alpha0 for i, arm in enumerate(arms)}
        self.beta = {i: self.beta0 for i, arm in enumerate(arms)}

    def choice(self):
        if self.test:
            # Deterministic "test": choose highest posterior mean
            posterior_means = {i: self.alpha[i] / (self.alpha[i] + self.beta[i]) for i in self.alpha}
            best_val = max(posterior_means.values())
            best_arms = [i for i, v in posterior_means.items() if v == best_val]
            return int(self.rng.choice(best_arms))
        # Sample theta_i ~ Beta(alpha_i, beta_i) for each arm, pick argmax
        samples = {i: self.rng.beta(self.alpha[i], self.beta[i]) for i in self.alpha}
        best_sample = max(samples.values())
        best_arms = [i for i, s in samples.items() if s == best_sample]
        return int(self.rng.choice(best_arms))

    def update(self, arm, click):
        """
        Update Beta posterior after observing reward.
        click is expected to be 0/1.
        """
        click = int(click)
        self.n[arm] += 1
        self.alpha[arm] += click
        self.beta[arm] += (1 - click)
        # Maintain an "estimated_val" (posterior mean) for easy printing/plotting
        self.estimated_val[arm] = self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
```
```python
# start a record to keep track of clicks for each advertisement
click_through_record = []  # list of (ad_name, click_result) pairs
win_rate = {ad.name: [] for ad in advertisements}

# Create Thompson sampling strategy
ts = ThompsonSampling(advertisements, test=False, seed=0, alpha0=1.0, beta0=1.0)

# Start a simulation based on Thompson sampling
num_rounds = 1000
for _ in range(num_rounds):
    ad_i = ts.choice()
    ad = advertisements[ad_i]
    click = ad.pull()
    click_through_record.append((ad.name, click))
    ts.update(ad_i, click)

print("Thompson Sampling Strategy Results:")
print_click_through_record(click_through_record)
plot_click_through_record(click_through_record, "thompson_sampling")
```
This produces:
Summary.
Random sampling learns CTRs but wastes many rounds. Epsilon-greedy explores at a fixed rate, UCB explores based on uncertainty, and Thompson sampling explores probabilistically via posterior sampling.
There are different variants of bandit problems, including:
- Stochastic Bandits: In this variant, each arm has a fixed but unknown probability distribution of rewards. The agent's goal is to learn these distributions over time and select the arm with the highest expected reward.
- Adversarial Bandits: In this variant, the reward distributions can change over time, potentially in response to the agent's actions. The agent must adapt its strategy to perform well against an adversarial environment.
- Contextual Bandits: In this variant, the agent also observes some context or features before selecting an arm. The goal is to learn a policy that maps contexts to arms to maximize rewards.
- Non-Stationary Bandits: In this variant, the reward distributions can change over time independently of the agent's actions. The agent must continuously adapt to these changes to maintain good performance.