Monte Carlo Methods in Reinforcement Learning
So far we relied on a dynamics model and a reward model to reason about the future. In many real tasks we do not know either: building an accurate dynamics model can be harder than finding a good policy, and reward signals are often sparse (e.g., success/failure at the end of a quadcopter trick, win/loss at the end of a game).
Typical settings with unknown or costly models include:
- Competitive games such as chess, backgammon, or Go.
- Complex locomotion and driving tasks where dynamics are hard to write down exactly.
When the model is unknown but can be sampled through interaction, we can
learn the optimal strategy instead of
planning it. Monte Carlo (MC) methods do this by estimating state-action values \(Q^\pi(s,a)\) from experience.
Monte Carlo intuition (estimating the number \(\pi\) from random samples).
MC approximates a quantity by averaging successive random samples. As a warm-up, to estimate the constant \(\pi\) (the number here, not the policy):
- Area of a square: \(A_{\text{square}} = l \cdot w\).
- Area of a circle: \(A_{\text{circle}} = \pi r^2\).
- Set \(r = 0.5\), \(l = w = 1.0\). Then \(\frac{A_{\text{circle}}}{A_{\text{square}}} = \pi/4\), so \(\pi = 4 \cdot \frac{A_{\text{circle}}}{A_{\text{square}}}\).
If we uniformly sample points in the square, the fraction that land inside the circle converges to \(\pi/4\). The same averaging idea underpins MC value estimation.
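A minimal sketch of this estimator (the function name, seed, and sample count are my own choices):

```python
import numpy as np

def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square and
    counting the fraction that land in the inscribed circle (r = 0.5)."""
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(0.0, 1.0, size=(2, n_samples))
    inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25
    return 4.0 * inside.mean()

print(estimate_pi(1_000_000))  # converges toward 3.14159... as n_samples grows
```

With \(10^6\) samples the estimate is typically within about \(10^{-2}\) of \(\pi\); the error shrinks as \(1/\sqrt{n}\), the same rate that governs MC value estimates.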
Monte Carlo policy evaluation (for \(Q^\pi\)).
We initialize a policy \(\pi\) (often \(\varepsilon\)-soft). Run an episode:
\(\pi: S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T\)
Imagine a grid-world whose states are angle and angular rate (in practice the states could represent anything). We play one episode and obtain a trajectory (shaded dark to pale over time).

Work backward from the terminal state using returns (the return for a state-action pair is the discounted sum of rewards after taking that pair):
$$
G_t(s_t, a_t) = R_{t+1} + \gamma G_{t+1},
$$
with \(G_{t+1}\) the accumulated return beyond the next step. Example along a path:
$$
Q^\pi(\{6,7\}, A_{T-1}) = 5 + 0 = 5, \qquad
Q^\pi(\{5,7\}, A_{T-2}) = 5 + \gamma \cdot 5, \;\dots
$$
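In code, the backward sweep looks like this (a hypothetical helper; the reward sequence and \(\gamma\) are illustrative):

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * G_{t+1} for every step t,
    sweeping backward from the terminal reward."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Rewards R_1, R_2, R_3 of a three-step episode with gamma = 0.9:
# G_2 = 5, G_1 = 0 + 0.9 * 5 = 4.5, G_0 = 0 + 0.9 * 4.5 = 4.05
print(returns_from_rewards([0.0, 0.0, 5.0], gamma=0.9))
```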
Unifying samples across episodes.
For each state-action pair, store the list of observed returns and take the average:
$$
Q(S_t, A_t) = \text{Average}(G_t).
$$
Memory considerations.
\(N\) episodes of length \(M\) imply \(N \times M\) stored returns. To avoid unbounded memory, use an incremental mean:
$$
Q(s_t, a_t) = \frac{(N-1) \cdot Q_{\text{old}}(s_t, a_t)}{N} + \frac{G_t}{N},
$$
which only needs a running mean and a visit count per state-action pair.
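A quick check that the incremental form reproduces the stored-list average (the returns below are made up):

```python
returns = [2.0, 4.0, 9.0, 1.0]  # observed returns for one (s, a) pair

# Batch form: keep every return and average at the end.
batch = sum(returns) / len(returns)

# Incremental form: keep only a running mean Q and a visit count N.
Q, N = 0.0, 0
for G in returns:
    N += 1
    Q = (N - 1) * Q / N + G / N

print(batch, Q)  # both give the same mean (4.0 here)
```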
Fixed step size (forgetting factor).
Recent returns are often more informative than early ones. Replace \(1/N\) with a constant step size \(\alpha\):
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[G_t - Q(s_t, a_t)\big].
$$
This weights recent returns more heavily (each past return is discounted by a factor of \(1-\alpha\) per update), so early noisy returns stop anchoring the estimate as the policy changes.
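A minimal sketch of the forgetting effect, assuming the return distribution shifts mid-training (the helper name, step size, and numbers are illustrative):

```python
def constant_alpha_estimate(returns, alpha):
    """Run the constant step-size update Q <- Q + alpha * (G - Q)."""
    Q = 0.0
    for G in returns:
        Q += alpha * (G - Q)
    return Q

# Returns jump from 0 to 10 halfway through (e.g., the policy improved).
returns = [0.0] * 10 + [10.0] * 10

print(sum(returns) / len(returns))            # plain mean: 5.0, anchored by old returns
print(constant_alpha_estimate(returns, 0.3))  # ~9.7, tracks the recent level
```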
Monte Carlo policy iteration (on \(Q\)).
Combine Q-updates with policy improvement:
- Policy evaluation: update \(Q\) using the returns from the episode.
- Policy improvement: make the policy greedy with respect to the updated values, \(\pi(s) = \arg\max_a Q(s,a)\).
Greedy-only improvement can ignore unvisited but promising actions, so we soften the policy.
\(\varepsilon\)-soft policy.
$$
\pi(a \mid S_t) \leftarrow
\begin{cases}
1 - \varepsilon + \varepsilon / |A(S_t)|, & a = A^* \\
\varepsilon / |A(S_t)|, & a \neq A^*
\end{cases}
$$
This keeps exploring while favoring the current best action. In simulation you can also use exploring starts (sample random initial state-action pairs) to improve coverage. Empirically, as \(N \to \infty\), \(Q_\pi(s,a) \to Q_{\pi^*}(s,a)\) and \(\pi(a \mid s) \to \pi^*(a \mid s)\), though a full proof is still elusive.
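Building that distribution for a toy action set (a hypothetical helper; the Q-values are made up):

```python
import numpy as np

def eps_soft_probs(q_values, eps):
    """epsilon-soft distribution over actions: the greedy action gets
    1 - eps + eps/|A|, every other action gets eps/|A|."""
    n = len(q_values)
    probs = [eps / n] * n
    probs[int(np.argmax(q_values))] += 1 - eps
    return probs

p = eps_soft_probs([0.1, 0.5, 0.2], eps=0.1)
print(p)  # the middle (greedy) action carries most of the mass; probs sum to 1
```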
Properties and caveats.
- First technique in this series that learns directly from experience (no model required).
- Updates move backward from the episode's end, one step at a time.
- Sample inefficient: needs full episodes and only updates visited state-action pairs.
- Ambiguity can persist about which actions caused success (e.g., two pendulum trajectories ending at the same setpoint).
Worked example: Monte Carlo control on Blackjack (Gym).
Below is a walk-through of
python_scripts/rl_4_montecarlo_control.py, which learns \(Q(s,a)\) for the Blackjack-v0 environment using first-visit MC control with an \(\varepsilon\)-soft policy.
1) Imports and entry point. Main sets up the environment, agent, training loop, and win-rate tracking.
```python
import numpy as np
import gym
import matplotlib.pyplot as plt

if __name__ == "__main__":
    env = gym.make("Blackjack-v0")
    agent = Agent(eps=0.001)
    n_episodes = 200_000
    win_lose_draw = {-1: 0, 0: 0, 1: 0}
    win_rates = []
    for i in range(n_episodes):
        if i > 0 and i % 1000 == 0:
            win_rates.append(win_lose_draw[1] / i)
        if i % 50_000 == 0:
            print("starting episode", i,
                  "winrate %.3f" % (win_rates[-1] if win_rates else 0.0))
        state = env.reset()
        done = False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.memory.append((state, action, reward))
            state = next_state
        agent.update_Q()
        win_lose_draw[reward] += 1
    plt.plot(win_rates)
    plt.show()
```
2) Agent skeleton and stored data. The agent maintains \(Q\), per-visit returns, a first-visit flag per state-action, and episodic memory of (state, action, reward) tuples.
```python
class Agent:
    def __init__(self, eps=0.1, gamma=0.99):
        self.Q = {}
        self.sum_space = [i for i in range(4, 22)]
        self.dealer_show_card_space = [i + 1 for i in range(10)]
        self.ace_space = [False, True]
        self.action_space = [0, 1]  # stick or hit
        self.state_space = []
        self.returns = {}
        self.pairs_visited = {}
        self.memory = []
        self.gamma = gamma
        self.eps = eps
        self.init_vals()
        self.init_policy()
```
3) Initialize Q and an \(\varepsilon\)-soft policy. Enumerate every state (player sum, dealer showing, usable ace) and every action, seed Q to 0, and start with a uniform policy.
```python
    def init_vals(self):
        for total in self.sum_space:
            for card in self.dealer_show_card_space:
                for ace in self.ace_space:
                    state = (total, card, ace)
                    self.state_space.append(state)
                    for action in self.action_space:
                        self.Q[(state, action)] = 0.0
                        self.returns[(state, action)] = []
                        self.pairs_visited[(state, action)] = 0

    def init_policy(self):
        policy = {}
        n = len(self.action_space)
        for state in self.state_space:
            policy[state] = [1 / n for _ in range(n)]
        self.policy = policy
```
4) Action selection with the current policy. Draw an action from the \(\varepsilon\)-soft distribution.
```python
    def choose_action(self, state):
        return np.random.choice(self.action_space, p=self.policy[state])
```
5) First-visit MC update of Q and policy. For each first visit of a state-action pair in the episode, accumulate the discounted return and average it into Q. Then make the policy \(\varepsilon\)-greedy w.r.t. the updated Q.
```python
    def update_Q(self):
        # First-visit returns: only the first occurrence of each
        # (state, action) pair in the episode contributes a return.
        for idx, (state, action, _) in enumerate(self.memory):
            if self.pairs_visited[(state, action)] == 0:
                self.pairs_visited[(state, action)] += 1
                G = 0.0
                discount = 1.0
                for _, _, reward in self.memory[idx:]:
                    G += reward * discount
                    discount *= self.gamma
                self.returns[(state, action)].append(G)
        # Average the stored returns into Q, then refresh the policy.
        for state, action, _ in self.memory:
            self.Q[(state, action)] = np.mean(self.returns[(state, action)])
            self.update_policy(state)
        # Reset per-episode bookkeeping.
        for sa in self.pairs_visited:
            self.pairs_visited[sa] = 0
        self.memory = []

    def update_policy(self, state):
        actions = [self.Q[(state, a)] for a in self.action_space]
        a_max = int(np.argmax(actions))
        n_actions = len(self.action_space)
        self.policy[state] = [
            1 - self.eps + self.eps / n_actions if a == a_max else self.eps / n_actions
            for a in self.action_space
        ]
```
Value prediction instead of control.
The same Monte Carlo machinery can estimate a state-value function \(V^\pi(s)\) (prediction: evaluating a fixed policy) rather than a Q-function. The return from the first visit to state \(s_t\) in an episode is
$$
G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1},
$$
and the first-visit MC estimate is the sample average:
$$
V(s) \leftarrow \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s),
$$
or incrementally,
$$
V(s) \leftarrow V(s) + \alpha \big[G_t - V(s)\big].
$$
Below is a Blackjack example from
python_scripts/rl_4_montecarlo_prediction.py that learns \(V^\pi\) for a fixed (non-learning) policy.
1) Setup: value tables and a simple policy. We track \(V\), returns, and first-visit flags for every Blackjack state.
```python
import numpy as np
import gym

class Agent:
    def __init__(self, gamma=0.99):
        self.V = {}
        self.sum_space = [i for i in range(4, 22)]
        self.dealer_show_card_space = [i + 1 for i in range(10)]
        self.ace_space = [False, True]
        self.state_space = []
        self.returns = {}
        self.states_visited = {}
        self.memory = []
        self.gamma = gamma
        self.init_vals()

    def init_vals(self):
        for total in self.sum_space:
            for card in self.dealer_show_card_space:
                for ace in self.ace_space:
                    state = (total, card, ace)
                    self.V[state] = 0.0
                    self.returns[state] = []
                    self.states_visited[state] = 0
                    self.state_space.append(state)

    def policy(self, state):
        # Fixed policy: stick (0) on 20 or 21, otherwise hit (1).
        total, _, _ = state
        return 0 if total >= 20 else 1
```
2) First-visit Monte Carlo value update. For each state's first visit, accumulate the discounted return and average it into \(V\).
```python
    def update_V(self):
        # First-visit returns: only a state's first occurrence in the
        # episode contributes a return.
        for idx, (state, _) in enumerate(self.memory):
            if self.states_visited[state] == 0:
                self.states_visited[state] += 1
                G = 0.0
                discount = 1.0
                for _, reward in self.memory[idx:]:
                    G += reward * discount
                    discount *= self.gamma
                self.returns[state].append(G)
        for state, _ in self.memory:
            self.V[state] = np.mean(self.returns[state])
        # Reset per-episode bookkeeping.
        for state in self.state_space:
            self.states_visited[state] = 0
        self.memory = []
```
3) Training loop. Roll out many episodes under the fixed policy, then update \(V\) from the observed returns.
```python
if __name__ == "__main__":
    env = gym.make("Blackjack-v0")
    agent = Agent()
    n_episodes = 500_000
    for i in range(n_episodes):
        if i % 50_000 == 0:
            print("starting episode", i)
        state = env.reset()
        done = False
        while not done:
            action = agent.policy(state)
            next_state, reward, done, _ = env.step(action)
            agent.memory.append((state, reward))
            state = next_state
        agent.update_V()
    print(agent.V[(21, 3, True)])   # near win
    print(agent.V[(4, 1, False)])   # poor hand vs dealer ace
```