Monte Carlo Methods in Reinforcement Learning
So far we relied on a dynamics model and a reward model to reason about the future. In many real tasks we do not know either: building an accurate dynamics model can be harder than finding a good policy, and reward signals are often sparse (e.g., success/failure at the end of a quadcopter trick, win/loss at the end of a game).
Typical settings with unknown or costly models include:
- Competitive games such as chess, backgammon, or Go.
- Complex locomotion and driving tasks where dynamics are hard to write down exactly.
When the model is unknown but can be sampled through interaction, we can
learn the optimal strategy instead of
planning it. Monte Carlo (MC) methods do this by estimating state-action values \(Q^\pi(s,a)\) from experience.
Monte Carlo intuition (estimating the number \(\pi\) from random samples).
MC approximates a quantity by averaging successive random samples. As a warm-up, to estimate the constant \(\pi\) (the number here, not the policy):
- Area of a square: \(A_{\text{square}} = l \cdot w\).
- Area of a circle: \(A_{\text{circle}} = \pi r^2\).
- Set \(r = 0.5\), \(l = w = 1.0\). Then \(\frac{A_{\text{circle}}}{A_{\text{square}}} = \pi/4\), so \(\pi = 4 \cdot \frac{A_{\text{circle}}}{A_{\text{square}}}\).
If we uniformly sample points in the square, the fraction that land inside the circle converges to \(\pi/4\). The same averaging idea underpins MC value estimation.
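A minimal sketch of this estimator (the function name, seed, and sample count are my own choices):

```python
import numpy as np

def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square and
    counting the fraction that land in the inscribed circle (r = 0.5)."""
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(0.0, 1.0, size=(2, n_samples))
    inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25
    return 4.0 * inside.mean()

print(estimate_pi(1_000_000))  # converges toward 3.14159... as n_samples grows
```

With \(10^6\) samples the estimate is typically within about \(10^{-2}\) of \(\pi\); the error shrinks as \(1/\sqrt{n}\), the same rate that governs MC value estimates.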
Monte Carlo policy evaluation (for \(Q^\pi\)).
We initialize a policy \(\pi\) (often \(\varepsilon\)-soft). Run an episode:
\(\pi: S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T\)
Imagine a grid-world whose states are angle and angular rate (in practice the states could represent anything). We play one episode and obtain a trajectory (shaded dark to pale over time).

Work backward from the terminal state using returns (the return for a state-action pair is the discounted sum of rewards after taking that pair):
$$
G_t(s_t, a_t) = R_{t+1} + \gamma G_{t+1},
$$
with \(G_{t+1}\) the accumulated return beyond the next step. Example along a path:
$$
Q^\pi(\{6,7\}, A_{T-1}) = 5 + 0 = 5, \qquad
Q^\pi(\{5,7\}, A_{T-2}) = 5 + \gamma \cdot 5, \;\dots
$$
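In code, the backward sweep looks like this (a hypothetical helper; the reward sequence and \(\gamma\) are illustrative):

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * G_{t+1} for every step t,
    sweeping backward from the terminal reward."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Rewards R_1, R_2, R_3 of a three-step episode with gamma = 0.9:
# G_2 = 5, G_1 = 0 + 0.9 * 5 = 4.5, G_0 = 0 + 0.9 * 4.5 = 4.05
print(returns_from_rewards([0.0, 0.0, 5.0], gamma=0.9))
```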
Unifying samples across episodes.
For each state-action pair, store the list of observed returns and take the average:
$$
Q(S_t, A_t) = \text{Average}(G_t).
$$
Memory considerations.
\(N\) episodes of length \(M\) imply \(N \times M\) stored returns. To avoid unbounded memory, use an incremental mean:
$$
Q(s_t, a_t) = \frac{(N-1) \cdot Q_{\text{old}}(s_t, a_t)}{N} + \frac{G_t}{N},
$$
which only needs a running mean and a visit count per state-action pair.
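A quick check that the incremental form reproduces the stored-list average (the returns below are made up):

```python
returns = [2.0, 4.0, 9.0, 1.0]  # observed returns for one (s, a) pair

# Batch form: keep every return and average at the end.
batch = sum(returns) / len(returns)

# Incremental form: keep only a running mean Q and a visit count N.
Q, N = 0.0, 0
for G in returns:
    N += 1
    Q = (N - 1) * Q / N + G / N

print(batch, Q)  # both give the same mean (4.0 here)
```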
Fixed step size (forgetting factor).
Recent returns are often more informative than early ones. Replace \(1/N\) with a constant step size \(\alpha\):
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[G_t - Q(s_t, a_t)\big].
$$
This weights recent returns more heavily (each past return is discounted by a factor of \(1-\alpha\) per update), so early noisy returns stop anchoring the estimate as the policy changes.
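A minimal sketch of the forgetting effect, assuming the return distribution shifts mid-training (the helper name, step size, and numbers are illustrative):

```python
def constant_alpha_estimate(returns, alpha):
    """Run the constant step-size update Q <- Q + alpha * (G - Q)."""
    Q = 0.0
    for G in returns:
        Q += alpha * (G - Q)
    return Q

# Returns jump from 0 to 10 halfway through (e.g., the policy improved).
returns = [0.0] * 10 + [10.0] * 10

print(sum(returns) / len(returns))            # plain mean: 5.0, anchored by old returns
print(constant_alpha_estimate(returns, 0.3))  # ~9.7, tracks the recent level
```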
Monte Carlo policy iteration (on \(Q\)).
Combine Q-updates with policy improvement:
- Policy evaluation: update \(Q\) using the returns from the episode.
- Policy improvement: make the policy greedy with respect to the updated values, \(\pi(s) = \arg\max_a Q(s,a)\).
Greedy-only improvement can ignore unvisited but promising actions, so we soften the policy.
\(\varepsilon\)-soft policy.
$$
\pi(a \mid S_t) \leftarrow
\begin{cases}
1 - \varepsilon + \varepsilon / |A(S_t)|, & a = A^* \\
\varepsilon / |A(S_t)|, & a \neq A^*
\end{cases}
$$
This keeps exploring while favoring the current best action. In simulation you can also use exploring starts (sample random initial state-action pairs) to improve coverage. Empirically, as \(N \to \infty\), \(Q_\pi(s,a) \to Q_{\pi^*}(s,a)\) and \(\pi(a \mid s) \to \pi^*(a \mid s)\), though a full proof is still elusive.
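Building that distribution for a toy action set (a hypothetical helper; the Q-values are made up):

```python
import numpy as np

def eps_soft_probs(q_values, eps):
    """epsilon-soft distribution over actions: the greedy action gets
    1 - eps + eps/|A|, every other action gets eps/|A|."""
    n = len(q_values)
    probs = [eps / n] * n
    probs[int(np.argmax(q_values))] += 1 - eps
    return probs

p = eps_soft_probs([0.1, 0.5, 0.2], eps=0.1)
print(p)  # the middle (greedy) action carries most of the mass; probs sum to 1
```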
Properties and caveats.
- First technique in this series that learns directly from experience (no model required).
- Updates move backward from the episode's end, one step at a time.
- Sample inefficient: needs full episodes and only updates visited state-action pairs.
- Ambiguity can persist about which actions caused success (e.g., two pendulum trajectories ending at the same setpoint).
Worked example: Monte Carlo control on Blackjack (Gym).
Below is a walk-through of
python_scripts/rl_4_montecarlo_control.py, which learns \(Q(s,a)\) for the Blackjack-v0 environment using first-visit MC control with an \(\varepsilon\)-soft policy.
1) Imports and entry point. Main sets up the environment, agent, training loop, and win-rate tracking.
```python
import numpy as np
import gym
import matplotlib.pyplot as plt

if __name__ == "__main__":
    env = gym.make("Blackjack-v0")
    agent = Agent(eps=0.001)
    n_episodes = 200_000
    win_lose_draw = {-1: 0, 0: 0, 1: 0}
    win_rates = []
    for i in range(n_episodes):
        if i > 0 and i % 1000 == 0:
            win_rates.append(win_lose_draw[1] / i)
        if i % 50_000 == 0:
            print("starting episode", i,
                  "winrate %.3f" % (win_rates[-1] if win_rates else 0.0))
        state = env.reset()
        done = False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.memory.append((state, action, reward))
            state = next_state
        agent.update_Q()
        win_lose_draw[reward] += 1
    plt.plot(win_rates)
    plt.show()
```
2) Agent skeleton and stored data. The agent maintains \(Q\), per-visit returns, a first-visit flag per state-action, and episodic memory of (state, action, reward) tuples.
```python
class Agent:
    def __init__(self, eps=0.1, gamma=0.99):
        self.Q = {}
        self.sum_space = [i for i in range(4, 22)]
        self.dealer_show_card_space = [i + 1 for i in range(10)]
        self.ace_space = [False, True]
        self.action_space = [0, 1]  # stick or hit
        self.state_space = []
        self.returns = {}
        self.pairs_visited = {}
        self.memory = []
        self.gamma = gamma
        self.eps = eps
        self.init_vals()
        self.init_policy()
```
3) Initialize Q and an \(\varepsilon\)-soft policy. Enumerate every state (player sum, dealer showing, usable ace) and every action, seed Q to 0, and start with a uniform policy.
```python
    def init_vals(self):
        for total in self.sum_space:
            for card in self.dealer_show_card_space:
                for ace in self.ace_space:
                    state = (total, card, ace)
                    self.state_space.append(state)
                    for action in self.action_space:
                        self.Q[(state, action)] = 0.0
                        self.returns[(state, action)] = []
                        self.pairs_visited[(state, action)] = 0

    def init_policy(self):
        policy = {}
        n = len(self.action_space)
        for state in self.state_space:
            policy[state] = [1 / n for _ in range(n)]
        self.policy = policy
```
4) Action selection with the current policy. Draw an action from the \(\varepsilon\)-soft distribution.
```python
    def choose_action(self, state):
        return np.random.choice(self.action_space, p=self.policy[state])
```
5) First-visit MC update of Q and policy. For each first visit of a state-action pair in the episode, accumulate the discounted return and average it into Q. Then make the policy \(\varepsilon\)-greedy w.r.t. the updated Q.
```python
    def update_Q(self):
        # First-visit returns: only the first occurrence of each
        # (state, action) pair in the episode contributes a return.
        for idx, (state, action, _) in enumerate(self.memory):
            if self.pairs_visited[(state, action)] == 0:
                self.pairs_visited[(state, action)] += 1
                G = 0.0
                discount = 1.0
                for _, _, reward in self.memory[idx:]:
                    G += reward * discount
                    discount *= self.gamma
                self.returns[(state, action)].append(G)
        # Average the stored returns into Q, then refresh the policy.
        for state, action, _ in self.memory:
            self.Q[(state, action)] = np.mean(self.returns[(state, action)])
            self.update_policy(state)
        # Reset per-episode bookkeeping.
        for sa in self.pairs_visited:
            self.pairs_visited[sa] = 0
        self.memory = []

    def update_policy(self, state):
        actions = [self.Q[(state, a)] for a in self.action_space]
        a_max = int(np.argmax(actions))
        n_actions = len(self.action_space)
        self.policy[state] = [
            1 - self.eps + self.eps / n_actions if a == a_max else self.eps / n_actions
            for a in self.action_space
        ]
```
Value prediction instead of control.
The same Monte Carlo machinery can estimate a state-value function \(V^\pi(s)\) (prediction: evaluating a fixed policy) rather than a Q-function. The return from the first visit to state \(s_t\) in an episode is
$$
G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1},
$$
and the first-visit MC estimate is the sample average:
$$
V(s) \leftarrow \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s),
$$
or incrementally,
$$
V(s) \leftarrow V(s) + \alpha \big[G_t - V(s)\big].
$$
Below is a Blackjack example from
python_scripts/rl_4_montecarlo_prediction.py that learns \(V^\pi\) for a fixed (non-learning) policy.
1) Setup: value tables and a simple policy. We track \(V\), returns, and first-visit flags for every Blackjack state.
```python
import numpy as np
import gym

class Agent:
    def __init__(self, gamma=0.99):
        self.V = {}
        self.sum_space = [i for i in range(4, 22)]
        self.dealer_show_card_space = [i + 1 for i in range(10)]
        self.ace_space = [False, True]
        self.state_space = []
        self.returns = {}
        self.states_visited = {}
        self.memory = []
        self.gamma = gamma
        self.init_vals()

    def init_vals(self):
        for total in self.sum_space:
            for card in self.dealer_show_card_space:
                for ace in self.ace_space:
                    state = (total, card, ace)
                    self.V[state] = 0.0
                    self.returns[state] = []
                    self.states_visited[state] = 0
                    self.state_space.append(state)

    def policy(self, state):
        # Fixed policy: stick (0) on 20 or 21, otherwise hit (1).
        total, _, _ = state
        return 0 if total >= 20 else 1
```
2) First-visit Monte Carlo value update. For each state's first visit, accumulate the discounted return and average it into \(V\).
```python
    def update_V(self):
        # First-visit returns: only a state's first occurrence in the
        # episode contributes a return.
        for idx, (state, _) in enumerate(self.memory):
            if self.states_visited[state] == 0:
                self.states_visited[state] += 1
                G = 0.0
                discount = 1.0
                for _, reward in self.memory[idx:]:
                    G += reward * discount
                    discount *= self.gamma
                self.returns[state].append(G)
        for state, _ in self.memory:
            self.V[state] = np.mean(self.returns[state])
        # Reset per-episode bookkeeping.
        for state in self.state_space:
            self.states_visited[state] = 0
        self.memory = []
```
3) Training loop. Roll out many episodes under the fixed policy, then update \(V\) from the observed returns.
```python
if __name__ == "__main__":
    env = gym.make("Blackjack-v0")
    agent = Agent()
    n_episodes = 500_000
    for i in range(n_episodes):
        if i % 50_000 == 0:
            print("starting episode", i)
        state = env.reset()
        done = False
        while not done:
            action = agent.policy(state)
            next_state, reward, done, _ = env.step(action)
            agent.memory.append((state, reward))
            state = next_state
        agent.update_V()
    print(agent.V[(21, 3, True)])   # near win
    print(agent.V[(4, 1, False)])   # poor hand vs dealer ace
```