Temporal-Difference Learning

In the previous entries we explored how Monte Carlo Policy Iteration was able to learn directly from raw experience without requiring a model of the environment. This made Monte Carlo the first “true” reinforcement learning method we encountered: we act, collect trajectories, then use those trajectories to improve our value estimates and our policy. Unlike Dynamic Programming methods, which required a known transition model, Monte Carlo methods only needed samples from experience. But there was a catch: Monte Carlo methods require full episodes before learning anything. We must wait until a trajectory completes, compute the return: $$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$ and only then update our value function.

But do we really need to wait for an entire episode?

Monte Carlo learning implicitly assumes that to get a good estimate of the long-term return, we need to look all the way to the end of the episode. But we also spent a lot of time establishing something important:
If the problem satisfies the Markov property, the next state and reward depend only on the current state and action.
If that is true (or true enough), then perhaps we can get away with not waiting until the end of the episode. Maybe we can learn during the episode — one transition at a time — instead of only after it. That idea leads to Temporal Difference (TD) learning.

From Monte Carlo Updates to Temporal Difference Updates

Monte Carlo policy evaluation updated the value of a state using: $$ V(s_t) \leftarrow V(s_t) + \alpha \big[ G_t - V(s_t) \big], $$ where \(G_t\) is computed using full-episode returns. But if we want to learn every time step, we need a way to replace \(G_t\) with something available immediately. Note that: $$ G_t = R_{t+1} + \gamma G_{t+1}, $$ and the first part — the immediate reward — is always known after a single transition. For the rest of the return, we don’t have the true value… but we do have our current estimate of the next state’s value. Temporal Difference learning simply uses that: $$ G_t \approx R_{t+1} + \gamma V(s_{t+1}). $$ Plugging this approximation into the Monte Carlo update gives: $$ V(s_t) \leftarrow V(s_t) + \alpha \big[ R_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]. $$ The term: $$ \delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t) $$ is called the Temporal Difference error. It measures how “surprised” we are by what actually happened relative to what we previously believed. This learning procedure is called TD(0) because it uses only a single step of lookahead. It learns online, incrementally, and does not require a model.
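To make the update concrete, here is a minimal tabular TD(0) prediction sketch on a toy two-state chain (the chain, its rewards, and the hyperparameters are illustrative, not from the text):

```python
# TD(0) prediction on a toy chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
# With gamma = 0.9 the true values are V(s1) = 1.0 and V(s0) = 0.9.
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0, None: 0.0}  # None marks the terminal state

def step(s):
    """Toy deterministic dynamics: returns (reward, next_state)."""
    return (0.0, 1) if s == 0 else (1.0, None)

for _ in range(500):
    s = 0
    while s is not None:
        r, s_next = step(s)
        td_error = r + gamma * V[s_next] - V[s]  # delta_t
        V[s] += alpha * td_error                 # V(s) <- V(s) + alpha * delta_t
        s = s_next

print(round(V[0], 2), round(V[1], 2))  # approaches 0.9 1.0
```

Note that \(V(s_0)\) is updated mid-episode, before the terminal reward is ever seen; this is exactly the bootstrapping that Monte Carlo cannot do.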

From State-Values to Action-Values

So far we discussed updating the state value function \(V(s)\). But acting optimally requires reasoning about actions, so TD learning is commonly applied to the action-value function: $$ Q(s,a). $$ Once we work with \(Q(s,a)\), we can fold the policy directly into the update rule. This is where SARSA appears as a natural TD control algorithm.
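Once we have \(Q(s,a)\), acting greedily in a state is just an argmax over actions, with no model of the environment required (a toy illustration; the Q values here are arbitrary):

```python
Q = {(0, 0): 0.2, (0, 1): 0.7}  # arbitrary values: one state, two actions
actions = [0, 1]

def greedy_action(Q, state, actions):
    # pi(s) = argmax_a Q(s, a): pick the action with the highest estimated value
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, 0, actions))  # 1
```

This is why control algorithms work with \(Q(s,a)\) rather than \(V(s)\): extracting a policy from \(V(s)\) alone would require a transition model to look one step ahead.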

SARSA: On-Policy Temporal Difference Learning

SARSA is named after the quintuple it uses in its update: (State, Action, Reward, Next State, Next Action). At each time step:
  • We are in state \(s_t\).
  • We pick action \(a_t\) using our current (usually ε-greedy) policy.
  • We observe reward \(R_{t+1}\) and next state \(s_{t+1}\).
  • We pick the next action \(a_{t+1}\) from the same policy.
  • We update \(Q(s_t,a_t)\) toward the one-step TD target that uses \(Q(s_{t+1}, a_{t+1})\).
The SARSA update is: $$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \Big[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t) \Big]. $$ Because both the behavior policy (used to generate data) and the target policy (used in the update) are the same, SARSA is an on-policy method. It learns how good it is to follow its actual exploration strategy. In environments where exploratory actions can be risky, SARSA often converges to safer policies, because it “bakes in” the consequences of exploration.
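The update can be sketched as a single function over a dict keyed by (state, action) pairs, the same tabular style used in the tutorial below (the numbers are arbitrary):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: move Q(s,a) toward the on-policy target r + gamma * Q(s', a')."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {(s, a): 0.0 for s in range(2) for a in range(2)}
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
print(Q[(0, 1)])  # 0.0 + 0.1 * (1.0 + 0.99 * 0.0 - 0.0) = 0.1
```

Notice that `a_next` must already be chosen before the update runs; the policy that picks it is the same policy being evaluated.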

Q-learning: an optimistic variant of SARSA

Q-learning has a similar structure to SARSA, but makes a key change: instead of updating toward the value of the actual next action \(a_{t+1}\), it updates toward the value of the best possible next action: $$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t,a_t) \Big]. $$ This makes Q-learning an off-policy TD control algorithm:
  • The behavior policy can be ε-greedy (or anything that explores enough).
  • The target in the update always assumes a greedy policy in the future (via the max over actions).
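The difference between the two targets can be made concrete with a small sketch (the Q values and rewards are arbitrary):

```python
def q_learning_target(Q, r, s_next, actions, gamma=0.99):
    # off-policy: bootstrap from the best next action, regardless of behavior
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # on-policy: bootstrap from the next action actually taken
    return r + gamma * Q[(s_next, a_next)]

Q = {(1, 0): 2.0, (1, 1): 5.0}
print(q_learning_target(Q, 1.0, 1, [0, 1]))  # 1.0 + 0.99 * 5.0
print(sarsa_target(Q, 1.0, 1, 0))            # 1.0 + 0.99 * 2.0
```

When the exploratory action (here action 0) is worse than the greedy one, the two targets diverge: Q-learning keeps crediting the greedy future, SARSA pays for the exploration.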
Q-learning is often described as an “optimistic” version of SARSA: it updates as if the agent will behave greedily in the next state, even if exploration actually chooses something else during training. Below we walk through a complete tabular Q-learning implementation for the CartPole environment, broken into pieces to show how the code instantiates the Q-learning update.

Q-learning tutorial: CartPole control in Python

We will work with a tabular Q-learning agent. That means we need:
  • categorical (discrete) states,
  • a table \(Q(s,a)\) for each discrete state–action pair,
  • an exploration strategy (ε-greedy), and
  • a training loop over many episodes.

1. Imports and entry point: how the simulation runs

Imports and __main__ block

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np

def plot_learning_curve(scores, x):
    running_avg = np.zeros(len(scores))
    for i in range(len(running_avg)):
        running_avg[i] = np.mean(scores[max(0, i-100):(i+1)])
    plt.plot(x, running_avg)
    plt.title('Running average of previous 100 scores')
    plt.savefig('qlearning_control_cartpole.png')


if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    n_games = 50000
    eps_dec = 2 / n_games

    digitizer = CartPoleStateDigitizer()
    agent = Agent(
        alpha=0.01,
        gamma=0.99,
        n_actions=2,
        epsilon=1.0,
        eps_min=0.01,
        eps_dec=eps_dec,
        state_space=digitizer.states
    )

    scores = []

    for i in range(n_games):
        observation, _ = env.reset()
        done = False
        score = 0
        state = digitizer.digitize(observation)

        while not done:
            action = agent.choose_action(state)
            observation_, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state_ = digitizer.digitize(observation_)
            agent.learn(state, action, reward, state_)
            state = state_
            score += reward

        if i % 5000 == 0:
            print('episode', i, 'score %.1f' % score, 'epsilon %.2f' % agent.epsilon)

        agent.epsilon_decrement()
        scores.append(score)

    x = [i + 1 for i in range(n_games)]
    plot_learning_curve(scores, x)
The flow is:
  1. Create the CartPole-v0 environment.
  2. Set the number of training episodes (n_games = 50000).
  3. Set the rate at which we decay ε (eps_dec), going from 1.0 down to around 0.01 over training.
  4. Instantiate the CartPoleStateDigitizer (to discretize continuous observations).
  5. Instantiate the Agent with learning rate alpha, discount gamma, etc.
  6. For each episode:
    • Reset the env, digitize the initial observation into a discrete state.
    • Loop until episode termination:
      • Choose an action via ε-greedy policy.
      • Step the environment, get next observation and reward.
      • Digitize the new observation into next state.
      • Call agent.learn(state, action, reward, state_) to apply the Q-learning update.
      • Move to the next state and accumulate episode reward.
    • Periodically print episode stats and current ε.
    • Decay ε via agent.epsilon_decrement().
    • Store each episode’s total score.
  7. After training, plot the running average score over the last 100 episodes.
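As a quick sanity check on the ε schedule in step 3, we can mirror the decrement logic directly (an illustrative calculation, not part of the script):

```python
# How fast does epsilon anneal? Mirrors Agent.epsilon_decrement with the
# values used in the main loop (n_games = 50000, eps_dec = 2 / n_games).
n_games = 50000
eps_dec = 2 / n_games      # 4e-5 subtracted per episode
epsilon, eps_min = 1.0, 0.01

episodes_to_min = 0
while epsilon > eps_min:
    epsilon = max(epsilon - eps_dec, eps_min)
    episodes_to_min += 1

print(episodes_to_min)  # roughly n_games / 2: epsilon bottoms out about halfway through
```

So the agent spends the first half of training gradually shifting from pure exploration to near-pure exploitation, and the second half refining its Q-table almost greedily.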
Everything else in the script exists to support this main loop. Next we look at how we turn continuous CartPole observations into discrete states.

2. Digitizing CartPole state: making tabular Q-learning possible

CartPole observations are continuous: \((x, \dot x, \theta, \dot\theta)\). A tabular Q-learner needs discrete state indices. The CartPoleStateDigitizer class discretizes each dimension into bins and returns a tuple of bin indices.

CartPoleStateDigitizer

class CartPoleStateDigitizer():
    def __init__(self, bounds=(2.4, 4, 0.209, 4), n_bins=10):
        self.position_space = np.linspace(-1 * bounds[0], bounds[0], n_bins)
        self.velocity_space = np.linspace(-1 * bounds[1], bounds[1], n_bins)
        self.pole_angle_space = np.linspace(-1 * bounds[2], bounds[2], n_bins)
        self.pole_velocity_space = np.linspace(-1 * bounds[3], bounds[3], n_bins)
        self.states = self.get_state_space()

    def get_state_space(self):
        states = []
        for i in range(len(self.position_space) + 1):
            for j in range(len(self.velocity_space) + 1):
                for k in range(len(self.pole_angle_space) + 1):
                    for l in range(len(self.pole_velocity_space) + 1):
                        states.append((i, j, k, l))
        return states

    def digitize(self, observation):
        x, x_dot, theta, theta_dot = observation
        cart_x = int(np.digitize(x, self.position_space))
        cart_xdot = int(np.digitize(x_dot, self.velocity_space))
        pole_theta = int(np.digitize(theta, self.pole_angle_space))
        pole_thetadot = int(np.digitize(theta_dot, self.pole_velocity_space))
        return (cart_x, cart_xdot, pole_theta, pole_thetadot)
Key points:
  • Bin edges: each physical quantity is sliced into n_bins equal-width buckets within reasonable bounds.
  • State space: self.states enumerates all tuples \((i,j,k,l)\) of bin indices; this is the discrete state space used to initialize the Q-table.
  • digitize: np.digitize maps each real-valued observation component to an integer bin index. The result is a categorical state suitable for tabular Q-learning.
Conceptually, we are approximating the continuous MDP with a finite MDP by clustering similar observations into the same discrete state.
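A quick illustration of how np.digitize behaves at and beyond the bin edges (a one-dimensional sketch mirroring position_space; the probe values are arbitrary):

```python
import numpy as np

# 10 bin edges between -2.4 and 2.4, mirroring position_space above
edges = np.linspace(-2.4, 2.4, 10)

print(int(np.digitize(0.0, edges)))   # an interior bin index
print(int(np.digitize(-3.0, edges)))  # below every edge -> index 0
print(int(np.digitize(3.0, edges)))   # above every edge -> index 10
```

With 10 edges there are 11 possible indices (0 through 10), which is why get_state_space iterates over range(n_bins + 1) in each dimension.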

3. The Agent: Q-table, ε-greedy policy, and the Q-learning update

The Agent encapsulates the Q-learning logic: it holds the Q-table, chooses actions ε-greedily, and applies the update: $$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t,a_t) \Big]. $$

Agent class

class Agent():
    def __init__(self, alpha, gamma, n_actions, state_space,
                 epsilon=1, eps_dec=5e-4, eps_min=0.01):
        self.alpha = alpha
        self.gamma = gamma
        self.state_space = state_space
        self.actions = [i for i in range(n_actions)]
        self.epsilon = epsilon
        self.eps_dec = eps_dec
        self.eps_min = eps_min
        self.Q = {}
        self.init_Q()

    def init_Q(self):
        for state in self.state_space:
            for action in self.actions:
                self.Q[(state, action)] = 0.0

    def max_action(self, state):
        actions = np.array([self.Q[(state, a)] for a in self.actions])
        action = np.argmax(actions)
        return action

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.choice(self.actions)
        return self.max_action(state)

    def epsilon_decrement(self):
        if self.epsilon > self.eps_min:
            self.epsilon = self.epsilon - self.eps_dec
        else:
            self.epsilon = self.eps_min

    def learn(self, state, action, reward, state_):
        # Q-learning update: move Q(s,a) toward R + gamma * max_a' Q(s', a')
        a_max = self.max_action(state_)
        q_next = reward + self.gamma * self.Q[(state_, a_max)]
        self.Q[(state, action)] += self.alpha * (q_next - self.Q[(state, action)])
Mapping this to the Q-learning equation:
  • Q-table: self.Q[(state, action)] represents \(Q(s,a)\).
  • Action space: self.actions is the set of possible actions \(\mathcal{A}\).
  • Greedy action: max_action(state) computes \(\arg\max_a Q(s,a)\).
  • ε-greedy policy: choose_action:
    • With probability ε, pick a random action (exploration).
    • Otherwise, pick the greedy action (exploitation).
  • Annealing ε: epsilon_decrement slowly reduces ε from 1.0 down to eps_min, shifting from exploration to exploitation.
  • Learning step: learn:
    • a_max = self.max_action(state_) computes \(\arg\max_{a'} Q(s_{t+1}, a')\).
    • q_next = reward + self.gamma * self.Q[(state_, a_max)] computes the target \(R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')\).
    • self.Q[(state, action)] += self.alpha * (q_next - self.Q[(state, action)]) applies the Q-learning update.
If we wanted this to implement SARSA instead, the only conceptual change would be in learn: we would pass the actual next action \(a_{t+1}\) into learn and update toward \(R_{t+1} + \gamma Q(s_{t+1}, a_{t+1})\) instead of using \(\max_{a'} Q(s_{t+1}, a')\).

Figure: Q-learning CartPole learning curve, showing the score as a function of episodes.

Summary

  • Monte Carlo methods learn only after full episodes; TD methods learn every time step.
  • SARSA is an on-policy TD control algorithm: $$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [R_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)].$$
  • Q-learning is an off-policy TD control algorithm and an “optimistic” variant: $$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [R_{t+1} + \gamma \max_{a'}Q(s_{t+1},a') - Q(s_t,a_t)].$$
  • Tabular Q-learning requires discrete states; we used a digitizer to approximate CartPole’s continuous dynamics with a finite MDP.
  • The Q-learning update is implemented in Agent.learn, directly matching the mathematical form.
From here, the natural next step is to replace the Q-table with a function approximator (e.g., a neural network), leading to Deep Q-Networks (DQN) and beyond.