The Bellman Equation

Bandits capture the exploration-exploitation trade-off in a setting where actions do not affect the future. “True” reinforcement learning begins when actions influence which state you end up in next. The standard mathematical model for this is the Markov Decision Process (MDP), and the central organizing principle is the Bellman equation.

Markov Decision Processes.

An MDP is a 5-tuple: $$ \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma) $$
  • \(\mathcal{S}\): state space
  • \(\mathcal{A}\): action space
  • \(P(s' \mid s,a)\): transition model (probability of next state \(s'\))
  • \(R(s,a)\): expected immediate reward
  • \(\gamma \in [0,1)\): discount factor
The defining assumption is the Markov property: conditioned on the current state and action, the next state is independent of the past. $$ P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t) $$ The discount factor \(\gamma\) controls how much we care about the future. When \(\gamma\) is close to 0, the agent is myopic; when \(\gamma\) is close to 1, long-horizon consequences matter strongly. In infinite-horizon settings, \(\gamma<1\) also ensures the infinite sum of rewards converges.
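A quick numerical sanity check on that last point (a minimal sketch; the constant-reward stream is a made-up illustration): with a constant reward of 1 per step, the discounted sum converges to \(1/(1-\gamma)\).

```python
# Discounting a constant reward of 1 per step: partial sums approach 1 / (1 - gamma).
gamma = 0.95
horizon = 1000  # long enough that gamma**horizon is negligible

partial_sum = sum(gamma**k for k in range(horizon))
closed_form = 1.0 / (1.0 - gamma)

print(partial_sum, closed_form)  # both close to 20.0
```

This also gives a useful intuition: \(1/(1-\gamma)\) acts as an "effective horizon," about 20 steps for \(\gamma = 0.95\).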

Trajectories and return.

Under a policy \(\pi\), the interaction produces a trajectory \((s_0,a_0,r_0,s_1,a_1,r_1,\dots)\). The discounted return from time \(t\) is: $$ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}. $$ Reinforcement learning is typically framed as finding a policy that maximizes expected return.
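Concretely, computing \(G_0\) for a short reward sequence is a one-liner (the reward sequence below is made up for illustration):

```python
# Discounted return G_0 for a truncated reward sequence.
gamma = 0.95
rewards = [0.0, 0.0, 1.0]  # a reward of +1 arrives two steps in

G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)  # 0.95**2 = 0.9025
```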

Policies.

A (stochastic) policy is a conditional distribution over actions given a state: $$ \pi(a\mid s) = P(a_t=a \mid s_t=s). $$ A deterministic policy is a special case where \(\pi(a\mid s)\) puts all probability mass on one action, often written as \(\pi(s)\in\mathcal{A}\).
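Sampling actions from a stochastic policy is just sampling from a categorical distribution; the probabilities below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up stochastic policy pi(a | s) over two actions in some state s.
pi = {"left": 0.2, "right": 0.8}
actions = list(pi.keys())
probs = list(pi.values())

# Drawing many actions recovers the policy's probabilities.
draws = rng.choice(actions, size=10_000, p=probs)
print((draws == "right").mean())  # roughly 0.8
```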

Value functions.

The state-value function under policy \(\pi\) is: $$ V^\pi(s) = \mathbb{E}_\pi \left[G_0 \mid s_0=s\right] = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r_t \,\middle|\, s_0=s\right]. $$ The action-value function (often more directly useful in control) is: $$ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r_t \,\middle|\, s_0=s, a_0=a\right]. $$ The optimal value functions are: $$ V^*(s)=\max_\pi V^\pi(s), \qquad Q^*(s,a)=\max_\pi Q^\pi(s,a). $$ And an optimal policy can be recovered by acting greedily with respect to \(Q^*\): $$ \pi^*(s)\in \arg\max_a Q^*(s,a). $$

Bellman expectation equations.

The key idea is that value functions satisfy a one-step recursion. For a fixed policy \(\pi\), we can “peel off” the first reward and then the rest is the value of the next state: $$ V^\pi(s) = \sum_a \pi(a\mid s)\Bigg[ R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V^\pi(s') \Bigg]. $$ Similarly: $$ Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\sum_{a'}\pi(a'\mid s')Q^\pi(s',a'). $$ It is often useful to note the relationship between \(V^\pi\) and \(Q^\pi\): $$ V^\pi(s)=\sum_a \pi(a\mid s)Q^\pi(s,a). $$

Bellman optimality equations.

If we want the best possible policy, we can replace the policy expectation with a max: $$ V^*(s)=\max_a \left[ R(s,a) + \gamma \sum_{s'}P(s'\mid s,a)V^*(s') \right]. $$ And: $$ Q^*(s,a)=R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\max_{a'}Q^*(s',a'). $$ These equations define the optimal solution as a fixed point. The rest of RL is about computing or approximating that fixed point.

Bellman operators and convergence (why iteration works).

Define the policy evaluation operator \(T^\pi\): $$ (T^\pi V)(s) = \sum_a \pi(a\mid s)\left[R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V(s')\right]. $$ And the optimality operator \(T^*\): $$ (T^* V)(s)=\max_a\left[R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V(s')\right]. $$ A crucial fact: for \(\gamma\in[0,1)\), both operators are \(\gamma\)-contractions under the sup norm: $$ \|T V - T W\|_\infty \le \gamma \|V-W\|_\infty. $$ This implies:
  • Each operator has a unique fixed point: \(V^\pi\) for \(T^\pi\), and \(V^*\) for \(T^*\).
  • Repeated application converges: \(V_{k+1}=T^\pi V_k\) converges to \(V^\pi\), and \(V_{k+1}=T^* V_k\) converges to \(V^*\).
This contraction property is one of the main reasons discounted infinite-horizon MDPs are mathematically well-behaved.
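The contraction property can be verified numerically. The sketch below builds a random MDP (the sizes and seed are arbitrary, not the grid example from later in this section) and checks that applying \(T^*\) to two arbitrary value vectors shrinks their sup-norm distance by at least a factor of \(\gamma\):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9

# Random transition tensor P[a, s, s'] (rows normalized to sum to 1) and rewards R[a, s].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def T_star(V):
    # (T* V)(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
    return np.max(R + gamma * (P @ V), axis=0)

V = rng.random(n_states)
W = rng.random(n_states)

before = np.max(np.abs(V - W))
after = np.max(np.abs(T_star(V) - T_star(W)))
print(after <= gamma * before)  # True: T* is a gamma-contraction
```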

Dynamic programming teaser.

If the transition model \(P\) and reward function \(R\) are known, we can solve the MDP with dynamic programming:
  • Policy evaluation: compute \(V^\pi\) (a prediction problem).
  • Policy improvement: make the policy greedy with respect to \(V^\pi\) (a control step).
  • Alternating these yields policy iteration.
  • Applying \(T^*\) repeatedly yields value iteration.

Worked MDP example: small stochastic grid.

We will use a 3-state MDP:
  • State 0: terminal “bad” (reward 0)
  • State 1: decision state
  • State 2: terminal “good” (reward +1)
From state 1, you can choose actions left or right. The outcome is slightly stochastic: with probability 0.9 you go the intended direction, and with probability 0.1 you slip to the opposite terminal. This makes the Bellman expectations non-trivial.

import numpy as np

states = [0, 1, 2]
actions = ["left", "right"]
gamma = 0.95

# Transition model P[s][a] -> list of (prob, s_next, reward)
# Terminal states self-loop with zero reward.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 0, 0.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},
    1: {
        "left":  [(0.9, 0, 0.0), (0.1, 2, 1.0)],
        "right": [(0.9, 2, 1.0), (0.1, 0, 0.0)]
    }
}

def bellman_backup(V, s, a):
    # One-step backup: sum_{s'} P(s'|s,a) * [ r(s,a,s') + gamma * V(s') ],
    # i.e. R(s,a) + gamma * E[V(s')] with the reward folded into each transition.
    return sum(prob * (r + gamma * V[s_next]) for (prob, s_next, r) in P[s][a])

1) Value iteration.

Value iteration applies the Bellman optimality operator \(T^*\) repeatedly: $$ V_{k+1}(s) \leftarrow \max_a \left[ R(s,a) + \gamma \sum_{s'}P(s'\mid s,a)V_k(s') \right]. $$

V = np.zeros(len(states))

for k in range(50):
    V_new = V.copy()

    # Only state 1 is non-terminal in this toy MDP
    s = 1
    V_new[s] = max(bellman_backup(V, s, a) for a in actions)

    # Check convergence (optional)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break

    V = V_new

# Extract greedy policy from V
q_left = bellman_backup(V, 1, "left")
q_right = bellman_backup(V, 1, "right")
pi_star = "right" if q_right > q_left else "left"

print("V*:", V)
print("Greedy policy at state 1:", pi_star)
print("Q*(1,left), Q*(1,right):", q_left, q_right)

Interpretation: \(V^*(1)\) is the expected discounted return from the decision state when acting optimally. Because the reward is obtained upon entering the “good” terminal, \(V^*(1)\) reflects both the success probability and the discounting.

2) Policy iteration (in depth).

Policy iteration alternates:
  • Policy evaluation: solve \(V^\pi = T^\pi V^\pi\).
  • Policy improvement: update \(\pi(s)\in\arg\max_a \left[R(s,a)+\gamma\sum_{s'}P(s'|s,a)V^\pi(s')\right]\).
Policy evaluation can be done by iterative backups (like above), or by solving a linear system. In finite MDPs, the Bellman expectation equation can be written: $$ V^\pi = R^\pi + \gamma P^\pi V^\pi \quad\Rightarrow\quad (I - \gamma P^\pi) V^\pi = R^\pi. $$ Below is a compact policy-iteration implementation using iterative policy evaluation (keeps the code readable and mirrors how it’s taught).
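For the toy grid, the linear-system route is a few lines of numpy. The matrices below are written out by hand for the deterministic policy that picks "left" at state 1 (terminals self-loop with zero reward, as in the transition model above):

```python
import numpy as np

gamma = 0.95

# Under the policy "left" at state 1: P^pi[s, s'] and expected one-step reward R^pi[s].
P_pi = np.array([
    [1.0, 0.0, 0.0],   # state 0: terminal self-loop
    [0.9, 0.0, 0.1],   # state 1: "left" reaches state 0 w.p. 0.9, slips to state 2 w.p. 0.1
    [0.0, 0.0, 1.0],   # state 2: terminal self-loop
])
R_pi = np.array([0.0, 0.1, 0.0])  # expected reward at state 1: 0.1 * (+1) from the slip

# Solve (I - gamma * P^pi) V = R^pi directly.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V_pi)  # V^pi(1) = 0.1: "left" only reaches the +1 terminal by slipping
```

This exact solve is a useful cross-check on the iterative evaluation below: both should agree on \(V^\pi\).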

# Initialize a deterministic policy at state 1
policy = {1: "left"}
V = np.zeros(len(states))

def policy_evaluation(policy, V, iters=50, tol=1e-10):
    # Iterative evaluation: V <- T^pi V, stopping early once the backup stops changing V
    for _ in range(iters):
        V_new = V.copy()
        s = 1
        a = policy[s]
        V_new[s] = bellman_backup(V, s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def policy_improvement(V):
    # Greedy improvement at state 1
    q = {a: bellman_backup(V, 1, a) for a in actions}
    return "right" if q["right"] > q["left"] else "left"

for _ in range(10):
    V = policy_evaluation(policy, V, iters=50)
    new_action = policy_improvement(V)

    if new_action == policy[1]:
        break
    policy[1] = new_action

print("Policy-iteration policy:", policy)
print("Value under policy:", V)

Policy iteration often converges in very few outer loops because each improvement step makes the policy strictly better unless it is already optimal. Conceptually, this is an instance of Generalized Policy Iteration (GPI): alternate between improving the value estimate and improving the policy.

Value iteration vs policy iteration.

  • Value iteration performs “partial policy evaluation” implicitly by applying \(T^*\) directly. It is simple and guaranteed to converge, but can require many iterations for tight tolerance.
  • Policy iteration evaluates the current policy more fully, then improves it. It often converges in fewer outer loops, but each evaluation step can be more expensive.

Where model-free RL fits.

Everything above assumed we know \(P\) and \(R\). In most RL problems we do not. Model-free methods replace expectations like \(\sum_{s'}P(s'|s,a)V(s')\) with samples from experience, yielding updates such as: $$ V(s_t) \leftarrow V(s_t) + \alpha \left(r_t + \gamma V(s_{t+1}) - V(s_t)\right) $$ That is a temporal-difference (TD) update, which can be read as a stochastic approximation to a Bellman backup. Once the Bellman picture is clear, many RL algorithms become “different ways of approximating Bellman consistency.”
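As a closing sketch, here is tabular TD(0) evaluating the "right" policy on the grid MDP from above, using only sampled transitions. The simulator function and the number of updates are illustrative choices, not from the text; the decaying step size is one simple way to make the estimate settle.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95

def sample_transition():
    # Illustrative simulator for state 1 under the policy "right":
    # reach the +1 terminal (state 2) w.p. 0.9, slip to state 0 w.p. 0.1.
    if rng.random() < 0.9:
        return 2, 1.0
    return 0, 0.0

V = np.zeros(3)  # terminal values stay fixed at 0

for n in range(1, 5001):
    s = 1
    s_next, r = sample_transition()
    alpha = 1.0 / n  # decaying step size so the estimate converges
    # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(V[1])  # close to V^pi(1) = 0.9
```

Note that no transition probabilities appear anywhere in the update: the expectation over \(s'\) in the Bellman backup has been replaced by samples, which is exactly the model-free move described above.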