Dynamic Programming for Reinforcement Learning

In the last post of this series, we introduced the Bellman equations as the core conceptual tool for reasoning about value in reinforcement learning (RL). The Bellman equation operates on the premise that a state’s value can be decomposed into immediate reward plus the value of subsequent states. This recursive structure is powerful, but it also raises a critical computational question: how do we efficiently compute these values without getting bogged down in an exponential number of future possibilities? In early reinforcement learning problems, the most tempting idea is also the most dangerous: if we want the best behavior for the next \(N\) steps, why not just look ahead at everything that could happen, score each possible future, and pick the best? This is the “brute force value estimation” mindset. It’s conceptually clean. It also explodes computationally. Let's first explore brute force search and its pitfalls, then introduce dynamic programming (DP) as a principled way to sidestep the combinatorial explosion.

RL interaction loop

At time \(t\), the agent chooses an action \(a_t\) from a policy \(\pi\), the environment transitions, and the agent receives a reward \(R_t\). This is the basic feedback loop that everything else builds on; brute force search and every other strategy rely on this Markov Decision Process (MDP) structure. The image below describes how an intelligent agent interacts with its environment over discrete time steps.

*(Figure: the Markov Decision Process interaction loop)*

From this starting point, we can formalize the agent's decision-making process and the environment's response as follows: $$ a_t = \arg\max_{a\in\mathcal{A}} \pi(a \mid o_{t-1}), \qquad s_{t+1} = f(s_t, a_t). $$ You can read \(f(\cdot)\) as “the dynamics model.” In model-based settings we assume we know it (or can simulate it); in model-free settings we do not.
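The loop above can be sketched in a few lines of Python. Everything here is a made-up toy: a hypothetical 2-state environment with deterministic dynamics `step` (playing the role of \(f\)) and a fixed `policy` (playing the role of \(\pi\)).

```python
def step(s, a):
    """Toy deterministic dynamics f(s, a): action 1 toggles the state,
    action 0 keeps it; reward 1.0 for landing in state 1."""
    s_next = s ^ a
    r = 1.0 if s_next == 1 else 0.0
    return s_next, r

def policy(s):
    """A fixed policy pi(a | s): always try to move toward state 1."""
    return 0 if s == 1 else 1

# The agent-environment interaction loop over discrete time steps.
s = 0
total = 0.0
for t in range(5):
    a = policy(s)       # agent picks an action
    s, r = step(s, a)   # environment transitions and emits a reward
    total += r

print(total)  # 5.0 -- every step ends in the rewarding state
```

The point is only the shape of the loop: choose \(a_t\), apply \(f\), collect \(R_t\), repeat.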

The Bellman Equation: value is “reward now + value later”

A compact value recursion you’ll see early is: $$ V(s_i) = R_i + \gamma V(s_{i+1}). $$ This is a simple way to internalize the core idea: the value of being in a state is the immediate reward plus a discounted version of what comes next. The “full” Bellman expectation equation makes the probabilities explicit: $$ V^\pi(s)=\sum_a \pi(a\mid s)\sum_{s',r} p(s',r\mid s,a)\left[r + \gamma V^\pi(s')\right]. $$ The full equation makes clearer how a state's value is computed: it relies on knowing the outcomes of taking every possible action in that state, weighted by the policy's action probabilities and the environment's transition probabilities. In other words, a true value estimate requires knowing the result of going down every fork in the road.
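The compact recursion is easy to unroll by hand. As a sketch, take a three-step deterministic chain with invented rewards and work backward from the terminal state (whose value is zero):

```python
# Unrolling V(s_i) = R_i + gamma * V(s_{i+1}) on a short deterministic
# chain. The rewards are made up purely for illustration.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]   # R_0, R_1, R_2; the chain terminates after s_3

V = 0.0                     # terminal value
for R in reversed(rewards):
    V = R + gamma * V       # value now = reward now + discounted value later

print(V)   # V(s_0) = 1 + 0.9*(0 + 0.9*(2 + 0.9*0)) ≈ 2.62
```

Working backward like this, each state's value is computed once from its successor, which is exactly the reuse-of-subproblems idea dynamic programming will exploit below.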

Brute force search for \(V^\pi(s)\)

Brute force mindset. If you want a policy that does the best actions over the next \(N\) steps: just look ahead over all possibilities for those \(N\) steps and choose the best one. Even the one-step picture is intuitive: you start at a state, branch over actions, branch to the next states, and back up the best outcome to define \(V(s)\).

*(Figure: one-step brute force lookahead)*

Where it breaks. As soon as you look ahead more than one step, the tree balloons. Each extra timestep multiplies the number of future branches you must consider.

*(Figure: the lookahead tree after multiple timesteps)*

The important scaling facts are:

- it scales multiplicatively with the number of actions,
- it scales multiplicatively with the number of states, and
- it scales exponentially with how many timesteps you look ahead.

To brute-force a Bellman-style lookahead, you typically need:

1. a dynamical model that predicts \(s'\) and reward \(r\);
2. the Markov property (next state depends only on current state and action);
3. enough compute to actually search the tree.

If we denote:

- \(N\): planning horizon (timesteps looked ahead),
- \(n_x\): number of states (or discretized states),
- \(n_u\): number of actions,

then the brute-force optimization cost is often summarized as: $$ \mathcal{O}\big(n_x (n_x n_u)^N\big). $$ A toy number from the lecture story: look ahead 10 timepoints with 10 states and 3 actions. That gives \(10 \cdot 30^{10} \approx 5.9\times 10^{15}\) possibilities. This is already dead on arrival.
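To make the blow-up concrete, here is a sketch that first reproduces the toy count, then runs a literal brute-force lookahead on a tiny invented MDP (the dynamics `f` and `reward` are arbitrary, not from the lecture):

```python
from itertools import product

# The toy count from above: 10 states, 3 actions, horizon 10.
n_x, n_u, N = 10, 3, 10
branches = n_x * (n_x * n_u) ** N
print(branches)            # 5904900000000000, i.e. ~5.9e15

# A literal brute-force backup on a *tiny* deterministic MDP:
# enumerate every length-N action sequence and keep the best return.
def f(s, a):
    return (s + a) % 3          # toy deterministic dynamics

def reward(s, a):
    return 1.0 if s == 0 else 0.0

def brute_force_value(s0, horizon, actions=(0, 1)):
    best = float("-inf")
    for seq in product(actions, repeat=horizon):   # |A|^horizon sequences
        s, total = s0, 0.0
        for a in seq:
            total += reward(s, a)
            s = f(s, a)
        best = max(best, total)
    return best

print(brute_force_value(0, horizon=6))   # 6.0: stay in state 0 every step
```

At horizon 6 with 2 actions this enumerates only 64 sequences; every extra step doubles the work, which is exactly the exponential dependence on \(N\).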

This is the core motivation: “true value estimation” by explicitly enumerating futures becomes infeasible extremely quickly. So the real question is: can we keep the Bellman logic without paying exponential time in \(N\)?

Dynamic programming: break the problem into subproblems

Dynamic programming changes the computation, not the goal. Instead of searching all length-\(N\) futures, DP repeatedly solves a simpler subproblem: “Given my current estimate of the future, what is the best one-step backup for each state?” Then it repeats those backups until the values stabilize. This is the first major pattern you see in reinforcement learning: alternate between improving values and improving the policy.

*(Figure: Markov Decision Process)*

This alternating pattern is called generalized policy iteration (GPI):

- Policy evaluation: update \(V\) for a fixed \(\pi\).
- Policy improvement: update \(\pi\) to be greedier w.r.t. the latest \(V\).

DP is “early RL” largely because it is clean and principled under strong assumptions: the model is known (or you have a perfect simulator), and you can enumerate states/actions (tabular setting).

Policy evaluation (Bellman expectation backup)

Policy evaluation. If the policy \(\pi\) is fixed, we update value estimates by applying a Bellman backup: $$ V_{k+1}(s)=\sum_a \pi(a\mid s)\sum_{s',r} p(s',r\mid s,a)\left[r + \gamma V_k(s')\right]. $$ Intuition: you’re averaging over all actions the policy might take and all outcomes the environment might produce. This is where knowing the model matters: DP is literally computing the expectation.

*(Figures: successive sweeps of policy evaluation)*
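The backup above translates almost symbol-for-symbol into code. Below is a sketch on a made-up 2-state MDP; the table layout `P[s][a] = [(prob, s_next, reward), ...]` and the policy table are assumptions for the example, not a standard API:

```python
gamma = 0.9
states, actions = [0, 1], [0, 1]

# p(s', r | s, a) as lists of (prob, s_next, reward) triples (invented MDP).
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # usually reach 1, reward 1
    1: {0: [(1.0, 1, 1.0)],                  # stay in 1, reward 1
        1: [(1.0, 0, 0.0)]},                 # fall back to 0
}

# A fixed stochastic policy pi(a | s).
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}

V = {s: 0.0 for s in states}
for _ in range(200):                         # repeated Bellman backups
    V = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in actions
        )
        for s in states
    }

print({s: round(v, 3) for s, v in V.items()})   # ≈ {0: 8.696, 1: 10.0}
```

Note the outer sum over actions weighted by \(\pi(a\mid s)\) and the inner sum over \((s', r)\) outcomes weighted by \(p(s',r\mid s,a)\): the code is the expectation written out.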

Policy Improvement

Once you have a decent estimate of value, improving the policy becomes the straightforward step: for each state, pick the action with the best expected one-step outcome plus discounted downstream value. $$ \pi_{k+1}(s)=\arg\max_{a\in\mathcal{A}} \sum_{s',r} p(s',r\mid s,a)\left[r + \gamma V_k(s')\right]. $$ Narratively: policy evaluation answers “how good is each state under my current behavior,” and policy improvement answers “given that, what should I do the next time I’m in this state?”
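The greedy step is a one-liner once you have a one-step lookahead function. As a sketch, reusing the same invented 2-state MDP tables and plugging in value estimates assumed to come from policy evaluation:

```python
gamma = 0.9

# Same invented MDP tables: P[s][a] = [(prob, s_next, reward), ...].
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 1.0)],
        1: [(1.0, 0, 0.0)]},
}

V = {0: 8.696, 1: 10.0}    # values under the old policy (illustrative)

def q(s, a):
    """One-step expected return: sum over outcomes of p * (r + gamma * V[s'])."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

# pi'(s) = argmax_a q(s, a): act greedily w.r.t. the latest V.
pi_new = {s: max(P[s], key=lambda a, s=s: q(s, a)) for s in P}
print(pi_new)   # {0: 1, 1: 0}
```

Here state 0 switches to the action that usually reaches the rewarding state, and state 1 keeps the action that stays there: exactly the "best one-step outcome plus downstream value" rule.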

Stopping Criterion

The algorithmic picture from the lecture is: initialize values (often via terminal costs), repeat backups and greedy improvements, and stop when the values stabilize.

*(Figure: the DP loop)*

A typical convergence criterion is: $$ |V_k(s) - V_{k-1}(s)| < \varepsilon \qquad \forall s\in\mathcal{S}. $$ The key computational win over brute force: DP avoids exponential dependence on the horizon \(N\) by reusing the value of subproblems. It can still be expensive in the number of states (the curse of dimensionality), but it is no longer hopeless for small tabular models.
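Putting evaluation, improvement, and the \(\varepsilon\) stopping rule together gives the full loop. This is a sketch on the same style of invented tabular MDP, not a reference implementation:

```python
gamma, eps = 0.9, 1e-8

# Invented tabular MDP: P[s][a] = [(prob, s_next, reward), ...].
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 1.0)], 1: [(1.0, 0, 0.0)]},
}

def backup(s, a, V):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

pi = {s: 0 for s in P}                 # start from an arbitrary policy
V = {s: 0.0 for s in P}

while True:
    # Policy evaluation: sweep until the max value change drops below eps.
    while True:
        V_new = {s: backup(s, pi[s], V) for s in P}
        delta = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if delta < eps:
            break
    # Policy improvement: greedy w.r.t. the latest V.
    pi_new = {s: max(P[s], key=lambda a, s=s: backup(s, a, V)) for s in P}
    if pi_new == pi:                   # policy stable => done
        break
    pi = pi_new

print(pi, {s: round(v, 2) for s, v in V.items()})
```

Note the cost per sweep is roughly (states × actions × outcomes), and the number of sweeps depends on \(\gamma\) and \(\varepsilon\), never exponentially on a lookahead horizon.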

Example: Inverted Pendulum

*(Figure: policy evaluation on the inverted pendulum)*

You can consult the code here:
rl_3_dynamicprogramming.py

Quiz questions

Q: Brute force scales exponentially with what?
Q: Dynamic programming scales exponentially with what?
Q: What do we need to use dynamic programming?
Q: True or false: dynamic programming is an application of GPI.
Q: What is the purpose of policy evaluation?
Q: What is the purpose of policy improvement?

Where this goes next

Dynamic programming is model-based. That’s both its strength and its limitation. If you don’t know the dynamics model, you can’t take expectations (or simulate reliably) during backups. The next step is to estimate value from experience rather than from a model: Monte Carlo methods sample trajectories and then propagate returns back through visited states using Bellman-style reasoning.

Supplemental resource

Richard Sutton & Andrew Barto — Reinforcement Learning: An Introduction, Chapter 4.