Value-based RL: From Value Iteration to DQN

A summary of the essential ideas behind value-based RL.


Background

The Reinforcement Learning Problem

Reinforcement learning considers how an agent should act in an environment to achieve a certain goal. Much of the notation and structure of this post is derived from Peter Henderson’s COS 435. We define the environment as a Markov Decision Process (MDP): a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where

  • $\mathcal{S}$ is the state space
  • $\mathcal{A}$ is the action space
  • $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel, where $\mathcal{T}(s' \mid s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
  • $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the expected immediate reward function, where $R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
  • $\gamma \in [0, 1)$ is the discount factor

The agent’s goal is to find the optimal policy $\pi^*: \mathcal{S} \to \mathcal{A}$ that maximizes the expected discounted reward:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0\right]$$

where $r_t$ is the sampled reward received at time $t$. To simplify notation, we define the return $G_t$ as the discounted sum of rewards:

$$G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}.$$

Then, the optimal policy is

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[G_0 \mid s_0\right]$$
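As a quick sanity check on the definition, the return of a finite reward sequence can be computed by folding from the back; the reward values and $\gamma$ below are invented for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma^k r_k by folding from the back: G <- r + gamma * G."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```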

Value Functions

The value function is the expected return given that we start at state $s$ and follow policy $\pi$:

$$V^\pi(s) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s\right].$$

The optimal value function is the best expected return we can get from any policy:

$$V^*(s) = \max_{\pi} V^\pi(s).$$

Similarly, the action-value function is the expected return given that we start at state $s$, take action $a$, and follow policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s, a_t = a\right].$$

The optimal action-value function is the best expected return we can get from any policy, starting at state $s$ and taking action $a$:

$$Q^*(s, a) = \max_{\pi} Q^\pi(s, a).$$

Model-based vs. Model-free RL

A brief aside to frame the following discussion: RL methods are often split into model-based and model-free approaches. Model-based RL learns (or assumes access to) the environment dynamics (transition kernel $\mathcal{T}$ and reward function $R$), while model-free RL learns value functions or policies directly from experience (no explicit environment model).

Value Iteration

Let’s assume we know the environment dynamics. How can we recover the optimal policy?

Bellman Equations

The Bellman optimality equations say that the optimal value function $V^*(s)$ must satisfy:

$$V^*(s) = \max_a \left[R(s, a) + \gamma\, \mathbb{E}_{s' \sim \mathcal{T}(\cdot \mid s, a)} V^*(s')\right].$$

For the optimal action-value function $Q^*(s, a)$,

$$Q^*(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim \mathcal{T}(\cdot \mid s, a)}\left[\max_{a'} Q^*(s', a')\right].$$

The Bellman optimality operator $\mathcal{B}$ maps any value function $V$ to

$$(\mathcal{B}V)(s) = \max_a \left[R(s, a) + \gamma\, \mathbb{E}_{s' \sim \mathcal{T}(\cdot \mid s, a)} V(s')\right].$$

Notice that the Bellman optimality equation simply requires that the optimal value function be a fixed point of the Bellman operator:

$$V^* = \mathcal{B}V^*.$$

To measure how good a candidate value function $V$ is, we use the Bellman error (also called the Bellman residual), which measures how far $V$ is from being a fixed point of the Bellman operator:

$$\delta_B(s) = (\mathcal{B}V)(s) - V(s).$$
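To make the operator concrete, here is a minimal sketch that applies one Bellman backup to a tiny tabular MDP; the two-state transition and reward arrays are invented for illustration:

```python
import numpy as np

# Invented 2-state, 2-action MDP: T[s, a, s'] is the transition kernel, R[s, a] the reward.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman_operator(V):
    """(BV)(s) = max_a [R(s, a) + gamma * sum_s' T(s' | s, a) V(s')]."""
    return np.max(R + gamma * (T @ V), axis=1)

V = np.zeros(2)
print(bellman_operator(V))       # with V = 0, the backup reduces to max_a R(s, a)
print(bellman_operator(V) - V)   # the Bellman error at this V
```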

Value Iteration Algorithm

A natural idea arises out of the Bellman operator. We can initialize the value function for all states to some value, then repeatedly apply the Bellman operator until we get the optimal value function:

Initialize V(s) = 0 for all s in S
repeat
    Δ <- 0
    for each state s in S:
        v <- V(s)
        V(s) <- max_a [R(s, a) + γ * Σ_{s' in S} T(s' | s, a) V(s')]
        Δ <- max(Δ, |v - V(s)|)
until Δ < ε
return V
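The pseudocode above maps directly to a few lines of NumPy. The sketch below backs up every state at once; the MDP arrays are invented for illustration, with `T[s, a, s']` as the transition kernel and `R[s, a]` as the reward table:

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-8):
    """Apply the Bellman optimality backup until the largest change is below eps."""
    V = np.zeros(T.shape[0])
    while True:
        V_new = np.max(R + gamma * (T @ V), axis=1)   # backup for all states at once
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

# Invented 2-state, 2-action MDP
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(T, R, gamma=0.9))
```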

To better understand value iteration, it helps to walk through a grid world example; the textbooks listed in Sources do exactly that.

Policy Extraction

Given the optimal value function $V^*$, extracting the optimal policy is simple:

$$\pi^*(s) = \arg\max_a \left[R(s, a) + \gamma\, \mathbb{E}_{s' \sim \mathcal{T}(\cdot \mid s, a)} V^*(s')\right]$$
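In code, extraction is one argmax per state. The sketch below uses an invented 2-state MDP and a stand-in value vector (in practice, you would pass in value iteration's output):

```python
import numpy as np

def extract_policy(T, R, gamma, V):
    """pi(s) = argmax_a [R(s, a) + gamma * sum_s' T(s' | s, a) V(s')]."""
    return np.argmax(R + gamma * (T @ V), axis=1)

# Invented 2-state, 2-action MDP and a stand-in for V*
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V = np.array([5.0, 10.0])
print(extract_policy(T, R, 0.9, V))   # greedy action index for each state
```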

Framing Value Iteration

Value iteration is relatively simple and has convergence guarantees, but there are a few issues:

  1. We often don’t know the environment dynamics.
  2. State spaces can be large and thus make summing over all states computationally infeasible.

Learning Without a Model

Monte Carlo Control

Monte Carlo methods use random sampling to estimate quantities of interest. In our case, we can use rollouts to estimate values, avoiding explicit Bellman updates.

Value estimation becomes much simpler with Monte Carlo updates. For each iteration, we sample a trajectory from the environment starting at some state $s$ and following policy $\pi$. Then, we update the value function for each state in the trajectory:

$$V(s_t) \leftarrow V(s_t) + \eta\left[G_t - V(s_t)\right],$$

where $\eta$ is the learning rate and $G_t$ is the sampled return from time $t$.

For Monte Carlo control, we apply the same update idea to Q-functions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\left[G_t - Q(s_t, a_t)\right].$$

Then, we improve the policy greedily:

$$\pi_{k+1}(s) = \arg\max_a Q_k(s, a).$$

We repeat these two steps until the policy converges.

Since Monte Carlo control uses rollouts, we must follow a policy during rollouts (unlike in value iteration). If our policy doesn’t improve, then our Q-function estimates also won’t improve. This is why Monte Carlo control requires us to alternate between policy evaluation (updating the Q-function estimates) and policy improvement (improving the policy). Monte Carlo control is an instantiation of generalized policy iteration.
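One round of these two steps can be sketched as follows. The episode, state names, and hyperparameters are invented, and Q is a table keyed by (state, action); the update is the every-visit variant:

```python
from collections import defaultdict

def mc_update(Q, episode, gamma, eta):
    """Policy evaluation: move Q(s_t, a_t) toward the sampled return G_t (every-visit MC)."""
    g = 0.0
    for s, a, r in reversed(episode):   # accumulate the return from the back
        g = r + gamma * g
        Q[(s, a)] += eta * (g - Q[(s, a)])

def greedy_policy(Q, actions):
    """Policy improvement: act greedily with respect to the current Q."""
    return lambda s: max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)
episode = [("s0", "left", 0.0), ("s1", "right", 1.0)]   # one invented rollout of (s, a, r)
mc_update(Q, episode, gamma=0.9, eta=0.5)
pi = greedy_policy(Q, actions=["left", "right"])
print(pi("s1"))   # the greedy action in s1 after the update
```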

Temporal Difference Learning

A major issue with Monte Carlo control is that we must obtain an entire trajectory before performing updates. For policy evaluation under $\pi$, the ideal Bellman-style update is

$$V(s_t) \leftarrow V(s_t) + \eta\left[(\mathcal{B}^\pi V)(s_t) - V(s_t)\right]$$

where

$$(\mathcal{B}^\pi V)(s_t) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V(s_{t+1}) \mid s_t\right].$$

Instead, we can approximate the Bellman update using a single sampled transition to the next state $s_{t+1}$:

$$(\mathcal{B}^\pi V)(s_t) \approx r_{t+1} + \gamma V(s_{t+1}).$$

Plugging this approximation into the Bellman update gives the TD(0) update:

$$V(s_t) \leftarrow V(s_t) + \eta\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right].$$

Notice that TD learning relies on bootstrapping: the idea that we can use a guess to make a better guess. The term $\delta_t := r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error.

The TD error is an unbiased estimator of the Bellman error, conditioned on the current state (for some quantity $X$, an estimator $\hat{X}$ is unbiased if $\mathbb{E}[\hat{X}] = X$):

$$\begin{aligned} \mathbb{E}[\delta_t \mid s_t] &= \mathbb{E}[r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \mid s_t] \\ &= \mathbb{E}[r_{t+1} + \gamma V(s_{t+1}) \mid s_t] - V(s_t) \\ &= (\mathcal{B}^\pi V)(s_t) - V(s_t) \\ &= \delta_B^\pi(s_t). \end{aligned}$$
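A single TD(0) update then needs only one observed transition; the states, reward, and step size below are invented:

```python
def td0_update(V, s, r, s_next, gamma, eta):
    """V(s) <- V(s) + eta * [r + gamma * V(s') - V(s)]; returns the TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += eta * delta
    return delta

V = {"s0": 0.0, "s1": 0.0}   # tabular values, keyed by state name
delta = td0_update(V, "s0", 1.0, "s1", gamma=0.9, eta=0.5)
print(V["s0"], delta)   # V(s0) moves halfway toward the target r + gamma * V(s1) = 1.0
```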

However, TD(0) is biased: it bootstraps off the current (possibly inaccurate) estimate $V(s_{t+1})$ rather than the true return. To trade off bias against variance, we can change the number of steps $n$ we look ahead before bootstrapping, giving $n$-step TD updates:

$$V(s_t) \leftarrow V(s_t) + \eta\left[\left(\sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V(s_{t+n})\right) - V(s_t)\right]$$
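The $n$-step target sums $n$ discounted rewards and bootstraps the tail with the current value estimate; the rewards and bootstrap value below are invented:

```python
def n_step_target(rewards, v_boot, gamma):
    """Compute sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n V(s_{t+n}) for n = len(rewards)."""
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    return g + gamma**len(rewards) * v_boot

# Two rewards of 1, then bootstrap from a state valued at 10: 1 + 0.9 + 0.81 * 10
print(n_step_target([1.0, 1.0], v_boot=10.0, gamma=0.9))
```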

Bias-Variance Tradeoff

Q-Learning

Deep Q-Networks

Sources

  1. COS 435: Reinforcement Learning by Peter Henderson
  2. Reinforcement Learning: An Overview by Kevin Murphy
  3. Reinforcement Learning: An Introduction by Sutton and Barto