Value-based RL: From Value Iteration to DQN
A summary of the essential ideas behind value-based RL.
Background
The Reinforcement Learning Problem
Reinforcement learning considers how an agent should act in an environment to achieve a certain goal. Much of the notation and structure of this post is derived from Peter Henderson’s COS 435. We define the environment as a Markov Decision Process (MDP): a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where
- $\mathcal{S}$ is the state space
- $\mathcal{A}$ is the action space
- $T$ is the transition kernel, where $T(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
- $R$ is the expected immediate reward function, where $R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
- $\gamma \in [0, 1)$ is the discount factor
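To make the definitions concrete, here is a minimal sketch of a tiny hypothetical MDP represented with NumPy arrays. Everything here (the two states, two actions, and all the numbers) is an illustrative assumption, not something from the post:

```python
import numpy as np

# A hypothetical two-state, two-action MDP (all numbers are illustrative).
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# T[s, a, s'] = probability of landing in s' after taking action a in state s
T = np.zeros((n_states, n_actions, n_states))
T[0, 0] = [1.0, 0.0]   # "stay" in state 0
T[0, 1] = [0.2, 0.8]   # "move": usually reach state 1
T[1, 0] = [0.0, 1.0]   # "stay" in state 1
T[1, 1] = [0.9, 0.1]   # "move": usually fall back to state 0

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])

# Sanity check: each (s, a) slice of T must be a probability distribution
assert np.allclose(T.sum(axis=-1), 1.0)
```

The later sketches in this post reuse this same toy MDP so the numbers can be checked by hand.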
The agent’s goal is to find the optimal policy $\pi^*$ that maximizes the expected discounted reward:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],$$

where $r_t$ is the sampled reward received at time $t$. To simplify notation, we define the return $G_t$ as the discounted sum of rewards from time $t$:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

Then, the optimal policy is

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[G_0\right].$$
Value Functions
The value function $V^\pi(s)$ is the expected return given that we start at state $s$ and follow policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right].$$

The optimal value function $V^*(s)$ is the best expected return we can get from any policy:

$$V^*(s) = \max_\pi V^\pi(s).$$

Similarly, the action-value function $Q^\pi(s, a)$ is the expected return given that we start at state $s$, take action $a$, and follow policy $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right].$$

The optimal action-value function $Q^*(s, a)$ is the best expected return we can get from any policy, starting at state $s$ and taking action $a$:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a).$$
Model-based vs. Model-free RL
A brief aside to frame the following discussion: RL methods are often split into model-based and model-free approaches. Model-based RL learns (or assumes access to) the environment dynamics (transition probabilities $T$ and reward function $R$), while model-free RL learns value functions or policies directly from experience (no explicit environment model).
Value Iteration
Let’s assume we know the environment dynamics $T$ and $R$. How can we recover the optimal policy?
Bellman Equations
The Bellman optimality equations say that the optimal value function must satisfy

$$V^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V^*(s')\right].$$

For the optimal action-value function $Q^*$,

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, \max_{a'} Q^*(s', a').$$

The Bellman optimality operator $\mathcal{T}$ maps any value function $V$ to

$$(\mathcal{T}V)(s) = \max_a \left[R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V(s')\right].$$

Notice that the Bellman optimality equation simply requires that the optimal value function be a fixed point of the Bellman operator:

$$\mathcal{T}V^* = V^*.$$

To measure how good a candidate value function $V$ is, we use the Bellman error (also called the Bellman residual), i.e., how far $V$ is from being a fixed point of the Bellman operator. The Bellman error is then

$$\|\mathcal{T}V - V\|_\infty = \max_{s \in \mathcal{S}} \left|(\mathcal{T}V)(s) - V(s)\right|.$$
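On a small tabular MDP, the Bellman optimality operator and the sup-norm Bellman error each take only a few lines. This is a sketch on the made-up two-state MDP from earlier (all names and numbers are illustrative):

```python
import numpy as np

# Made-up two-state MDP: T[s, a, s'] transition probs, R[s, a] expected rewards.
gamma = 0.9
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])

def bellman_operator(V):
    """Apply the Bellman optimality operator: (TV)(s) = max_a [R + γ Σ_s' T V]."""
    Q = R + gamma * (T @ V)   # Q[s, a]; (T @ V)[s, a] = Σ_s' T[s, a, s'] V[s']
    return Q.max(axis=1)

def bellman_error(V):
    """Sup-norm distance of V from being a fixed point of the operator."""
    return np.max(np.abs(bellman_operator(V) - V))

V0 = np.zeros(2)
# For V = 0, (TV)(s) = max_a R(s, a), so the error here is max_s max_a R(s, a) = 1.
err = bellman_error(V0)
```

The `T @ V` trick works because NumPy's matmul broadcasts over the leading `(s, a)` axes, computing the expectation over next states in one shot.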
Value Iteration Algorithm
A natural idea arises out of the Bellman operator. We can initialize the value function for all states to some value, then repeatedly apply the Bellman operator until we get the optimal value function:
```
Initialize V(s) = 0 for all s in S
repeat
    Δ <- 0
    for each state s in S:
        v <- V(s)
        V(s) <- max_a [R(s, a) + γ * Σ_{s' in S} T(s' | s, a) V(s')]
        Δ <- max(Δ, |v - V(s)|)
until Δ < ε
return V
```
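The pseudocode above translates nearly line-for-line into NumPy. This is a sketch on the same made-up two-state MDP, not code from the post:

```python
import numpy as np

# Made-up two-state MDP (T[s, a, s'] transition probs, R[s, a] rewards).
gamma = 0.9
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])

def value_iteration(T, R, gamma, eps=1e-10):
    V = np.zeros(T.shape[0])
    while True:
        # Bellman backup for every state at once: V(s) <- max_a [R + γ Σ_s' T V]
        V_new = (R + gamma * (T @ V)).max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:   # Δ < ε: converged
            return V_new
        V = V_new

V_star = value_iteration(T, R, gamma)
# V_star should (approximately) satisfy the Bellman optimality equation.
```

On this toy MDP, $V^*(1) = 1/(1-\gamma) = 10$ exactly (stay in state 1 and collect reward 1 forever), which makes the output easy to check by hand.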
To better understand value iteration, one should walk through a grid world example. The textbooks in Sources walk through grid world examples.
Policy Extraction
Given the optimal value function $V^*$, extracting the optimal policy is simple: act greedily with respect to a one-step lookahead:

$$\pi^*(s) = \arg\max_a \left[R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V^*(s')\right].$$
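Continuing the made-up two-state example, policy extraction is a one-liner once we have $V^*$ (a sketch, assuming the same illustrative `T`, `R`, and `gamma` as before):

```python
import numpy as np

# Made-up two-state MDP from the earlier sketches.
gamma = 0.9
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])

# Run value iteration to get V* (same backup as before, condensed).
V = np.zeros(2)
for _ in range(1000):
    V = (R + gamma * (T @ V)).max(axis=1)

# Greedy one-step lookahead: pi*(s) = argmax_a [R(s,a) + γ Σ_s' T(s'|s,a) V*(s')]
pi_star = np.argmax(R + gamma * (T @ V), axis=1)
```

In state 1 the greedy policy picks the "stay" action (reward 1 forever beats the 0.5 detour), and in state 0 it picks the action most likely to reach state 1.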
Framing Value Iteration
Value iteration is relatively simple and has convergence guarantees, but there are a few issues:
- We often don’t know the environment dynamics.
- State spaces can be large, making the sweep (and the sum) over all states computationally infeasible.
Learning Without a Model
Monte Carlo Control
Monte Carlo methods use random sampling to estimate quantities of interest. In our case, we can use rollouts to estimate values and avoid explicitly computing Bellman updates.
Value iteration becomes much simpler with Monte Carlo updates. For each iteration, we sample a trajectory from the environment starting at some state and following policy $\pi$. Then, we update the value function for each state $s_t$ in the trajectory:

$$V(s_t) \leftarrow V(s_t) + \alpha \left(G_t - V(s_t)\right),$$

where $\alpha$ is the learning rate and $G_t$ is the sampled return from time $t$.
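A sketch of this Monte Carlo update on the made-up two-state MDP, evaluating an assumed fixed policy with every-visit updates (the policy, horizon, and constants are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.05

# Made-up two-state MDP (T[s, a, s'], R[s, a]) and a hypothetical fixed policy.
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])
pi = np.array([1, 0])   # "move" in state 0, "stay" in state 1

V = np.zeros(2)
for episode in range(300):
    # Sample a (truncated) trajectory following pi.
    s, traj = int(rng.integers(2)), []
    for _ in range(60):
        a = pi[s]
        traj.append((s, R[s, a]))
        s = int(rng.choice(2, p=T[s, a]))
    # Walk the trajectory backwards, accumulating the sampled return G_t,
    # and nudge V(s_t) toward it (every-visit Monte Carlo).
    G = 0.0
    for s_t, r_t in reversed(traj):
        G = r_t + gamma * G
        V[s_t] += alpha * (G - V[s_t])
```

Under this policy the true values are $V^\pi(1) = 1/(1-\gamma) = 10$ and $V^\pi(0) = 7.2/0.82 \approx 8.78$, which the noisy estimates should approach.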
For Monte Carlo control, we apply the same update idea to Q-functions:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left(G_t - Q(s_t, a_t)\right).$$

Then, we improve the policy greedily:

$$\pi(s) \leftarrow \arg\max_a Q(s, a).$$

We repeat these two steps until the policy converges.
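These two alternating steps can be sketched on the same made-up MDP. An ε-greedy behavior policy (an assumption, not from the post) keeps all actions visited so the Q estimates stay honest:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, eps = 0.9, 0.1, 0.2   # illustrative constants

# Made-up two-state MDP (T[s, a, s'], R[s, a]).
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])

Q = np.zeros((2, 2))
for episode in range(500):
    # Roll out the current ε-greedy policy (implicit policy improvement).
    s, traj = int(rng.integers(2)), []
    for _ in range(40):
        if rng.random() < eps:
            a = int(rng.integers(2))     # explore
        else:
            a = int(np.argmax(Q[s]))     # exploit the current Q estimates
        traj.append((s, a, R[s, a]))
        s = int(rng.choice(2, p=T[s, a]))
    # Policy evaluation: nudge Q(s_t, a_t) toward the sampled returns G_t.
    G = 0.0
    for s_t, a_t, r_t in reversed(traj):
        G = r_t + gamma * G
        Q[s_t, a_t] += alpha * (G - Q[s_t, a_t])

greedy_policy = Q.argmax(axis=1)
```

Because the greedy action is recomputed from `Q` on every step, evaluation and improvement are interleaved rather than run as two separate phases.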
Since Monte Carlo control relies on rollouts, we must follow some behavior policy to collect data (unlike in value iteration). If our policy doesn’t improve, then our Q-function estimates also won’t improve. This is why Monte Carlo control requires us to alternate between policy evaluation (updating the Q-function estimates) and policy improvement (acting greedily with respect to them). Monte Carlo control is an instantiation of policy iteration.
Temporal Difference Learning
A major issue with Monte Carlo control is that we must obtain an entire trajectory before performing updates. For policy evaluation under $\pi$, the ideal Bellman-style update is

$$V(s_t) \leftarrow V(s_t) + \alpha \left((\mathcal{T}^\pi V)(s_t) - V(s_t)\right),$$

where

$$(\mathcal{T}^\pi V)(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[R(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V(s')\right].$$

Computing this expectation requires the model. Instead, we can approximate the Bellman update using a single sampled transition to the next state $s_{t+1}$:

$$(\mathcal{T}^\pi V)(s_t) \approx r_t + \gamma V(s_{t+1}).$$

Plugging this approximation into the Bellman update gives the TD(0) update:

$$V(s_t) \leftarrow V(s_t) + \alpha \left(r_t + \gamma V(s_{t+1}) - V(s_t)\right).$$

Notice that TD learning relies on bootstrapping, the idea that we can use a guess to make a better guess. The term $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error.
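A sketch of TD(0) policy evaluation on the made-up two-state MDP, with an assumed fixed policy. Unlike the Monte Carlo sketch, every update uses only a single transition:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.05   # illustrative constants

# Made-up two-state MDP (T[s, a, s'], R[s, a]) and a hypothetical fixed policy.
T = np.zeros((2, 2, 2))
T[0, 0] = [1.0, 0.0]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0],
              [1.0, 0.5]])
pi = np.array([1, 0])

V = np.zeros(2)
for episode in range(400):
    s = int(rng.integers(2))   # restart so both states keep being visited
    for _ in range(50):
        a = pi[s]
        r = R[s, a]
        s_next = int(rng.choice(2, p=T[s, a]))
        delta = r + gamma * V[s_next] - V[s]   # TD error δ_t
        V[s] += alpha * delta                  # TD(0) update
        s = s_next
```

The update happens inside the episode, step by step, so learning can begin immediately instead of waiting for a full trajectory; the estimates should approach the true values $V^\pi \approx [8.78, 10]$.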
The TD error is an unbiased estimator (for some quantity $x$, an estimator $\hat{x}$ is unbiased if $\mathbb{E}[\hat{x}] = x$) of the Bellman error, conditioned on the current state:

$$\mathbb{E}\left[\delta_t \mid s_t = s\right] = (\mathcal{T}^\pi V)(s) - V(s).$$

However, because TD(0) bootstraps off our current estimate of $V$, its target is biased whenever that estimate is wrong, in exchange for much lower variance than Monte Carlo. To trade off bias against variance, we can change the number of steps we look ahead, giving n-step TD updates:

$$V(s_t) \leftarrow V(s_t) + \alpha \left(\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)\right).$$