Week 2 - Temporal Difference learning
Within reinforcement learning there are broadly 3 types of algorithms:
- Model-based: These algorithms first aim to understand the environment by building or learning an explicit model of it. This model describes how the environment behaves (e.g., transition probabilities and rewards).
	- Model Learning/Update: Run through an example (an interaction with the environment) to collect data and update the parameters of an MDP model. This typically involves estimating state transition probabilities and reward functions.
	- Planning/Solving: Use the learned MDP model (e.g., using algorithms like Value iteration (MDP) or Policy Iteration (MDP)) to generate an optimal (or near-optimal) value function.
	- Policy Derivation: Use the calculated value function and the learned model to derive an optimal policy. This policy dictates the best action to take in each state, often by choosing the action that maximizes the expected future reward (e.g., using argmax over the Q-values derived from the value function and model).
- Value-function based (model-free): These algorithms bypass the need to build an explicit model of the environment. Instead, they directly learn a value function (or action-value function) that estimates the goodness of states or state-action pairs.
	- Value Function Update: Run through an example (an interaction with the environment) and use the observed outcomes (rewards and next states) to directly update the value function. This might be a state-value function V(s) or, more commonly, an action-value function Q(s,a), which estimates the expected return of being in state s and taking action a.
	- Policy Extraction: Implicitly or explicitly derive a policy for each state by evaluating the actions that lead to the highest estimated value. For Q-functions, this often means choosing the action a that maximizes Q(s,a) for a given state s.
- Policy Search (or Policy Optimization): This approach directly searches for an optimal policy without necessarily learning a value function or a model. The goal is to find a policy that directly maps states to actions and maximizes the cumulative reward.
	- Policy Evaluation and Update: Run through an example using the current policy. Based on the observed outcomes (rewards and trajectories), directly update the parameters of the policy to improve its performance. This often involves gradient-based methods where the policy parameters are adjusted in the direction that increases the expected return.
The different approaches have varying trade-offs. Generally, model-based approaches are often more computationally intensive for the planning step but can be more sample efficient (requiring fewer interactions with the environment) because they can “plan” internally without needing new real-world data for every decision. Conversely, model-free approaches (including value-function based and policy search) are typically less computationally intensive per update (as they don’t involve a full planning step) but are often more sample intensive, requiring many interactions with the environment to learn effectively. Policy search methods, in particular, can be good for continuous action spaces and may sometimes converge faster than value-based methods in certain complex scenarios, but they can also be more prone to local optima.
Temporal difference learning
The setup for Temporal Difference Learning involves a sequential interaction with an environment, typically broken down into distinct ‘runs’ or episodes. Each episode consists of a sequence of states, actions, and rewards. The overarching goal is to enable a learning agent to improve its actions in future episodes by refining its understanding of the value of different states.
For a single episode, say
$$s_0 \xrightarrow{a_0} r_1, s_1 \xrightarrow{a_1} r_2, s_2 \to \dots \to r_T, s_T,$$
where $s_T$ is a terminal state, define the return from time $t$ as
$$G_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}.$$
We estimate $V^\pi(s)$, the expected return obtained when starting from state $s$ and following the policy.
Notation
We can define the estimate $V(s)$ in multiple ways:
- First visit: Let $t$ be the first time step at which $s$ is visited in the episode, i.e. $s_t = s$ and $t$ is the smallest such index where this is true, and use the single return $G_t$.
- Every visit: Let every time step $t$ with $s_t = s$ contribute its return $G_t$, adding the values for each visit.
If we have an existing estimate $V(s)$ and observe a new return $G_t$ from a visit to $s$, we can update it incrementally with a learning rate $\alpha$:
$$V(s) \leftarrow V(s) + \alpha\big(G_t - V(s)\big).$$
This update moves the estimate $V(s)$ a fraction $\alpha$ of the way towards the observed return $G_t$.
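As a concrete sketch (not from the notes themselves), here is a minimal Python implementation of the every-visit variant with the incremental update; the episode format, a list of `(state, reward)` pairs where the reward is the one received after leaving that state, is an assumption made for the example.

```python
from collections import defaultdict

def mc_every_visit_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo update of a tabular value estimate V.

    `episode` is a list of (s_t, r_{t+1}) pairs in time order.
    """
    G = 0.0
    # Walk the episode backwards so the return G_t can be accumulated in one pass.
    for state, reward in reversed(episode):
        G = reward + gamma * G              # G_t = r_{t+1} + gamma * G_{t+1}
        V[state] += alpha * (G - V[state])  # move the estimate towards the observed return
    return V

# Example usage with made-up states and rewards:
V = defaultdict(float)
mc_every_visit_update(V, [("A", 0.0), ("B", 1.0), ("A", 2.0)])
```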
Learning rate convergence
Statement
Lemma
Given a Markov decision process, let $V_n(s)$ be the value estimate for a state $s$ after the $n$'th update. If we update this using the following update rule:
$$V_{n+1}(s) = V_n(s) + \alpha_n\big(v_n - V_n(s)\big),$$
where $v_n$ is a noisy sample of the true value with noise of mean 0, and $\alpha_n$ is a learning rate, then the incrementally learned $V_n(s)$ will converge to the true value in the limit, provided that every state is visited infinitely often and:
- The sum of the learning rates diverges: $\sum_n \alpha_n = \infty$, and
- The sum of the squared learning rates converges: $\sum_n \alpha_n^2 < \infty$.
Proof
Link to original
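For example, the schedule $\alpha_n = 1/n$ satisfies both conditions ($\sum 1/n$ diverges while $\sum 1/n^2$ converges), whereas a constant learning rate violates the second. The short simulation sketch below illustrates the update rule with this schedule; the true value, noise level, and sample count are made-up numbers for illustration only.

```python
import random

def incremental_estimate(true_value=5.0, noise_std=2.0, n_samples=100_000, seed=0):
    """Apply V <- V + alpha_n * (v_n - V) to zero-mean-noise samples of a fixed value."""
    rng = random.Random(seed)
    V = 0.0
    for n in range(1, n_samples + 1):
        v_n = true_value + rng.gauss(0.0, noise_std)  # noisy sample of the true value
        alpha_n = 1.0 / n                             # satisfies both convergence conditions
        V += alpha_n * (v_n - V)
    return V

print(incremental_estimate())  # approaches 5.0 as n_samples grows
```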
However, calculating the full return $G_t$ requires waiting until the episode has finished before any state's value can be updated.
What happened to the action?
It’s important to recognize that this derivation focuses on the value function $V(s)$. For directly learning which actions to take (i.e., for control), we typically extend these ideas to the quality (action-value) function $Q(s, a)$, which explicitly incorporates the chosen action. This leads to algorithms like SARSA and Q-learning.
TD(1)
In temporal difference (TD) methods, our aim is to update value function estimates incrementally as an episode unfolds, rather than waiting until the very end.
Previously, we updated a state's value estimate based on the full return $G_t$, which only becomes available once the episode has ended.
TD(1) offers an “online” way to achieve the same result as the full return method. It does this by distributing the credit for the eventual return back to all states visited earlier in the episode, using a mechanism called eligibility traces.
Eligibility Trace Mechanism for TD(1)
Within each episode, we maintain an eligibility trace $e(s)$ for every state $s$, initialized to zero. At each time step $t$ we do two things to the traces:
- Increment Current State’s Trace: $e(s_t) \leftarrow e(s_t) + 1$
- Decay All Traces: For all states $s$, $e(s) \leftarrow \gamma e(s)$ (where $\gamma$ is the discount factor)
This decay by $\gamma$ means that states visited further in the past receive exponentially less credit for the current TD error.
TD(1) Update Rule
When an episode begins we keep a copy of the value function from the previous episode, $V_{\text{old}}$, and hold it fixed throughout the episode. At each step $t$ we compute the TD error with respect to $V_{\text{old}}$:
$$\delta_t = r_{t+1} + \gamma V_{\text{old}}(s_{t+1}) - V_{\text{old}}(s_t).$$
Then, we update the value estimate for every state $s$ based on this TD error, weighted by that state's eligibility trace:
$$V(s) \leftarrow V(s) + \alpha\,\delta_t\,e(s),$$
where $\alpha$ is the learning rate.
Here’s the pseudocode for TD(1) (with accumulating traces):
Initialize V(s) arbitrarily for all s in S
For each episode:
    Set V_old = V
    // We will use V for V_new
    Set e(s) = 0 for all s in S (reset traces for new episode)
    Initialize starting state s_0
    For each step t from 0 until episode terminates (s_t -> a_t -> r_{t+1} -> s_{t+1}):
        // 1. Calculate TD Error for current step using V_old
        // Note: if s_{t+1} is terminal, V_old(s_{t+1}) is 0
        delta_t = r_{t+1} + gamma * V_old(s_{t+1}) - V_old(s_t)
        // 2. Update eligibility trace for current state
        e(s_t) = e(s_t) + 1
        // 3. Update all state values in V_new and decay all eligibility traces
        For each state s in S:
            V(s) = V(s) + alpha * delta_t * e(s)
            e(s) = gamma * e(s)
Output: V
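A direct Python translation of this pseudocode for the tabular case might look as follows. The environment interface (`env.reset()` returning the initial state, `env.step(action)` returning `(reward, next_state, done)`) and the `policy` callable are assumptions made for the sketch, not something fixed by the notes.

```python
from collections import defaultdict

def td1_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """TD(1) with accumulating eligibility traces for estimating V under a fixed policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        V_old = dict(V)              # frozen copy used for this episode's TD errors
        e = defaultdict(float)       # eligibility traces, reset at the start of each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env.step(a)
            v_next = 0.0 if done else V_old.get(s_next, 0.0)  # terminal states have value 0
            delta = r + gamma * v_next - V_old.get(s, 0.0)    # 1. TD error w.r.t. V_old
            e[s] += 1.0                                       # 2. accumulate current state's trace
            for state in list(e):                             # 3. update values, decay traces
                V[state] += alpha * delta * e[state]
                e[state] *= gamma
            s = s_next
    return V
```

Looping only over states with non-zero traces is equivalent to the "for each state s in S" loop in the pseudocode, since states with $e(s) = 0$ receive no update.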
The remarkable property of TD(1) is that, over the course of a single episode, the total update made to the value of a state $s$ is exactly the full-return (Monte Carlo) update $\alpha\big(G_t - V_{\text{old}}(s)\big)$ that we would have applied at the end of the episode.
To see why the update rule is equivalent, let's assume some state $s$ is visited exactly once, at time $t$. When the update for step $k \ge t$ is applied, its trace is $e(s) = \gamma^{k-t}$, so the total update it receives over the episode is
$$\alpha \sum_{k=t}^{T-1} \gamma^{k-t}\,\delta_k = \alpha \sum_{k=t}^{T-1} \gamma^{k-t}\big(r_{k+1} + \gamma V_{\text{old}}(s_{k+1}) - V_{\text{old}}(s_k)\big) = \alpha\big(G_t - V_{\text{old}}(s_t)\big),$$
because the $V_{\text{old}}$ terms telescope and $V_{\text{old}}(s_T) = 0$ at the terminal state.
This generalizes to work for states that are revisited and matches the every visit definition of the estimate above.
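The telescoping identity is easy to check numerically for a single episode; the rewards, values, and discount factor below are arbitrary made-up numbers.

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0]      # r_1, ..., r_T
V_old = [0.5, -0.2, 1.3, 0.7, 0.0]   # V_old(s_0), ..., V_old(s_T); the terminal value is 0
T = len(rewards)

# Discounted sum of the one-step TD errors from t = 0.
deltas = [rewards[k] + gamma * V_old[k + 1] - V_old[k] for k in range(T)]
td_total = sum(gamma**k * d for k, d in enumerate(deltas))

# Monte Carlo error: full return minus the old estimate of the starting state.
G_0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(abs(td_total - (G_0 - V_old[0])) < 1e-9)  # True
```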
Issues with TD(1)
While TD(1) offers the appeal of using total returns, it also suffers from similar drawbacks:
- High Variance: TD(1) relies on the full, actual (but noisy) return $G_t$ as its effective target (over an episode). This “target” can vary significantly from one episode to another, leading to high variance in the value function estimates. This high variance can make learning slow and unstable, especially in environments with long episodes or high stochasticity. The updates for $V(s)$ are influenced by potentially very distant future rewards, which can be unpredictable.
- Sensitivity to Long Episodes / Computational Cost: Because it involves summing and propagating effects over entire episodes, TD(1) can be computationally more intensive per step for very long episodes. All eligibility traces for all states must be updated and decayed at each time step. Additionally, for extremely long episodes, the value of $\gamma^k$ for large $k$ can become infinitesimally small, making the traces effectively zero for distant past states, requiring significant computation for minimal impact.
- No Immediate Bootstrapping Advantage: While the TD error $\delta_t$ itself involves bootstrapping (using $\gamma V_{\text{old}}(s_{t+1})$), the cumulative effect of TD(1) over an episode is a Monte Carlo-like update. This means it doesn’t fully leverage the “bootstrapping” advantage of updating estimates based primarily on other estimates from immediate successor states. This immediate bootstrapping property can be highly beneficial for faster learning and reducing variance in certain scenarios.
These issues motivate the need for methods that can leverage the benefits of bootstrapping more directly, leading us to consider other forms of temporal difference learning, such as TD(0).
TD(0)
While TD(1) provides an elegant online method for computing total returns, it inherently suffers from practical issues, mainly high variance. The target for its effective update (the full return $G_t$) depends on every reward until the end of the episode, and so can fluctuate wildly from one episode to the next.
A different perspective for learning value functions is to aim for Bellman consistency. The true value function $V^\pi$ satisfies the Bellman equation, $V^\pi(s) = \mathbb{E}\big[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\big]$, so a good estimate should agree with the values of its immediate successor states.
TD(0) (or one-step TD) directly addresses this. Instead of waiting for the full episodic return, TD(0) uses a bootstrapped target based on the very next observed reward and the estimated value of the very next state. This target, $r_{t+1} + \gamma V(s_{t+1})$, is available immediately after a single step, so updates can be made online with much lower variance.
The TD(0) update rule for a state $s_t$ is
$$V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t, \qquad \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\delta_t$ is the one-step TD error. In pseudocode:
Initialize V(s) arbitrarily for all s in S
For each episode:
    Set V_old = V
    // We will use V for V_new
    For each step t from 0 until episode terminates (s_t -> a_t -> r_{t+1} -> s_{t+1}):
        // 1. Calculate TD Error for current step using V_old
        // Note: if s_{t+1} is terminal, V_old(s_{t+1}) is 0
        delta_t = r_{t+1} + gamma * V_old(s_{t+1}) - V_old(s_t)
        Set V(s_t) = V(s_t) + alpha * delta_t
Output: V
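Under the same assumed environment interface as in the TD(1) sketch above, a tabular TD(0) implementation is compact:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """One-step TD (TD(0)) estimation of state values under a fixed policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        V_old = dict(V)                                       # frozen copy for this episode's TD errors
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env.step(a)
            v_next = 0.0 if done else V_old.get(s_next, 0.0)  # terminal states have value 0
            delta = r + gamma * v_next - V_old.get(s, 0.0)    # one-step TD error
            V[s] += alpha * delta                             # update only the current state
            s = s_next
    return V
```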
TD(0) and Maximum Likelihood Estimation
From a theoretical standpoint, TD(0) can be seen as an approach to find the value function that is a maximum likelihood estimate (MLE) for the parameters of the Bellman equation, given the observed single-step transitions. More precisely, it often corresponds to minimizing the sum of squared one-step TD errors over the dataset of transitions.
While TD(1) aims to average the true total returns (making it an unbiased estimator but with high variance), TD(0) aims to achieve local consistency with the Bellman equations for each observed transition. This means that TD(0) is trying to make the value of $s_t$ agree with the one-step target $r_{t+1} + \gamma V(s_{t+1})$ for every transition it observes.
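To make the "sum of squared one-step TD errors" reading concrete, the sketch below (with invented transition data) spells out the loss that each TD(0) step nudges downward; note that TD(0) is a semi-gradient method, so the bootstrapped target $r + \gamma V(s')$ is treated as a constant when taking the step.

```python
def squared_td_error_loss(V, transitions, gamma=0.99):
    """Sum of squared one-step TD errors over observed (s, r, s_next, done) transitions."""
    loss = 0.0
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))  # bootstrapped, held constant
        loss += (target - V.get(s, 0.0)) ** 2
    return loss

# Each TD(0) update V(s) += alpha * (target - V(s)) reduces this loss term
# for the corresponding single transition (made-up example data below).
transitions = [("A", 1.0, "B", False), ("B", 0.0, "C", True)]
print(squared_td_error_loss({"A": 0.5, "B": 0.2}, transitions))
```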