# Week 12 - Reinforcement learning
![[Transitions (MDP)]]
![[Reinforcement learning]]
There are 4 kinds of process we think about:
- Planning:
    - Input: Model (such as a [[Markov decision process]])
    - Output: Policy
    - E.g. [[Value iteration (MDP)]] or [[Policy Iteration (MDP)]]
- Learning:
    - Input: Transitions
    - Output: Policy
    - E.g. [[Reinforcement learning]]
- Modelling:
    - Input: Transitions
    - Output: Model
- Simulation:
    - Input: Model
    - Output: Transitions
We can compose these strategies in two major ways.
The first composes a simulator with a learner to produce an [[Reinforcement learning|RL]]-based planner: the simulator turns a model into transitions, and the learner turns those transitions into a policy. This composition gave the first major success of [[Reinforcement learning|RL]], a competitive backgammon player. The second composes a modeller with a planner, which is [[Model-based reinforcement learning]].
![[Model-based reinforcement learning]]
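As a sketch of how these compositions fit together (the type aliases and function names below are assumptions for illustration, not from the lecture): a simulator maps a model to transitions and a learner maps transitions to a policy, so chaining them maps a model to a policy, i.e. a planner; chaining a modeller with a planner gives the model-based route.
```python
from typing import Callable, List, Tuple

# Illustrative type aliases (assumed): a transition is (state, action, reward, next_state).
State, Action = str, str
Transition = Tuple[State, Action, float, State]
Model = object                                   # e.g. an MDP: transition + reward function
Policy = Callable[[State], Action]

Planner = Callable[[Model], Policy]              # planning:   model       -> policy
Learner = Callable[[List[Transition]], Policy]   # learning:   transitions -> policy
Modeller = Callable[[List[Transition]], Model]   # modelling:  transitions -> model
Simulator = Callable[[Model], List[Transition]]  # simulation: model       -> transitions

def planner_from(simulator: Simulator, learner: Learner) -> Planner:
    """First composition: simulate transitions from the model, then learn a policy from them."""
    return lambda model: learner(simulator(model))

def model_based_rl(modeller: Modeller, planner: Planner) -> Learner:
    """Second composition (model-based RL): fit a model to the transitions, then plan in it."""
    return lambda transitions: planner(modeller(transitions))
```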
## Approaches to [[Reinforcement learning]]
There are 3 different approaches to [[Reinforcement learning|RL]]:
- Policy search: Here you search directly over policies for the one that best fits the data.
- Value-function based: Here you use your example transitions to learn a utility function that maps each state to a value, which, as in [[Policy Iteration (MDP)]], you can then use to derive a policy.
- [[Model-based reinforcement learning]]: You use the transitions to generate a model, which consists of a transition function and reward function.
## Value-function based
When using [[Value iteration (MDP)]] we look for the optimal utility function
$$
U^{\ast}(s) = R(s) + \gamma \max_{a \in A_s} \sum_{s' \in S} T(s,a,s') U^{\ast}(s').
$$
For [[Policy Iteration (MDP)]], by contrast, we look for the optimal policy given a utility function,
$$
\pi^{\ast}(s) = \mbox{arg}\max_{a \in A_s} \sum_{s' \in S} T(s,a,s')U^{\ast}(s').
$$
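As a concrete sketch of these two steps, here is a minimal value iteration on a toy two-state MDP, followed by greedy policy extraction; the states, actions, probabilities, and rewards below are made-up illustrations, not from the lecture.
```python
# Value iteration on a toy two-state MDP (all names and numbers are illustrative).
# T[s][a] is a list of (next_state, probability) pairs; R[s] is the reward in state s.
gamma = 0.9
T = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {"s0": 0.0, "s1": 1.0}

U = {s: 0.0 for s in T}
for _ in range(100):  # apply the Bellman update repeatedly until (approximately) converged
    U = {
        s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[s][a]) for a in T[s])
        for s in T
    }

# Greedy policy extraction, as in the policy-iteration expression above.
pi = {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a])) for s in T}
print(U, pi)
```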
We would like a way to incorporate both approaches.
![[Q-function (RL)]]
The [[Q-function (RL)]] has nice properties. From the definition of $Q$,
$$
U^{\ast}(s) = \max_{a \in A_s} Q(s,a),
$$
and we can also write
$$
\pi^{\ast}(s) = \mbox{arg}\max_{a \in A_s} Q(s,a),
$$
as neither $\gamma$ nor $R(s)$ affects the $\mbox{arg}\max$, so this agrees with the policy-iteration expression above.
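Continuing the toy sketch above (so `T`, `R`, `gamma`, and `U` are the assumed names from that example), computing $Q$ once lets both $U^{\ast}$ and $\pi^{\ast}$ be read off directly:
```python
# Q(s, a) = R(s) + gamma * sum_{s'} T(s, a, s') U*(s'), reusing T, R, gamma, U from above.
Q = {(s, a): R[s] + gamma * sum(p * U[s2] for s2, p in T[s][a]) for s in T for a in T[s]}

# U*(s) = max_a Q(s, a)  and  pi*(s) = argmax_a Q(s, a).
U_from_Q = {s: max(Q[(s, a)] for a in T[s]) for s in T}
pi_from_Q = {s: max(T[s], key=lambda a: Q[(s, a)]) for s in T}
```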
## Incremental learning
![[Incremental learning]]
If we have a sequence of learning rates $\alpha_t$ where
$$
\sum_{t \in \mathbb{N}} \alpha_t = \infty \mbox{ and } \sum_{t \in \mathbb{N}} \alpha_t^2 < \infty,
$$
and we update the estimate $V$ incrementally with samples $x_t$ that are [[Independent identically distributed samples|i.i.d.]] draws of a random variable $X$, then $V \rightarrow \mathbb{E}[X]$.
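A minimal sketch of this: the classic schedule $\alpha_t = 1/t$ satisfies both conditions, and the running estimate drifts towards $\mathbb{E}[X]$; the Gaussian samples are just an assumed example distribution.
```python
import random

# Incremental estimate V <- V + alpha_t * (x_t - V) with alpha_t = 1/t,
# which satisfies sum(alpha_t) = infinity and sum(alpha_t^2) < infinity.
V = 0.0
for t in range(1, 100_001):
    x_t = random.gauss(3.0, 1.0)  # i.i.d. samples of X with E[X] = 3 (assumed example)
    V += (1.0 / t) * (x_t - V)

print(V)  # close to E[X] = 3
```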
![[Q-learning]]
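A sketch of the standard tabular Q-learning update applied to a stream of experience tuples $(s, a, r, s')$; the action set and the tiny hand-written transition list are assumptions for illustration.
```python
from collections import defaultdict

gamma, alpha = 0.9, 0.1
actions = ["stay", "go"]
Q = defaultdict(float)  # Q[(s, a)], implicitly 0 for unseen state-action pairs

# Replay a small batch of experience many times; each tuple is (s, a, r, s').
transitions = [("s0", "go", 0.0, "s1"), ("s1", "stay", 1.0, "s1"), ("s1", "go", 1.0, "s0")]
for s, a, r, s2 in transitions * 1000:
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # move Q(s, a) towards the sampled target
```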
## Choosing an action
The two main ideas for choosing actions are:
- choose randomly, and
- choose the best action.
The trouble with the first is that you never use what you are learning, so you are slow to explore the part of the space you actually care about. The trouble with the second is that you might never explore routes that are better than your current guess.
![[Explore exploit dilemma]]
You want to strike a balance between the two.
![[Epsilon-greedy exploration]]
If you really do explore infinitely often (every state-action pair keeps being tried) while $\epsilon_t \rightarrow 0$, then $\hat{Q} \rightarrow Q$ and $\hat{\pi} \rightarrow \pi^{\ast}$.
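A minimal epsilon-greedy selection sketch with a decaying $\epsilon_t$; the $1/t$ schedule is one assumed choice of decay, and `Q` is taken to be a dict keyed by $(s, a)$ as in the Q-learning sketch above.
```python
import random

def epsilon_greedy(Q, s, actions, t):
    """With probability eps_t pick a random action (explore), otherwise the greedy one (exploit)."""
    eps_t = 1.0 / t  # decays to 0, so later steps mostly exploit the current estimate
    if random.random() < eps_t:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```
Dropped into the Q-learning loop in place of a fixed action choice, early steps explore broadly while later steps mostly follow $\hat{\pi}$.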
![[Optimistic exploration]]
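One common way to implement optimistic exploration is optimistic initialisation: start every $Q(s,a)$ estimate at an upper bound on the return, so that acting greedily still drives you towards untried actions. A small sketch, assuming rewards bounded by some $R_{max}$ (an assumption, not from the lecture):
```python
# Optimistic initialisation: every Q(s, a) starts at R_max / (1 - gamma), an upper bound
# on the discounted return when rewards never exceed R_max (assumed bound).
R_max, gamma = 1.0, 0.9
states, actions = ["s0", "s1"], ["stay", "go"]
Q = {(s, a): R_max / (1 - gamma) for s in states for a in actions}
# Greedy action selection on this Q prefers actions whose estimates have not yet been
# pulled down by real experience, so untried actions keep getting explored.
```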