Week 1 - Chapter 3, Finite Markov Decision Process
Compulsory reading from Reinforcement Learning: An Introduction.
Finite Markov Decision Process
This summarises the environment that an actor in a discrete Markovian universe experiences. It is given by:

- States: A finite set of states $\mathcal{S}$ that the actor can be in.
- Actions: For each state $s \in \mathcal{S}$, a finite set of actions $\mathcal{A}(s)$; sometimes it is convenient to refer to this as $\mathcal{A}$.
- Rewards: The value the actor gets from doing each action within a state; these are real values $\mathcal{R} \subset \mathbb{R}$.

We assume the actor works on discrete time steps $t = 0, 1, 2, \ldots$; at time $t$ it is in state $S_t$, takes action $A_t$ and gets reward $R_{t+1}$. The actor deterministically chooses $A_t$ when in state $S_t$, but we have a probability distribution that determines the reward and next state, $$p(s', r \mid s, a) := \Pr\left(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\right).$$ Read this as the probability of ending up in state $s'$ with reward $r$ given the actor is in state $s$ and takes action $a$. This is what determines how the world progresses. Notice it is Markovian, as it does not depend on anything that happened before time $t$. It is sometimes useful to think of the state you are going to be in at time step $t$ as a random variable we refer to as $S_t$, and similarly for rewards as $R_t$.
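To make the dynamics function $p(s', r \mid s, a)$ concrete, here is a minimal Python sketch of a hypothetical two-state, two-action MDP; the state names, actions, rewards, and probabilities are invented for illustration and are not from the book.

```python
# Hypothetical finite MDP with states {"low", "high"} and actions {"wait", "work"}.
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. an explicit table of p(s', r | s, a).
dynamics = {
    ("low", "wait"):  [("low", 0.0, 1.0)],
    ("low", "work"):  [("high", 1.0, 0.7), ("low", 0.0, 0.3)],
    ("high", "wait"): [("high", 1.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "work"): [("high", 2.0, 0.8), ("low", -1.0, 0.2)],
}

# Sanity check: for every (state, action) pair the probabilities sum to 1,
# since p(., . | s, a) is a probability distribution over (s', r) pairs.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```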
Using the function $p$ we can define two useful quantities:

- State transition probability: When in state $s$ and taking action $a$, the probability of moving to state $s'$ is $$p(s' \mid s, a) := \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a).$$
- Expected state-action reward: Given you are in state $s$ and take action $a$, the expected reward is $$r(s, a) := \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).$$
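These two quantities can be read straight off such a table. A short sketch, assuming the same hypothetical `(state, action) -> [(next_state, reward, probability), ...]` layout used above:

```python
def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    return sum(prob for ns, _, prob in dynamics[(s, a)] if ns == s_next)

def expected_reward(dynamics, s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all (s', r) outcomes."""
    return sum(r * prob for _, r, prob in dynamics[(s, a)])

# One entry of the hypothetical table from the previous sketch:
dynamics = {("low", "work"): [("high", 1.0, 0.7), ("low", 0.0, 0.3)]}
print(transition_prob(dynamics, "low", "work", "high"))  # 0.7
print(expected_reward(dynamics, "low", "work"))          # 1.0 * 0.7 + 0.0 * 0.3 = 0.7
```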
Discounted returns
If we run a game with infinite steps which has positive expected reward for each step, then the value of the game is infinite. This can make it hard to distinguish between options, as both are worth infinite value over the long term. Therefore we instead consider discounted rewards.
Discounted rewards
Discounted rewards are a way to evaluate a Markov decision process with infinite steps. Let $R_t$ for $t = 1, 2, 3, \ldots$ be a sequence of rewards for this decision process. Let $0 \le \gamma < 1$ be some constant. Then we set the value of these choices from time $t$ onwards as $$G_t := \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$ If all $R_t$ are bounded by some constant $M$ then we have $$|G_t| \le \sum_{k=0}^{\infty} \gamma^k M = \frac{M}{1 - \gamma},$$ and this is finite.
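A small sketch of computing a discounted return, truncating the infinite sum at a finite horizon; the reward sequence and $\gamma$ are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma^k * R_{t+k+1}, truncated to the rewards provided."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical R_{t+1}, R_{t+2}, ...
gamma = 0.9
print(discounted_return(rewards, gamma))  # 1 + 0.9 + 0.81 + 0.729 = 3.439

# If every reward is bounded by M, the full infinite sum is bounded by M / (1 - gamma).
M = 1.0
print(M / (1 - gamma))  # 10.0
```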
Value and policy functions
Policy (MDP)
In a Markov decision process a policy is how an actor will behave in a given situation, given by $\pi: \mathcal{S} \to \mathcal{A}$ where $\pi(s) \in \mathcal{A}(s)$. This concept can extend to become a probabilistic policy. Let $\Delta(\mathcal{A})$ be the set of probability distributions over $\mathcal{A}$. Then a probabilistic policy is given by $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, where if $\pi(a \mid s)$ is non-zero then $a \in \mathcal{A}(s)$.
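A sketch of both kinds of policy, reusing the hypothetical state and action names from the earlier sketches: a deterministic policy maps each state to a single action, while a probabilistic policy maps each state to a distribution over its legal actions.

```python
import random

# Deterministic policy: pi(s) is a single action.
deterministic_pi = {"low": "work", "high": "wait"}

# Probabilistic policy: pi(a | s) is a distribution over actions,
# with non-zero probability only on actions in A(s).
probabilistic_pi = {
    "low":  {"work": 0.8, "wait": 0.2},
    "high": {"work": 0.3, "wait": 0.7},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_pi["low"])                  # always "work"
print(sample_action(probabilistic_pi, "low"))   # "work" about 80% of the time
```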
The goal of reinforcement learning is to determine the optimal policy, one that maximizes some reward function such as discounted rewards. However, if the reward is too far in the future it can be hard to determine what to do earlier on. Therefore we determine the ‘value’ of a given state - this is a statement about its long-term worth to the actor.
Value function (RL)
A value function is a mapping of states to a long-term worth for an actor in a Markov decision process. This is defined by $v: \mathcal{S} \to \mathbb{R}$. Given you are acting along policy $\pi$, we can calculate the value function for that policy to be $$v_{\pi}(s) := \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right].$$ The best value function accurately evaluates a state’s long-term worth to the actor. When using discounted rewards this can be given by the Bellman equation.
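One way to see the definition $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ in code is a Monte Carlo sketch: simulate many trajectories that start in $s$ and follow $\pi$, then average their discounted returns. The dynamics table and policy are the hypothetical ones from the earlier sketches, and truncating at a fixed horizon is an approximation of the infinite sum.

```python
import random

def step(dynamics, s, a):
    """Sample (next_state, reward) from the table of p(s', r | s, a)."""
    outcomes = dynamics[(s, a)]
    weights = [prob for _, _, prob in outcomes]
    ns, r, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return ns, r

def mc_value(dynamics, pi, s, gamma=0.9, episodes=5000, horizon=100):
    """Estimate v_pi(s) by averaging truncated discounted returns of sampled trajectories."""
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            state, reward = step(dynamics, state, pi[state])
            g += discount * reward
            discount *= gamma
        total += g
    return total / episodes

dynamics = {
    ("low", "wait"):  [("low", 0.0, 1.0)],
    ("low", "work"):  [("high", 1.0, 0.7), ("low", 0.0, 0.3)],
    ("high", "wait"): [("high", 1.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "work"): [("high", 2.0, 0.8), ("low", -1.0, 0.2)],
}
pi = {"low": "work", "high": "wait"}
print(mc_value(dynamics, pi, "low"))  # Monte Carlo estimate of v_pi("low")
```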
Bellman equation
The Bellman equation is used to determine the optimal value function $v_*$ for a given Markov decision process. It defines this value function recursively as follows: $$v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \left[r + \gamma \, v_*(s')\right].$$
It is important to note that for lots of problems computing this optimum exactly is infeasible, for example in games like chess or Go. So reinforcement learning focuses on techniques that provide computationally efficient ways to approximate this optimal value function.
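For small tabular problems like the hypothetical two-state MDP above, the Bellman optimality backup can be applied repeatedly until the values stop changing; this is value iteration, a standard dynamic-programming sketch rather than anything feasible for chess or Go. All names and numbers are again invented for illustration.

```python
def value_iteration(dynamics, states, gamma=0.9, tol=1e-8):
    """Approximate v_* by sweeping the Bellman optimality backup until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # max over actions of sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
            best = max(
                sum(prob * (r + gamma * v[ns]) for ns, r, prob in outcomes)
                for (s_, _a), outcomes in dynamics.items() if s_ == s
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

dynamics = {
    ("low", "wait"):  [("low", 0.0, 1.0)],
    ("low", "work"):  [("high", 1.0, 0.7), ("low", 0.0, 0.3)],
    ("high", "wait"): [("high", 1.0, 0.9), ("low", 0.0, 0.1)],
    ("high", "work"): [("high", 2.0, 0.8), ("low", -1.0, 0.2)],
}
print(value_iteration(dynamics, states=["low", "high"]))  # approximate v_* for each state
```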