Week 1 - Chapter 3, Finite Markov Decision Process

Compulsory reading from Reinforcement Learning: An Introduction.

Finite Markov Decision Process

This summarizes the environment that an actor in a discrete Markovian universe experiences. It is given by:

  • States: A finite set of states $\mathcal{S}$ that the actor can be in.
  • Actions: For each state $s \in \mathcal{S}$ a finite set of actions $\mathcal{A}(s)$, sometimes it is convenient to refer to this as $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}(s)$.
  • Rewards: The value the actor gets from doing each action within a state, these are real values $\mathcal{R} \subset \mathbb{R}$.

We assume the actor works on discrete time steps $t = 0, 1, 2, \ldots$; at time $t$ it is in state $S_t$, takes action $A_t$ and gets reward $R_{t+1}$. The actor deterministically chooses $A_t$ when in state $S_t$, but we have a probability distribution that determines the reward and next state

$$p(s', r \mid s, a) := \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).$$

Read this as: the probability of ending up in state $s'$ with reward $r$ given they are in state $s$ and take action $a$. This is what determines how the world progresses. Notice it is Markovian as it does not depend on the earlier history $S_{t-1}, A_{t-1}, \ldots$.

It is sometimes useful to think of the state you are going to be in at time step $t$ as a random variable we refer to as $S_t$, and similarly for rewards as $R_t$.
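To make this concrete, below is a minimal Python sketch of one way the dynamics $p(s', r \mid s, a)$ of a finite MDP could be stored. The two-state example, the state and action names, and the `dynamics` dictionary layout are illustrative assumptions, not anything from the book.

```python
# Hypothetical two-state MDP: dynamics[(s, a)] is a list of
# (next_state, reward, probability) outcomes, whose probabilities sum to 1.
dynamics = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "move"): [("B", 1.0, 0.8), ("A", 0.0, 0.2)],
    ("B", "stay"): [("B", 0.5, 1.0)],
    ("B", "move"): [("A", 2.0, 0.6), ("B", 0.0, 0.4)],
}

states = {"A", "B"}
actions = {s: {a for (s2, a) in dynamics if s2 == s} for s in states}

# Sanity check: each (state, action) pair defines a probability distribution.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9
```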

Using the function $p$ we can calculate some other useful quantities.

State transition probability: When in state $s$ and you take the action $a$, the probability of ending up in $s'$:

$$p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a).$$

Expected state-action reward: Given you are in state $s$ and take action $a$, what is your expected reward:

$$r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).$$
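As a sketch of these two calculations, the helpers below compute $p(s' \mid s, a)$ and $r(s, a)$ from the `dynamics` dictionary assumed in the earlier sketch; the function names are my own.

```python
def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    return sum(p for (nxt, _, p) in dynamics[(s, a)] if nxt == s_next)


def expected_reward(dynamics, s, a):
    """r(s, a) = E[R_{t+1} | S_t = s, A_t = a]: each reward weighted by its probability."""
    return sum(r * p for (_, r, p) in dynamics[(s, a)])
```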

Discounted returns

If we run a game with infinitely many steps and positive expected reward at each step, then the total value of the game is infinite. This makes it hard to distinguish between options, as over the long term they are all worth infinite value. Therefore we instead consider discounted rewards.

Discounted rewards

Discounted rewards are a way to evaluate a Markov decision process with infinitely many steps. Let $R_{t+k+1}$ for $k = 0, 1, 2, \ldots$ be a sequence of rewards for this decision process. Let $0 \le \gamma < 1$ be some constant. Then we set the value of these choices as

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$

If all $R_{t+k+1}$ are bounded by some constant $M$ then we have $G_t \le \sum_{k=0}^{\infty} \gamma^k M = \frac{M}{1-\gamma}$ and this is finite.
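A quick numerical check of this bound, with illustrative values: when every reward equals the bound $M$, the discounted return approaches $M / (1 - \gamma)$.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k] for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


gamma, M = 0.9, 1.0
# A long constant-reward sequence gets close to the bound M / (1 - gamma) = 10.
print(discounted_return([M] * 1000, gamma))
```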

Value and policy functions

Policy (MDP)

In a Markov decision process a policy is how an actor will behave in a given situation, given by $\pi : \mathcal{S} \to \mathcal{A}$ where $\pi(s) \in \mathcal{A}(s)$. This concept can extend to become a probabilistic policy. Let $\Delta(\mathcal{A})$ be the set of probability distributions over $\mathcal{A}$. Then a probabilistic policy is given by $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, written $\pi(a \mid s)$, where if $\pi(a \mid s)$ is non-zero then $a \in \mathcal{A}(s)$.
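As a sketch, both kinds of policy have a natural dictionary representation; the states, actions and probabilities below are illustrative only, matching the hypothetical MDP sketched earlier.

```python
# Deterministic policy: state -> action.
deterministic_policy = {"A": "move", "B": "stay"}

# Probabilistic policy: state -> {action: probability}.
probabilistic_policy = {
    "A": {"move": 0.7, "stay": 0.3},
    "B": {"move": 0.1, "stay": 0.9},
}

# Each distribution should sum to 1 and only put mass on legal actions.
for s, dist in probabilistic_policy.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```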

The goal of reinforcement learning is to determine the optimum policy, one that maximizes some reward function such as the discounted rewards. However, if the reward is too far in the future it can be hard to determine what to do earlier on. Therefore we determine the ‘value’ of a given state - a statement about its long term worth to the actor.

Value function (RL)

A value function is a mapping of states to a long term worth for an actor in a Markov decision process. This is defined by $v : \mathcal{S} \to \mathbb{R}$.

Given you are acting along policy $\pi$ we can calculate the corresponding value function to be

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right].$$
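One way to see this definition in action is a Monte Carlo sketch: sample many (truncated) episodes under the policy and average their discounted returns. This assumes the `dynamics` and `probabilistic_policy` dictionaries from the earlier sketches; the horizon and episode count are arbitrary choices.

```python
import random


def sample_step(dynamics, s, a):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(s, a)]
    nxt, r, _ = random.choices(outcomes, weights=[p for (_, _, p) in outcomes])[0]
    return nxt, r


def mc_value(dynamics, policy, s0, gamma=0.9, horizon=200, episodes=2000):
    """Estimate v_pi(s0) = E_pi[G_t | S_t = s0] by averaging sampled returns."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            acts, weights = zip(*policy[s].items())
            a = random.choices(acts, weights=weights)[0]
            s, r = sample_step(dynamics, s, a)
            g += discount * r
            discount *= gamma
        total += g
    return total / episodes
```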

The best value function accurately evaluates each state's long term worth to the actor. When using discounted rewards this can be given by the Bellman equation.

Bellman equation

The Bellman equation is used to determine the optimum value function for a given Markov decision process. It defines this value function recursively as follows:

$$v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right].$$
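For a small finite MDP, one standard way to approximate the fixed point of this equation is value iteration (which the book covers later): repeatedly apply the right-hand side as an update until the values stop changing. The sketch below assumes the `dynamics`, `states` and `actions` structures from the earlier sketches.

```python
def value_iteration(dynamics, states, actions, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * v[nxt]) for (nxt, r, p) in dynamics[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```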

It is important to note that for many problems computing this optimum exactly is infeasible, for example in games like chess or Go. So reinforcement learning focuses on techniques that provide computationally efficient ways to approximate this optimum.