Week 11 - Markov Decision Processes

Markov decision process

A Markov decision process is defined by the following data:

  • A set of states $S$,
  • A set of actions $A$ (which might depend on the state, i.e. for each state $s \in S$ we have a set $A_s$),
  • A model which determines, for each pair of states $s, s' \in S$ and each action $a \in A_s$, the transition probability from $s$ to $s'$ given action $a$, i.e. $P(s' \mid s, a)$, and
  • A reward $R(s)$ for being in a given state $s$.

When provided with this data we are looking for a policy $\pi \colon S \to A$ that determines what action we take in each state.
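To make this concrete, here is a minimal sketch (in Python, with purely illustrative names) of how this data could be stored; none of these types or field names are prescribed by the definition above.

```python
from typing import Dict, List, Tuple, NamedTuple

State = Tuple[int, int]   # e.g. grid coordinates; any hashable type would do
Action = str              # e.g. "up", "down", "left", "right"

class MDP(NamedTuple):
    states: List[State]
    actions: Dict[State, List[Action]]                            # A_s: actions available in state s
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P(s' | s, a)
    rewards: Dict[State, float]                                   # R(s): reward for being in state s
```

A policy is then just a map from states to actions, e.g. a `Dict[State, Action]`.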

Find home

Suppose we have a world with 12 states defined by a 2d grid $\{1, \ldots, 4\} \times \{1, \ldots, 3\}$. One state is a wall that can't be entered, one is home, one is a pit of death, and one is where we start (see the figure below).

gridworld_example

Here we define actions up, down, left, right, where we can't take an action that would lead us off the board or into the wall. This gives us our set of available actions $A_s$ for each $s \in S$. This world is probabilistic, so for each action we have a $0.8$ chance of doing what you would expect the action to do, but a $0.1$ chance of taking each of the two actions perpendicular to it (if you can't take that action then you stay where you are). This defines our transition probabilities $P(s' \mid s, a)$. Lastly, if we make it home we get a point ($+1$) and if we die in the pit of death we lose a point ($-1$). All this data defines a Markov decision process, in fact a special kind of process.
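As a sketch, here is one way this model could be written in Python. The exact coordinates of the wall, home, pit and start are assumptions made for illustration (as is the reward of $0$ for ordinary squares, and home/pit not being treated as terminal); only the $4 \times 3$ grid, the $0.8$/$0.1$ action noise and the $\pm 1$ rewards come from the description above.

```python
# Sketch of the grid-world model. Layout below is an assumed example layout.
COLS, ROWS = 4, 3
WALL, HOME, PIT, START = (2, 2), (4, 3), (4, 2), (1, 1)   # assumed coordinates

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

# All grid squares except the wall are states.
states = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1) if (x, y) != WALL]

def intended(s, a):
    """Where action a would take us from s; stay put if it would leave the board or hit the wall."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in states else s

def actions(s):
    """A_s: only actions whose intended move stays on the board and out of the wall."""
    return [a for a in MOVES if intended(s, a) != s]

def P(s, a):
    """Transition distribution P(s' | s, a): 0.8 intended direction, 0.1 for each perpendicular slip."""
    dist = {}
    for direction, prob in [(a, 0.8), (PERPENDICULAR[a][0], 0.1), (PERPENDICULAR[a][1], 0.1)]:
        s2 = intended(s, direction)
        dist[s2] = dist.get(s2, 0.0) + prob
    return dist

def R(s):
    """Reward: +1 for making it home, -1 for the pit of death, 0 elsewhere (0 is an assumption)."""
    return {HOME: 1.0, PIT: -1.0}.get(s, 0.0)
```

The helpers `states`, `actions`, `P` and `R` are reused in the sketches further down.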

Gridworld

A grid world is any set of states that can be modelled as a 2d grid $\{1, \ldots, n\} \times \{1, \ldots, m\}$, where states can only transition to adjacent states in the grid system.


Given a Markov decision process we want to come up with a policy that determines what action we take in each state.

Assigning credit

In problems like the one above we only get a reward at the very end. Therefore, at the end of the game we need to work out which moves were good and which were bad. This is hard to do without better feedback.

Credit assignment problem

The credit assignment problem in Reinforcement learning refers to the challenge of properly attributing credit or responsibility to actions that led to a particular outcome.


Cumulative rewards

Suppose we now get a reward $R(s)$ for each state $s$, and the score of a run is the cumulative reward over all the states we have visited.

For example, in the grid world above, suppose that instead of only rewarding the pit of death and getting home we also provided a reward for every other square; this might influence what we want to do. If that reward were strongly positive we might want to stay out for longer. If it were strongly negative we might chance walking past the pit of death to get home sooner, or even jump in the pit of death!

Trouble with cumulative rewards and infinite time

Suppose our actions will lead to a sequence of rewards $r_t$ for $t \in \mathbb{N}$. If there exists some $c > 0$ such that $r_t \geq c$ for all $t$, then if we sum the rewards we get

$$\sum_{t=0}^{\infty} r_t = \infty.$$

Instead suppose there was another sequence of actions that leads to rewards $r'_t$ with $r'_t > r_t$ for all $t$. Then taking this second set of actions we also have

$$\sum_{t=0}^{\infty} r'_t = \infty.$$

So we will have no preference between these two sets of actions, even though it seems like the second set is strictly better. This is an issue, so we want to factor this out.

Discounted rewards

Discounted rewards are a way to evaluate a Markov decision process with infinitely many steps. Let $r_t$ for $t \in \mathbb{N}$ be a sequence of rewards for this decision process. Let $0 < \gamma < 1$ be some constant. Then we set the value of these choices as

$$V = \sum_{t=0}^{\infty} \gamma^t r_t.$$

If all $r_t$ are bounded in absolute value by some constant $R_{\max}$ then we have

$$|V| \leq \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma},$$

and this is finite.
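As a small numeric sketch (names and numbers purely illustrative): with a constant reward of $1$ and $\gamma = 0.9$, a long truncated discounted sum approaches the bound $R_{\max}/(1-\gamma) = 10$.

```python
# Sketch: a truncated discounted return, compared with the R_max / (1 - gamma) bound.
def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t over a (finite prefix of a) reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0] * 1000                      # constant reward r_t = R_max = 1 for a long time
print(discounted_return(rewards, gamma))    # ~10.0, approaching the bound from below
print(1.0 / (1.0 - gamma))                  # R_max / (1 - gamma) = 10.0
```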


Value iteration (MDP)

This is a method of solving Markov decision processes. It uses discounted rewards to find a solution that is not just locally but globally optimal.

We ideally would like to work out what the best policy is:

$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi \right].$$

To help us work out what $\pi^*$ should be, let's define the utility of a state

$$U(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi^*, s_0 = s \right].$$

Given we could calculate the above for a given state, we can then write down what the optimum strategy should be:

$$\pi^*(s) = \arg\max_{a \in A_s} \sum_{s'} P(s' \mid s, a) \, U(s').$$

This equation should concern you: to work out $\pi^*$ you would have to know $U$, which in turn depends on $\pi^*$, making it circular. However, using the above strategy we get a nice set of simultaneous-like equations via the following:

$$U(s) = R(s) + \gamma \max_{a \in A_s} \sum_{s'} P(s' \mid s, a) \, U(s').$$

The equation above is a Bellman equation. These equations are only simultaneous-like due to the max in them.

Pseudocode

We would like to solve this like a set of simultaneous equations, but that doesn't work due to the max. Instead we can iteratively calculate it by doing the following:

  1. Start with some guess at $U$ called $\hat{U}_0$.
  2. Then for $t \geq 0$ set

$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_{a \in A_s} \sum_{s'} P(s' \mid s, a) \, \hat{U}_t(s'),$$

and keep iterating until we converge to a stable answer.
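Here is a minimal sketch of this iteration in Python. It assumes the illustrative `states`, `actions`, `P` and `R` helpers from the grid-world sketch earlier (none of which are fixed by the notes) and a discount factor $\gamma$.

```python
# Sketch of value iteration: repeatedly apply the Bellman update until U stabilises.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}                                 # initial guess U_0
    while True:
        # Bellman update: U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_t(s')
        U_new = {
            s: R(s) + gamma * max(
                sum(p * U[s2] for s2, p in P(s, a).items()) for a in actions(s)
            )
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:      # stable answer reached
            return U_new
        U = U_new

def greedy_policy(states, actions, P, U):
    """Read off pi(s) = argmax_a sum_{s'} P(s'|s,a) U(s') from a set of utilities."""
    return {
        s: max(actions(s), key=lambda a: sum(p * U[s2] for s2, p in P(s, a).items()))
        for s in states
    }
```

For the grid world above, `greedy_policy(states, actions, P, value_iteration(states, actions, P, R))` then reads off a policy for every state.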


We only care about the policy

Whilst we might worry that convergence of the above takes infinite time, in reality we only care about when the policy based on these utilities stops changing. This might take much less time.
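As a sketch of that idea, reusing the illustrative helpers from above: run the Bellman updates but stop as soon as the greedy policy extracted from the utilities stops changing between iterations. This is a heuristic stopping rule, not a guarantee.

```python
# Sketch: value iteration that stops once the greedy policy stops changing,
# rather than waiting for the utilities themselves to fully converge.
# Note: stopping the first time the policy repeats is a heuristic, not a guarantee.
def value_iteration_until_policy_stable(states, actions, P, R, gamma=0.9):
    U = {s: 0.0 for s in states}
    pi = greedy_policy(states, actions, P, U)
    while True:
        U = {
            s: R(s) + gamma * max(
                sum(p * U[s2] for s2, p in P(s, a).items()) for a in actions(s)
            )
            for s in states
        }
        pi_new = greedy_policy(states, actions, P, U)
        if pi_new == pi:          # policy unchanged; utilities may still be moving
            return pi_new, U
        pi = pi_new
```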

Policy Iteration (MDP)

This is a method of solving Markov decision processes. It uses discounted rewards to find a solution that is not just locally but globally optimal.

It uses the same set-up as Value iteration (MDP), but instead of iterating the utilities directly we use the policy to determine the utilities and iterate the policy itself.

Pseudocode

Instead of iterating on the utility of a state, we look at the policy directly and use the utility to guide us.

  1. Start with a random policy $\pi_0$.
  2. For $t \geq 0$ do the following
    1. Calculate $U_t$ by solving $U_t(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_t(s)) \, U_t(s')$.
      1. This is now a genuine system of simultaneous equations as there is no max!
      2. Functionally we could solve this by iterating the update, as in value iteration, until we find a stable $U_t$.
    2. Set $\pi_{t+1}(s) = \arg\max_{a \in A_s} \sum_{s'} P(s' \mid s, a) \, U_t(s')$.

Then stop once the policy stops changing, i.e. once you have converged.
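A minimal sketch of this in Python, again using the illustrative grid-world helpers from earlier. For simplicity the policy-evaluation step here iterates the max-free update (as the notes suggest) rather than solving the linear system directly.

```python
import random

# Sketch of policy evaluation: U(s) = R(s) + gamma * sum_{s'} P(s'|s, pi(s)) U(s'), no max needed.
def evaluate_policy(states, P, R, pi, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        U_new = {
            s: R(s) + gamma * sum(p * U[s2] for s2, p in P(s, pi[s]).items())
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

# Sketch of policy iteration: evaluate the current policy, then greedily improve it.
def policy_iteration(states, actions, P, R, gamma=0.9):
    pi = {s: random.choice(actions(s)) for s in states}          # random initial policy pi_0
    while True:
        U = evaluate_policy(states, P, R, pi, gamma)
        # Improvement step: pi_{t+1}(s) = argmax_a sum_{s'} P(s'|s,a) U(s')
        pi_new = {
            s: max(actions(s), key=lambda a: sum(p * U[s2] for s2, p in P(s, a).items()))
            for s in states
        }
        if pi_new == pi:                                         # stop once the policy stops changing
            return pi, U
        pi = pi_new
```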
