Value iteration (MDP)

This is a method of solving Markov decision processes. It uses discounted rewards, which make the Bellman update a contraction, so the iteration converges to a unique fixed point: any locally stable solution is also the global optimum.

We ideally would like to work out what the best policy $\pi^*$ is.

To help us work out what $\pi^*$ should be, let's define the utility of a state as the expected discounted reward received when starting in $s$ and acting optimally thereafter:

$$U(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s,\ \pi^{*}\right]$$
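As a quick concrete illustration of the discounted sum, here is the return of a short, made-up reward sequence (both the rewards and the discount factor are invented for the example):

```python
gamma = 0.9                       # discount factor, 0 < gamma < 1
rewards = [0.0, 0.0, 1.0, 1.0]    # hypothetical rewards R(s_0)..R(s_3)

# Discounted return: sum over t of gamma^t * R(s_t)
utility = sum(gamma**t * r for t, r in enumerate(rewards))
print(utility)                    # 0.9**2 + 0.9**3 = 1.539
```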

Given we could calculate the above for every state, we can write down what the optimum strategy should be:

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, U(s')$$

where $T(s,a,s')$ is the probability of ending up in state $s'$ when taking action $a$ in state $s$.
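As an illustrative sketch (not any particular library's API), suppose the transition model is a nested dictionary `T[s][a]` holding `(probability, next_state)` pairs and the utilities live in a table `U`; extracting the optimal policy is then an argmax per state:

```python
def extract_policy(T, U):
    """Given a transition model T[s][a] -> [(prob, next_state), ...]
    and a utility table U[s], pick in each state the action with the
    highest expected utility of the successor state."""
    policy = {}
    for s, actions in T.items():
        policy[s] = max(
            actions,  # iterate over the action names available in s
            key=lambda a: sum(p * U[s2] for p, s2 in actions[a]),
        )
    return policy
```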

This equation should concern you: to work out $\pi^{*}$ you would have to know $U$, which in turn depends on $\pi^{*}$, making the definition circular. However, combining the two definitions gives a nice set of simultaneous-like equations, one per state:

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s,a,s')\, U(s')$$

The equation above is a Bellman equation, and there is one such equation per state. They are only simultaneous-like, rather than true simultaneous equations, because the max makes them non-linear.
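To make the right-hand side concrete, here is a sketch of a single Bellman backup at one state, assuming the same hypothetical dictionary representation of $T$, plus a reward table `R` and a discount factor `gamma`:

```python
def bellman_backup(s, T, R, U, gamma):
    """One application of the Bellman equation at state s:
    R(s) plus gamma times the expected utility of the best action."""
    best = max(
        sum(p * U[s2] for p, s2 in T[s][a])  # sum_s' T(s,a,s') * U(s')
        for a in T[s]                        # max over actions a
    )
    return R[s] + gamma * best
```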

Pseudocode

We would like to solve them like a set of simultaneous equations, but that doesn't work because of the max. Instead, we can iteratively calculate the solution as follows:

  1. Start with some guess at $U$, called $\hat{U}_0$.
  2. Then, for $t = 0, 1, 2, \ldots$, set
$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s,a,s')\, \hat{U}_{t}(s')$$

Keep iterating until $\hat{U}$ converges to a stable answer; a sketch of the whole loop follows.
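Putting the steps together, here is a minimal self-contained sketch in Python. The toy two-state MDP, the dictionary representation of `T` and `R`, and the convergence tolerance `tol` are all made up for illustration:

```python
def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Iterate U_{t+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U_t(s')
    until the largest change in any state falls below tol."""
    U = {s: 0.0 for s in T}                 # step 1: initial guess U_hat_0
    while True:
        U_new = {
            s: R[s] + gamma * max(          # step 2: Bellman update
                sum(p * U[s2] for p, s2 in T[s][a])
                for a in T[s]
            )
            for s in T
        }
        if max(abs(U_new[s] - U[s]) for s in T) < tol:
            return U_new                    # converged to a stable answer
        U = U_new

# Toy two-state MDP: 'move' usually switches state, 'stay' does not.
T = {
    'a': {'stay': [(1.0, 'a')], 'move': [(0.8, 'b'), (0.2, 'a')]},
    'b': {'stay': [(1.0, 'b')], 'move': [(0.8, 'a'), (0.2, 'b')]},
}
R = {'a': 0.0, 'b': 1.0}
print(value_iteration(T, R))  # converges to roughly {'a': 8.78, 'b': 10.0}
```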