Value iteration (MDP)
This is a method for solving Markov decision processes. With discounted rewards it is guaranteed to converge to the globally optimal value function, not merely a local optimum.
We would ideally like to work out the best policy π*(s): the action to take in each state.

To help us work this out, we first define the utility of a state as the immediate reward plus the discounted utility of what follows under the best action:

U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')

where R(s) is the reward for being in state s, γ is the discount factor, and T(s, a, s') is the probability of moving from s to s' when taking action a.

Given we could calculate U(s) for every state, we can write down what the optimum strategy should be: simply pick the action with the best expected utility,

π*(s) = argmax_a Σ_{s'} T(s, a, s') U(s')

This equation should concern you: to work out U(s) you need the utilities U(s') of the successor states, and those are themselves defined in terms of U(s).
The equation above is a Bellman equation. There is one per state, so they look like a set of simultaneous equations, but the max inside makes them nonlinear.
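To make the Bellman equation concrete, here is a single backup for one state on a tiny hypothetical MDP (the state names, rewards, transition probabilities, and γ = 0.9 below are all made-up illustration values, not from the text):

```python
# One Bellman backup: U(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') * U(s')
R = {"s0": 0.0, "s1": 1.0}          # immediate rewards (assumed values)
gamma = 0.9                          # discount factor (assumed)
T = {                                # T[s][a][s2] = transition probability
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
U = {"s0": 0.0, "s1": 0.0}           # current utility estimates

def bellman_backup(s):
    # Expected utility of each action, then take the best one.
    return R[s] + gamma * max(
        sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
    )

print(bellman_backup("s1"))  # 1.0: with U all zeros, only the immediate reward counts
```

Note the `max` over actions: it is exactly what stops us treating the per-state equations as a solvable linear system.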
Pseudocode
We would like to solve them like a set of simultaneous linear equations, but that doesn't work because of the max. Instead we can calculate the solution iteratively by doing the following:
- Start with some guess at the utilities, called U_0(s) (e.g. all zeros).
- Then for each state s, set

  U_{t+1}(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U_t(s')

- Keep iterating until we converge to a stable answer.
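The iteration above can be sketched in Python. The MDP here (states, rewards, transitions, γ, and the tolerance) is a made-up toy example, and "converged" is taken to mean the largest change in any utility falls below a small tolerance:

```python
# Value iteration on a small hypothetical 2-state MDP (all numbers assumed).
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

def value_iteration(tol=1e-6):
    U = {s: 0.0 for s in T}          # U_0: start with a guess of all zeros
    while True:
        # U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') * U_t(s')
        U_new = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s]
            )
            for s in T
        }
        # Stop once no utility changes by more than tol.
        if max(abs(U_new[s] - U[s]) for s in T) < tol:
            return U_new
        U = U_new

U = value_iteration()
# Once U is stable, read off the policy: pi*(s) = argmax_a sum_{s'} T(s,a,s') U(s')
policy = {
    s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
    for s in T
}
print(U, policy)
```

On this toy MDP the loop settles on "go" from s0 (towards the rewarding state) and "stay" in s1, which matches the argmax policy-extraction step described earlier.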