Policy Iteration (MDP)

This is a method for solving Markov decision processes. With discounted rewards, a policy that cannot be improved by any single-state change is also globally optimal, so the method converges to an optimal policy rather than getting stuck at a merely local one.

It uses the same set-up as Value iteration (MDP), but instead of iterating on the utilities directly, we fix a policy, compute the utility that policy induces, and then iterate on the policy.

Pseudocode

Instead of iterating on the utility of each state, we iterate on the policy itself and use the utility only to guide each improvement step.

  1. Start with a random policy \(\pi_0\).
  2. For \(t = 0, 1, 2, \ldots\) do the following
    1. Calculate \(U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s')\, U_t(s')\) for every state \(s\).
      1. This is now a system of linear simultaneous equations as there is no max!
      2. Functionally we solve this either directly as a linear system, or by running value iteration with the policy held fixed until \(U_t\) is stable.
    2. Set \(\pi_{t+1}(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U_t(s')\).

Then stop once the policy no longer changes, i.e. \(\pi_{t+1} = \pi_t\). Since there are only finitely many deterministic policies and each iteration improves on the last, this happens after finitely many steps.
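The loop above can be sketched as follows. This is a minimal illustration on a made-up two-state, two-action MDP (the transition matrix `T`, reward vector `R`, and discount `gamma` are invented for the example, not from the text); the evaluation step solves the linear system directly rather than running value iteration.

```python
import numpy as np

# Hypothetical tiny MDP: T[s, a, s'] = transition probability,
# R[s] = reward for being in state s, gamma = discount factor.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1 under actions 0, 1
])
R = np.array([0.0, 1.0])
gamma = 0.9

def policy_iteration(T, R, gamma):
    n_states = T.shape[0]
    policy = np.zeros(n_states, dtype=int)        # step 1: start with an arbitrary policy
    while True:
        # Step 2.1 (policy evaluation): with the policy fixed there is no max,
        # so U = R + gamma * T_pi U is linear and can be solved exactly.
        T_pi = T[np.arange(n_states), policy]     # transitions under the current policy
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Step 2.2 (policy improvement): greedy one-step lookahead on U.
        Q = R[:, None] + gamma * (T @ U)          # Q[s, a] for every state-action pair
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):    # stop: the policy no longer changes
            return policy, U
        policy = new_policy

policy, U = policy_iteration(T, R, gamma)
print("policy:", policy, "utilities:", U)
```

Solving the linear system makes each evaluation exact in one step, at the cost of an \(O(n^3)\) solve; for large state spaces, running a few sweeps of value iteration with the policy fixed is the cheaper alternative the pseudocode mentions.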