Policy Iteration (MDP)

This is a method for solving Markov decision processes. With discounted rewards, a policy that cannot be improved by any single-state change is also globally optimal, so the method converges to an optimal policy rather than getting stuck at a merely local one.

It uses the same set-up as Value iteration (MDP), but instead of iterating on the utilities directly, we fix a policy, compute the utility that policy induces, and then iterate on the policy.

Pseudocode

Instead of iterating on the utility of each state, we iterate on the policy itself and use the utility only to guide each improvement step.

  1. Start with a random policy \(\pi_0\).
  2. For \(t = 0, 1, 2, \ldots\) do the following
    1. Calculate \(U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s')\, U_t(s')\) for every state \(s\).
      1. This is now a system of linear simultaneous equations as there is no max!
      2. Functionally we solve this either directly as a linear system, or by running value iteration with the policy held fixed until \(U_t\) is stable.
    2. Set \(\pi_{t+1}(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U_t(s')\).

Then stop once the policy no longer changes, i.e. \(\pi_{t+1} = \pi_t\). Since there are only finitely many deterministic policies and each iteration improves on the last, this happens after finitely many steps.
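The loop above can be sketched as follows. This is a minimal illustration on a made-up two-state, two-action MDP (the transition matrix `T`, reward vector `R`, and discount `gamma` are invented for the example, not from the text); the evaluation step solves the linear system directly rather than running value iteration.

```python
import numpy as np

# Hypothetical tiny MDP: T[s, a, s'] = transition probability,
# R[s] = reward for being in state s, gamma = discount factor.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1 under actions 0, 1
])
R = np.array([0.0, 1.0])
gamma = 0.9

def policy_iteration(T, R, gamma):
    n_states = T.shape[0]
    policy = np.zeros(n_states, dtype=int)        # step 1: start with an arbitrary policy
    while True:
        # Step 2.1 (policy evaluation): with the policy fixed there is no max,
        # so U = R + gamma * T_pi U is linear and can be solved exactly.
        T_pi = T[np.arange(n_states), policy]     # transitions under the current policy
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Step 2.2 (policy improvement): greedy one-step lookahead on U.
        Q = R[:, None] + gamma * (T @ U)          # Q[s, a] for every state-action pair
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):    # stop: the policy no longer changes
            return policy, U
        policy = new_policy

policy, U = policy_iteration(T, R, gamma)
print("policy:", policy, "utilities:", U)
```

Solving the linear system makes each evaluation exact in one step, at the cost of an \(O(n^3)\) solve; for large state spaces, running a few sweeps of value iteration with the policy fixed is the cheaper alternative the pseudocode mentions.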