Suppose we have a world with 12 states laid out on a 2d grid, where one state can't be entered, one is home, one is a pit of death, and one is where we start.
Here we define actions up, down, left, right, where we can't take an action that would lead us off the board or into the blocked state. This gives us our set of available actions $A(s)$ for each state $s$.
This world is probabilistic: for each action there is a high chance of doing what you would expect the action to do, but a small chance of instead taking an action perpendicular to it (if you can't take that action then you stay where you are). This defines our transition probabilities $P(s' \mid s, a)$.
Lastly, if we make it home we get a point ($+1$), and if we die in the pit of death we lose a point ($-1$).
All this data defines a Markov decision process, in fact a special kind of one: a gridworld.
Gridworld
A gridworld is any Markov decision process whose states can be modelled as the cells $(i, j)$ of a 2d grid, where states can only transition to adjacent states in the grid.
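To make this concrete, here is a minimal sketch of such a gridworld in Python. The particular layout (which cells are the wall, home, pit, and start) and the $0.8$ / $0.1$ slip probabilities are illustrative assumptions.

```python
# A minimal sketch of the 12-state gridworld as an MDP.
# The layout (wall, home, pit, start cells) and the 0.8 / 0.1 slip
# probabilities are illustrative assumptions.

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

WIDTH, HEIGHT = 4, 3
WALL = (2, 2)    # assumed: the state that can't be entered
HOME = (4, 3)    # assumed: reward +1
PIT = (4, 2)     # assumed: reward -1
START = (1, 1)   # assumed: where we start

STATES = [(x, y) for x in range(1, WIDTH + 1)
          for y in range(1, HEIGHT + 1) if (x, y) != WALL]


def move(state, action):
    """Where the action would take us; off the board or into the wall means staying put."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transition(state, action):
    """P(s' | s, a): intended direction with probability 0.8,
    each perpendicular direction with probability 0.1 (assumed values)."""
    probs = {}
    for a, p in [(action, 0.8)] + [(perp, 0.1) for perp in PERPENDICULAR[action]]:
        nxt = move(state, a)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def reward(state):
    """R(s): +1 at home, -1 in the pit of death, 0 everywhere else."""
    return {HOME: 1.0, PIT: -1.0}.get(state, 0.0)
```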
Given a Markov decision process we want to come up with a policy $\pi(s)$ that determines what action we take in each state $s$.
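Concretely, a policy is just a lookup table from states to actions; the entries below are arbitrary placeholders rather than an optimal policy.

```python
# A policy maps each state to an action.
# These particular entries are placeholders, not the optimal policy.
policy = {
    (1, 1): "up",
    (1, 2): "up",
    (1, 3): "right",
    # ... one entry for every other state
}

action = policy[(1, 1)]  # the action we take in state (1, 1)
```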
Assigning credit
In problems like the above we only get the reward at the end. Therefore at the end of the game we need to know which moves were good, and which were bad. This is hard to do without better feedback.
Credit assignment problem
The credit assignment problem in reinforcement learning refers to the challenge of properly attributing credit or responsibility to the actions that led to a particular outcome.
Suppose we now get a reward $R(s)$ for each state $s$, and the score of a run is the cumulative reward of all the states we have visited.
For example, in the world above, if instead of only rewarding the pit of death and getting home we also provided a reward for all the other squares, it might influence what we want to do. If that reward were strongly positive we might want to stay out for longer. If it were strongly negative we might chance walking past the pit of death to get home sooner, or even jump in the pit of death!
Trouble with cumulative rewards and infinite time
Suppose our actions lead to a sequence of rewards $r_t$ for $t \in \mathbb{N}$. If there exists some $\epsilon > 0$ such that $r_t \geq \epsilon$ for all $t$, then if we sum the rewards we get
$$\sum_{t=0}^{\infty} r_t = \infty.$$
Instead suppose there was another sequence of actions that leads to rewards $r'_t$ with $r'_t > r_t$ for all $t$. Then taking this second set of actions we still have
$$\sum_{t=0}^{\infty} r'_t = \infty.$$
So we will have no preference between these two sets of actions, even though the second set is strictly better. This is an issue, so we want to factor this out.
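To make this concrete, consider two constant reward sequences:
$$r_t = 1 \text{ for all } t \;\Rightarrow\; \sum_{t=0}^{\infty} r_t = \infty, \qquad r'_t = 2 \text{ for all } t \;\Rightarrow\; \sum_{t=0}^{\infty} r'_t = \infty.$$
The plain cumulative reward cannot tell these apart, even though the second sequence earns twice as much at every step.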
Discounted rewards
Discounted rewards are a way to evaluate a Markov decision process with infinitely many steps. Let $r_t$ for $t \in \mathbb{N}$ be a sequence of rewards for this decision process, and let $0 < \gamma < 1$ be some constant. Then we set the value of these choices as
$$V = \sum_{t=0}^{\infty} \gamma^t r_t.$$
If all $|r_t|$ are bounded by some constant $M$ then we have
$$\left| \sum_{t=0}^{\infty} \gamma^t r_t \right| \leq \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1-\gamma},$$
and this is finite.
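Returning to the two constant sequences above, with a discount factor of, say, $\gamma = 0.9$ we get
$$\sum_{t=0}^{\infty} \gamma^t \cdot 1 = \frac{1}{1-\gamma} = 10, \qquad \sum_{t=0}^{\infty} \gamma^t \cdot 2 = \frac{2}{1-\gamma} = 20,$$
so both values are finite and the better sequence now correctly gets the higher value.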
We ideally would like to work out what the best policy $\pi^*$ is, i.e. the policy that maximises the expected discounted reward.
To help us work out what $\pi^*$ should be, let's define the utility of a state
$$U(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi^*,\, s_0 = s\right].$$
Given we could calculate the above for every state, we can write down what the optimum strategy should be:
$$\pi^*(s) = \operatorname*{argmax}_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s').$$
This equation should concern you, as to work out $\pi^*$ you would have to know $U$, which in turn depends on $\pi^*$, making it circular. However, using the above strategy we get a nice set of simultaneous-like equations:
$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s').$$
The above is a Bellman equation. These equations are only simultaneous-like because of the max in them.
Pseudocode
We would like to solve these like a set of simultaneous equations, but that doesn't work due to the max. Instead we can iteratively calculate the solution by doing the following:
Start with some guess at $U$, called $\hat{U}_0$.
Then for $t \geq 0$ and every state $s$ set
$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, \hat{U}_t(s'),$$
keep iterating until we converge to a stable answer.
Whilst we might worry that convergence of the above takes infinite time, in reality we only care about the point at which the policy based on these utilities stops changing. This can happen much sooner.
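This procedure is known as value iteration. Below is a minimal sketch of it in Python, written generically over the states, actions, transition probabilities and rewards; the function names, discount factor and tolerance are illustrative choices, and the usage comment at the end assumes the gridworld model sketched earlier.

```python
def value_iteration(states, actions, transition, reward,
                    gamma=0.9, terminals=frozenset(), tol=1e-6):
    """Iterate U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_t(s')
    until the utilities stop changing (numerically)."""
    U = {s: 0.0 for s in states}                 # initial guess U_0
    while True:
        U_next = {}
        for s in states:
            if s in terminals:
                U_next[s] = reward(s)            # absorbing states just keep their reward
                continue
            best = max(
                sum(p * U[s2] for s2, p in transition(s, a).items())
                for a in actions
            )
            U_next[s] = reward(s) + gamma * best
        if max(abs(U_next[s] - U[s]) for s in states) < tol:
            return U_next
        U = U_next


def greedy_policy(U, states, actions, transition, terminals=frozenset()):
    """Extract pi(s) = argmax_a sum_{s'} P(s'|s,a) U(s')."""
    return {
        s: max(actions,
               key=lambda a: sum(p * U[s2] for s2, p in transition(s, a).items()))
        for s in states if s not in terminals
    }


# For example, with the gridworld model sketched earlier:
#   U = value_iteration(STATES, ACTIONS, transition, reward,
#                       terminals={HOME, PIT})
#   pi = greedy_policy(U, STATES, ACTIONS, transition, terminals={HOME, PIT})
```

Treating home and the pit as absorbing terminal states here reflects the idea that the run ends once we reach them; without that, the iteration would keep collecting their rewards forever.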