Week 1 - Smoov & Curly’s Bogus Journey

Review

The first part of this course is a review of:

Week 12 - Reinforcement learning

There is a change of notation in this course relative to the referenced lecture. Instead of using $U(s)$ for the utility, we use $V(s)$ for the value. Instead of considering the reward function as $R(s)$, we consider it as $R(s, a)$, i.e. the reward takes into account the action you have taken. Restated in this notation, the Bellman equation is as below.

$$V(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \right)$$

A reminder of the notation below:

  • Discount factor: $\gamma$, with $0 \leq \gamma < 1$.
  • Transition probability: given you are in state $s$ and you take action $a$, $T(s, a, s')$ is the probability you end up in state $s'$.
  • States: $S$ is the set of all states.
  • Actions: $A$ is the set of all actions; it can depend on the state, therefore we talk about $A(s)$ for the actions available in state $s$.
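
As a concrete illustration of the Bellman equation above, the following value-iteration sketch repeatedly applies the backup $V(s) \leftarrow \max_a \big( R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \big)$. The data layout is an assumption made purely for this example (not fixed by the course): `T[s][a]` maps next states to probabilities, `R[s][a]` is a scalar, and `actions(s)` returns $A(s)$.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Illustrative sketch (assumed layout): T[s][a] maps next state -> probability,
    R[s][a] is the immediate reward, actions(s) returns the actions available in s."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: V(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```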

Quality

Within the Bellman equation, if we take what is within the brackets and define it as a new function $Q(s, a)$, the quality of taking action $a$ in state $s$, we then derive the next set of equations.

$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$$

$$V(s) = \max_a Q(s, a)$$
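
When $T$ and $R$ are known, the $Q$-form can likewise be solved by fixed-point iteration. A minimal sketch, reusing the hypothetical data layout from the value-iteration example above:

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Illustrative sketch: iterate the Q-form of the Bellman equation to a fixed point."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    while True:
        delta = 0.0
        for s in states:
            for a in actions(s):
                # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')
                new = R[s][a] + gamma * sum(
                    # default=0.0 handles terminal states with no actions
                    p * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0)
                    for s2, p in T[s][a].items()
                )
                delta = max(delta, abs(new - Q[(s, a)]))
                Q[(s, a)] = new
        if delta < tol:
            return Q
```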

The motivation for doing this will come later; however, intuitively this form will be more useful when you do not have access to $T$ and $R$ directly. Instead you can only sample ‘experience data’.
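
To make the ‘experience data’ point concrete, here is a minimal tabular Q-learning sketch. The environment interface (`env.reset()`, `env.step(a)` returning `(reward, next_state, done)`, `env.actions(s)`) and the hyperparameters are assumptions for illustration only; the key point is that the update uses sampled transitions $(s, a, r, s')$ and never touches $T$ or $R$ directly.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Illustrative sketch: learn Q from sampled (s, a, r, s') transitions only."""
    Q = defaultdict(float)  # Q[(s, a)], defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda a_: Q[(s, a_)])
            r, s2, done = env.step(a)  # one sampled experience tuple
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```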

Continuations

We can apply a similar trick to derive a 3rd form of the Bellman equation: this time we just set $C(s, a)$ (the continuation) to be the summation within the definition of $Q(s, a)$,

$$C(s, a) = \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a'),$$

so that $Q(s, a) = R(s, a) + C(s, a)$. Substituting this back in gives a recursion in $C$ alone:

$$C(s, a) = \gamma \sum_{s'} T(s, a, s') \max_{a'} \left( R(s', a') + C(s', a') \right)$$

Each of these will enable us to do reinforcement learning in different circumstances - but notice how they relate to one another.

If we find $V$, we need to know both the transition probabilities and the reward to derive either $Q$ or $C$. However, $Q$ and $C$ have a nice property: from $Q$ we only need to know the transition probabilities to find $V$ and $C$, whereas from $C$ we only need to know the reward to determine $V$ and $Q$.
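
Written out explicitly (these conversions follow directly from the definitions above rather than being quoted from the lecture), note which ingredient each one needs:

$$V(s) = \max_a Q(s, a), \qquad C(s, a) = \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a') \quad \text{(from } Q\text{: only } T \text{ needed)}$$

$$Q(s, a) = R(s, a) + C(s, a), \qquad V(s) = \max_a \left( R(s, a) + C(s, a) \right) \quad \text{(from } C\text{: only } R \text{ needed)}$$

$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s'), \qquad C(s, a) = \gamma \sum_{s'} T(s, a, s')\, V(s') \quad \text{(from } V\text{: } T \text{ and } R \text{ needed)}$$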