Week 12 - Reinforcement learning

Transitions (MDP)


When presented with a learning problem you are not provided with the Markov decision process; instead you are provided with transitions, which are the outcomes of previous actions - i.e. tuples $(s, a, r, s')$ where $s$ is the start state, $a$ is the action taken, $r$ is the reward received and $s'$ is the state you ended up in. These can either be pre-computed, or, given a state, the learner can provide an action to find out what the resulting transition is.

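To make this concrete, here is a minimal Python sketch of a transition tuple and of the two ways transitions can arrive; the `Transition` type, the `env.step` interface and all names are illustrative assumptions rather than anything fixed by the material.

```python
from collections import namedtuple

# A transition (s, a, r, s'): start state, action taken, reward received, next state.
Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

# Pre-computed transitions, e.g. logged from earlier interaction with the environment.
batch = [
    Transition(s=0, a="right", r=0.0, s_next=1),
    Transition(s=1, a="right", r=1.0, s_next=2),
]

# Alternatively, the learner supplies an action for its current state and observes
# the resulting transition from a (hypothetical) environment object.
def query(env, s, a):
    r, s_next = env.step(s, a)  # assumed interface: step(state, action) -> (reward, next state)
    return Transition(s, a, r, s_next)
```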

Reinforcement learning


This is a branch of Machine Learning applied in a setting where a learner has a number of actions to take in a given state - each action will lead to a different reward and we want to maximise the reward obtained. You are provided with previous transitions and want to build a policy in the sense of a Markov decision process.


There are 4 kinds of process we think about:

  • Planning: a model goes in and a policy comes out.
  • Learning: transitions go in and a policy comes out.
  • Modelling: transitions go in and a model comes out.
  • Simulating: a model goes in and transitions come out.

We can compose these processes in two major ways.

The first uses a simulator and a learner to produce an RL-based planner. This method gave one of the first major successes of RL, a competitive backgammon player (TD-Gammon).

Model-based reinforcement learning


Model-based reinforcement learning is a method of doing reinforcement learning where you use the provided transitions to build your model of the problem then use a planner such as Value iteration (MDP) or Policy Iteration (MDP) to produce a policy.

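A minimal sketch of the modelling step, assuming tabular states and actions and estimating the model by simple counts and averages over a batch of `(s, a, r, s_next)` tuples (all names are illustrative):

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate a transition function T(s, a, s') and reward function R(s, a)
    from a list of (s, a, r, s_next) tuples by counting and averaging."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    reward_sum = defaultdict(float)                  # (s, a) -> total reward seen
    visits = defaultdict(int)                        # (s, a) -> number of visits

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1

    T = {sa: {s2: c / visits[sa] for s2, c in nxt.items()} for sa, nxt in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return T, R
```

Feeding `T` and `R` into a planner such as value iteration or policy iteration then produces a policy without any further interaction with the environment.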

Approaches to Reinforcement learning

There are 3 different approaches to RL:

  • Policy search: Here you iterate through different policies to find the one that best fits the data.
  • Value-function based: Here you use your example transitions to generate a utility function which maps states to values, which, as in Policy Iteration (MDP), you can use to make a policy.
  • Model-based reinforcement learning: You use the transitions to generate a model, which consists of a transition function and a reward function.

Value-function based

When using Value iteration (MDP) we look for the optimum utility function

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') U(s').$$

Whereas for Policy Iteration (MDP) we look for the optimum policy given a utility function

$$\pi(s) = \operatorname{argmax}_{a} \sum_{s'} T(s, a, s') U(s').$$

We would like a way to incorporate both approaches.
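As a reference point, here is a small sketch of both computations above, value iteration for $U$ followed by greedy policy extraction, assuming a tabular model with `T[(s, a)]` mapping next states to probabilities and rewards given per state as `R[s]`; all names are illustrative:

```python
def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Repeatedly apply U(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                                   for a in actions)
             for s in states}
    return U

def greedy_policy(states, actions, T, U):
    """pi(s) = argmax_a sum_{s'} T(s, a, s') U(s')."""
    return {s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
            for s in states}
```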

Q-function (RL)


Suppose we are in a Markov decision process and using discounted rewards. Define the Q-function $Q(s, a)$ as the solution to the following equations

$$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a').$$


The Q-function (RL) has nice properties as

$$U(s) = \max_{a} Q(s, a)$$

from the definition of $U$. But also we can write

$$\pi(s) = \operatorname{argmax}_{a} Q(s, a)$$

as both $R(s)$ and $\gamma$ don't affect the $\operatorname{argmax}$.
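These identities can be sketched in code as well: iterate the defining equation of $Q$ directly, then read off $U$ and $\pi$ (again a tabular, illustrative sketch using the same `T`/`R` format as above):

```python
def q_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Iterate Q(s, a) = R(s) + gamma * sum_{s'} T(s,a,s') * max_{a'} Q(s', a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q = {(s, a): R[s] + gamma * sum(p * max(Q[(s2, a2)] for a2 in actions)
                                        for s2, p in T[(s, a)].items())
             for s in states for a in actions}
    return Q

def utility_and_policy(Q, states, actions):
    """U(s) = max_a Q(s, a)  and  pi(s) = argmax_a Q(s, a)."""
    U = {s: max(Q[(s, a)] for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return U, pi
```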

Incremental learning


Suppose we have a sequence of learning rates $(\alpha_t)_{t \in \mathbb{N}}$. Then a value $V$ incrementally learns some random variable $X$ if for $t \in \mathbb{N}$, with samples $x_t$ drawn from $X$, we set

$$V_{t+1} = (1 - \alpha_t) V_t + \alpha_t x_t,$$

then let $V = \lim_{t \to \infty} V_t$. This is denoted $V \xleftarrow{\alpha_t} x_t$.


If we have a sequence $V \xleftarrow{\alpha_t} x_t$ where

$$\sum_{t} \alpha_t = \infty \quad \text{and} \quad \sum_{t} \alpha_t^2 < \infty,$$

with $X$ a random variable and the samples $x_t$ all i.i.d. draws from $X$, then $V_t \to \mathbb{E}[X]$.
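A tiny sketch of this fact, incrementally estimating the mean of a random variable with the learning rates $\alpha_t = 1/t$, which satisfy both conditions:

```python
import random

def incremental_mean(samples):
    """V_{t+1} = (1 - alpha_t) V_t + alpha_t x_t with alpha_t = 1/t."""
    V = 0.0
    for t, x in enumerate(samples, start=1):
        alpha = 1.0 / t
        V = (1 - alpha) * V + alpha * x
    return V

# i.i.d. samples of X ~ Uniform(0, 1); the estimate approaches E[X] = 0.5.
print(incremental_mean(random.random() for _ in range(100_000)))
```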

Q-learning


Q-learning is a class of reinforcement learning algorithms which are value-function based. It uses the approach of Incremental learning of the Q-function (RL). We use the model of transitions where the learner can provide the action at each iteration.

Pick an initial estimate $\hat{Q}$ of the Q-function (RL).

We need to pick learning rates $\alpha_t$ such that

$$\sum_{t} \alpha_t = \infty \quad \text{and} \quad \sum_{t} \alpha_t^2 < \infty.$$

Lastly, pick how we will choose an action $a_t$ for a given state $s_t$.

Then we incrementally learn from our choices by

$$\hat{Q}(s_t, a_t) \xleftarrow{\alpha_t} r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a').$$

Note that as $t$ changes we switch which state we look at, and will choose different actions.
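Putting the pieces together, here is a minimal tabular Q-learning sketch. The environment interface (`env.reset()` and `env.step(a)` returning `(next state, reward, done)`) and all names are assumptions made for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                 # Q[(s, a)], initialised to zero
    visits = defaultdict(int)              # per-(s, a) visit counts used for the learning rate
    for _ in range(episodes):
        s = env.reset()                    # assumed interface: reset() -> initial state
        done = False
        while not done:
            # epsilon-greedy choice of action (see "Choosing an action" below)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: Q[(s, a)])
            s_next, r, done = env.step(a)  # assumed interface: step(a) -> (next state, reward, done)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]   # per-(s, a): sum alpha = inf, sum alpha^2 < inf
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```

Using one-over-visit-count learning rates is one simple way to satisfy the conditions on $\alpha_t$ for each state-action pair.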

Correctness

There is a theorem that states that for a Markov decision process, if we apply Q-learning where every state-action pair is visited infinitely often, the next states are sampled using the transition probabilities and the rewards are drawn from the correct distributions, then $\hat{Q} \to Q$ and Q-learning converges correctly.


Choosing an action

The two main ideas for choosing actions are:

  • choose randomly, and
  • choose the best action.

The trouble with the first is that you never use what you are learning, so it will be slow to explore the part of the space you care about. The trouble with the second is that you might not explore routes that are better than your current guess.

Explore exploit dilemma


In Machine Learning there is a dilemma when training a model. You can explore the state space more, which can waste time in irrelevant places, or you can exploit what you know more, which can get you stuck in a rut, not finding new ways to do things. This same dilemma applies far beyond machine learning, into literally everyone's daily lives.


You want to pick a balance between the two of them.

Epsilon-greedy exploration


$\epsilon$-greedy exploration is a way of choosing actions in Q-learning. For a sequence $(\epsilon_t)_{t \in \mathbb{N}}$ with $\epsilon_t \in [0, 1]$, at time step $t$ you choose action

$$a_t = \begin{cases} \operatorname{argmax}_{a} \hat{Q}(s_t, a) & \text{with probability } 1 - \epsilon_t, \\ \text{a uniformly random action} & \text{with probability } \epsilon_t. \end{cases}$$


In the case where you really do explore infinitely, if $\epsilon_t \to 0$ then $\hat{Q} \to Q$ and $\hat{\pi} \to \pi^*$.
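A small sketch of $\epsilon$-greedy action selection with one possible decaying schedule for $\epsilon_t$ (the names and the schedule are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, t, eps0=1.0):
    """Pick argmax_a Q(s, a) with probability 1 - eps_t, otherwise a uniformly random action."""
    eps_t = eps0 / (1 + t)                 # one possible schedule with eps_t -> 0
    if random.random() < eps_t:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```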

Optimistic exploration


Optimistic exploration is a way of choosing actions in Q-learning. Here we set the initial values of our $\hat{Q}$ to all be very high and we always pick the action that maximises $\hat{Q}(s_t, a)$. Then, under uncertainty, it will explore actions it does not know about.

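A sketch of the optimistic initialisation step, assuming rewards are bounded above by some known `r_max` (a hypothetical bound):

```python
def optimistic_q(states, actions, r_max=1.0, gamma=0.9):
    """Initialise every Q(s, a) at an upper bound on the discounted return,
    so untried actions look attractive until they are actually explored."""
    upper_bound = r_max / (1 - gamma)
    return {(s, a): upper_bound for s in states for a in actions}
```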