Week 12 - Reinforcement learning
Transitions (MDP)
When presented with a learning problem you are not provided with the Markov decision process; instead you are provided with transitions, which are the outcomes of previous actions - i.e. tuples $(s, a, r, s')$
where $s$ is the start state, $a$ is the action taken, $r$ is the reward received and $s'$ is the state you ended up in. These can either be pre-computed, or, given a state, the learner can provide an action to find out what the transition is.
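For concreteness, a transition could be stored as a simple record; the field names below are only illustrative, not anything fixed by these notes.

```python
from typing import NamedTuple

class Transition(NamedTuple):
    """One observed step (s, a, r, s') of the environment."""
    state: int       # s: the start state
    action: int      # a: the action taken
    reward: float    # r: the reward received
    next_state: int  # s': the state ended up in

# e.g. taking action 1 in state 0 paid reward -0.04 and led to state 2
sample = Transition(state=0, action=1, reward=-0.04, next_state=2)
```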
Reinforcement learning
This is a branch of Machine Learning applied in a setting where a learner has a number of actions to take in a given state - each action leads to a different reward, and we want to maximise the reward. You are provided with previous transitions and want to build a policy in the sense of a Markov decision process.
There are 4 kinds of processes we think about:
- Planning:
- Input: Model (such as Markov decision process)
- Output: Policy
- E.g. Value iteration (MDP) or Policy Iteration (MDP)
- Learning:
- Input: Transitions
- Output: Policy
- E.g. Reinforcement learning
- Modelling:
- Input: Transitions
- Output: Model
- Simulation:
- Input: Model
- Output: Transitions
We can compose these strategies in two major ways.
The first uses a simulator and a learner to produce an RL-based planner. This method gave the first major success of RL, producing a competitive backgammon player. The second combines a modeller with a planner, giving model-based reinforcement learning, described next.
Model-based reinforcement learning
Model-based reinforcement learning is a method of doing reinforcement learning where you use the provided transitions to build a model of the problem, then use a planner such as Value iteration (MDP) or Policy Iteration (MDP) on that model to produce a policy.
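A rough sketch of the modelling step, assuming states and actions are hashable values and `build_model` is a hypothetical helper rather than anything from a library: count the observed next states to estimate the transition function and average the observed rewards to estimate the reward function.

```python
from collections import defaultdict

def build_model(transitions):
    """Estimate T(s, a, s') and R(s, a) from a list of (s, a, r, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    reward_sums = defaultdict(float)                # (s, a) -> total reward observed
    totals = defaultdict(int)                       # (s, a) -> number of samples

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        totals[(s, a)] += 1

    # Maximum-likelihood estimates of the transition and reward functions.
    T = {sa: {s_next: c / totals[sa] for s_next, c in nexts.items()}
         for sa, nexts in counts.items()}
    R = {sa: reward_sums[sa] / totals[sa] for sa in totals}
    return T, R
```

A planner, such as the value iteration sketch in the next section, can then be run on the estimated `T` and `R` to produce a policy.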
Approaches to Reinforcement learning
There are 3 different approaches to RL:
- Policy search: Here you iterate through different policies to find the closest fit to the data.
- Value-function based: Here you use your example transitions to generate a utility function, which maps states to values. Then, as in Policy Iteration (MDP), you can use this to make a policy.
- Model-based reinforcement learning: You use the transitions to generate a model, which consists of a transition function and reward function.
Value-function based
When using Value iteration (MDP) we look for the optimum utility function, satisfying
$$U(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \right).$$
Whereas for Policy Iteration (MDP) we look for the optimum policy given a utility function,
$$\pi(s) = \operatorname{argmax}_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \right).$$
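A sketch of both of these, using the dictionary layout for `T` and `R` assumed in the model-building sketch above; unseen state-action pairs are treated as giving zero reward and no successors.

```python
def value_iteration(T, R, states, actions, gamma=0.9, iterations=1000):
    """Repeatedly apply U(s) = max_a ( R(s,a) + gamma * sum_s' T(s,a,s') U(s') )."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U = {s: max(R.get((s, a), 0.0)
                    + gamma * sum(p * U[s2] for s2, p in T.get((s, a), {}).items())
                    for a in actions)
             for s in states}
    return U

def greedy_policy(T, R, U, states, actions, gamma=0.9):
    """pi(s) = argmax_a ( R(s,a) + gamma * sum_s' T(s,a,s') U(s') )."""
    return {s: max(actions,
                   key=lambda a: R.get((s, a), 0.0)
                   + gamma * sum(p * U[s2] for s2, p in T.get((s, a), {}).items()))
            for s in states}
```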
Q-function (RL)
Suppose we are in a Markov decision process and using discounted rewards. Define the Q-function
$$Q: S \times A \to \mathbb{R}$$
as the solution to the following equation:
$$Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(s',a').$$
The Q-function (RL) has nice properties, as
$$U(s) = \max_a Q(s,a) \quad \text{and} \quad \pi(s) = \operatorname{argmax}_a Q(s,a)$$
follow from the definition of $Q$, so both the optimum utility function and the optimum policy can be recovered from it.
Incremental learning
Suppose we have a sequence of learning rates $(\alpha_t)_{t \in \mathbb{N}}$. Then a value $V$ incrementally learns some random variable $X$ if, for $t \in \mathbb{N}$ with samples $x_t$ from $X$, we set
$$V_{t+1} = V_t + \alpha_t (x_t - V_t)$$
then let $V = \lim_{t \to \infty} V_t$. This is denoted $V \xleftarrow{\alpha_t} x_t$.
If we have a sequence where
$$\sum_{t=1}^{\infty} \alpha_t = \infty$$
with
$$\sum_{t=1}^{\infty} \alpha_t^2 < \infty,$$
then $V_t$ converges to $\mathbb{E}[X]$.
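A small sketch of this, estimating the mean of a biased coin with the learning rate $\alpha_t = 1/t$, which satisfies both conditions:

```python
import random

def incremental_mean(samples):
    """V <- V + alpha_t * (x_t - V) with alpha_t = 1/t converges to E[X]."""
    V = 0.0
    for t, x in enumerate(samples, start=1):
        alpha = 1.0 / t          # sum of alpha_t diverges, sum of alpha_t^2 converges
        V = V + alpha * (x - V)  # nudge V towards the latest sample
    return V

# A coin paying 1 with probability 0.3: the estimate tends towards 0.3.
print(incremental_mean(random.choices([0, 1], weights=[0.7, 0.3], k=100_000)))
```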
Q-learning
Q-learning is a class of reinforcement learning algorithms that are value-function based. It uses the approach of Incremental learning applied to the Q-function (RL). We use the model of transitions where the learner can provide the action at each iteration.
Pick an initial estimate $\hat{Q}_0$
of the Q-function (RL). We need to pick a sequence of learning rates $(\alpha_t)_{t \in \mathbb{N}}$
such that $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. Lastly, pick how we will choose an action for a given state.
Then we incrementally learn $\hat{Q}$
from our choices by
$$\hat{Q}(s_t, a_t) \xleftarrow{\alpha_t} r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'),$$
that is,
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t) \right).$$
Note that as time changes we switch which state $s_t$
we look at, and will choose different actions.
Correctness
There is a theorem that states: for a Markov decision process, if we apply Q-learning where every state-action pair is visited infinitely often, the next states are sampled using the transition probabilities, and the rewards are distributed correctly, then $\hat{Q}_t \to Q$ and so Q-learning converges correctly.
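A minimal tabular sketch of the whole procedure, assuming a hypothetical `step(s, a)` function that samples one transition and returns `(r, s_next)`, and a `choose_action` function such as the $\epsilon$-greedy chooser sketched in the next section:

```python
from collections import defaultdict

def q_learning(step, choose_action, actions, n_steps=100_000, start_state=0, gamma=0.9):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)            # initial estimate: zero everywhere
    visits = defaultdict(int)         # per-(s, a) visit counts used for the learning rate
    s = start_state
    for _ in range(n_steps):
        a = choose_action(Q, s, actions)
        r, s_next = step(s, a)        # observe one transition (s, a, r, s')
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]  # meets the sum / sum-of-squares conditions per pair
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```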
Choosing an action
The two main ideas for choosing actions are:
- choose randomly, and
- choose the best action.
The trouble with the first is you never use what you are learning - so it will be slow to explore the space you care about. The trouble with the second is you might not explore routes that are better than your current guess.
Explore exploit dilemma
In Machine Learning there is a dilemma when training a model. You can explore the state space more, which can waste time in places that are irrelevant, or you can exploit what you know more, which can get you stuck in a rut, not finding better ways to do things. This same dilemma applies far beyond machine learning, into literally everyone’s daily lives.
You want to pick a balance between the two of them.
Epsilon-greedy exploration
$\epsilon$-greedy exploration is a way of choosing actions in Q-learning. For a sequence $(\epsilon_t)_{t \in \mathbb{N}}$ with $\epsilon_t \in [0,1]$, at time step $t$ you choose action
$$a_t = \begin{cases} \operatorname{argmax}_a \hat{Q}_t(s_t, a) & \text{with probability } 1 - \epsilon_t, \\ \text{a uniformly random action} & \text{with probability } \epsilon_t. \end{cases}$$
In the case where you really do explore infinitely, i.e. if $\epsilon_t \to 0$ slowly enough that every action is still tried infinitely often in every state, then $\hat{Q}_t \to Q$ and the policy you follow converges to the optimal policy (greedy in the limit with infinite exploration).
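A sketch of such a chooser, usable as the `choose_action` argument in the Q-learning sketch above; a fixed $\epsilon$ is used here, whereas the convergence statement needs a decaying sequence $\epsilon_t$.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])
```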
Optimistic exploration
Optimistic exploration is a way of choosing actions in Q-learning. Here we set the initial values of our estimate $\hat{Q}_0(s, a)$
to all be very high and we always pick the action that maximises $\hat{Q}_t(s_t, a)$. Then under uncertainty it will explore actions it does not know about.
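Under the same tabular setup, optimistic exploration only changes how the estimate is initialised; the value 100.0 below is an arbitrary placeholder chosen to exceed any achievable return.

```python
from collections import defaultdict

OPTIMISTIC_VALUE = 100.0                   # assumed to be above any reachable return
Q = defaultdict(lambda: OPTIMISTIC_VALUE)  # every unseen Q(s, a) starts out looking great

def greedy(Q, s, actions):
    """Always act greedily; untried actions keep their optimistic value, so they get explored."""
    return max(actions, key=lambda a: Q[(s, a)])
```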