In unsupervised learning we are given an input domain $A$ with some training data $D \subseteq A$ which are not labelled (you may or may not be provided with an output codomain $B$). The goal is to make some $B$ (and $f: A \to B$) that groups items that are “related” to one another given $D$.
This is the most naïve random optimisation algorithm. Suppose we have an optimisation problem $f: A \to \mathbb{R}$ where $A$ has some sense of neighbourhood $N(a) \subseteq A$ for each point $a \in A$.
Start at some random point $a \in A$, then move to the point

$$a' = \operatorname{arg\,max}_{x \in N(a)} f(x),$$

and keep iterating until $a' = a$ (i.e. no neighbour improves on the current point).
Pseudocode
hill_climber(optimise, neighbourhood):
    Input:
        optimise: The function you are looking to optimise.
        neighbourhood: A function that returns the neighbourhood of a point, A -> P(A).
    Output:
        Some a in A that is a local optimum for optimise.

    1. Select random point current_best in A.
    2. Let next_best be some different point in A.
    3. While current_best != next_best:
        1. Let current_best = next_best.
        2. Let next_best_score = optimise(current_best).
        3. For n in neighbourhood(current_best):
            1. If optimise(n) > next_best_score:
                1. next_best_score = optimise(n)
                2. next_best = n
    4. Return current_best
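As a concrete sketch, here is the pseudocode above in Python. The one-max problem (maximise the number of ones in a bit string) and the single-bit-flip neighbourhood are illustrative assumptions, not part of the notes.

```python
import random

def hill_climber(optimise, neighbourhood, random_point):
    """Greedy local search: repeatedly move to the best neighbour."""
    current_best = random_point()
    current_score = optimise(current_best)
    while True:
        # Find the best point in the neighbourhood of current_best.
        best_neighbour = max(neighbourhood(current_best), key=optimise)
        if optimise(best_neighbour) <= current_score:
            return current_best  # No neighbour improves: a local optimum.
        current_best, current_score = best_neighbour, optimise(best_neighbour)

# Illustrative problem: maximise the number of ones in a 10-bit string;
# neighbours differ in exactly one bit.
def flip_one_bit(bits):
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]

result = hill_climber(
    optimise=sum,
    neighbourhood=flip_one_bit,
    random_point=lambda: tuple(random.randint(0, 1) for _ in range(10)),
)
print(result)  # For one-max this always reaches the all-ones string.
```

For one-max every non-optimal point has an improving neighbour, so this example always terminates at the global optimum; in general a hill climber only finds a local one.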
Run time
This strongly depends on the sense of neighbourhood. For example, if each neighbourhood were the entire domain, every iteration would have to evaluate every point, which might be very slow.
Correctness
It is only guaranteed to return a local optimum; there is no guarantee it will hit a global optimum.
We need the sense of neighbourhood to roughly align with how the function behaves, otherwise we are just looking at random subsets of points.
The next class of algorithms is like hill climbing but we allow for a bit more random walking.
Simulated Annealing
Suppose we have an optimisation problem $f: A \to \mathbb{R}$ where $A$ has some sense of neighbourhood $N(a) \subseteq A$ for each point $a \in A$.
Then we want to emulate hill climbing but probabilistically allow for random walking. We do this by randomly deciding to take moves from $a$ to $a' \in N(a)$ based on the temperature $T$ of the algorithm at that time. This is given by

$$P(a, a', T) = \begin{cases} 1 & \text{if } f(a') \geq f(a), \\ e^{\frac{f(a') - f(a)}{T}} & \text{otherwise.} \end{cases}$$
Then we just modify the move step of hill climbing with this rule, gradually decrease the temperature, and we have simulated annealing.
Pseudocode
simulated_annealing(optimise, neighbourhood, max_iterations, update_temperature, starting_temperature):
    Input:
        optimise: The function you are looking to optimise.
        neighbourhood: A function that returns the neighbourhood of a point, A -> P(A).
        max_iterations: An integer that tells us when to stop.
        update_temperature: A method to update the temperature.
        starting_temperature: The starting value of the temperature.
    Output:
        Some a in A that hopefully is a local optimum.

    1. Pick a random start point current in A.
    2. Set temperature = starting_temperature.
    3. For iteration in 1, ..., max_iterations:
        1. Pick a random point to_switch in neighbourhood(current).
        2. Set current = to_switch with probability P(current, to_switch, temperature).
        3. Set temperature = update_temperature(temperature, iteration).
    4. Return current
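A Python sketch of the loop above. The one-max problem, single-bit-flip neighbourhood, and geometric cooling schedule are illustrative assumptions, not prescribed by the notes.

```python
import math
import random

def simulated_annealing(optimise, neighbourhood, max_iterations,
                        update_temperature, starting_temperature, start):
    current = start
    temperature = starting_temperature
    for iteration in range(1, max_iterations + 1):
        to_switch = random.choice(neighbourhood(current))
        # Acceptance rule: always move uphill; move downhill with
        # probability exp((f(a') - f(a)) / T).
        diff = optimise(to_switch) - optimise(current)
        if diff >= 0 or random.random() < math.exp(diff / temperature):
            current = to_switch
        temperature = update_temperature(temperature, iteration)
    return current

# Illustrative run: maximise the number of ones in a 10-bit string.
def flip_one_bit(bits):
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]

random.seed(0)
result = simulated_annealing(
    optimise=sum,
    neighbourhood=flip_one_bit,
    max_iterations=2000,
    update_temperature=lambda t, i: t * 0.99,  # geometric cooling, a common choice
    starting_temperature=10.0,
    start=tuple(random.randint(0, 1) for _ in range(10)),
)
print(sum(result))
```

By the late iterations the temperature is tiny, so the loop behaves like stochastic hill climbing and (on this easy problem) settles on the optimum.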
Notice with the formula that if $T \to 0$ then $P(a, a', T) \to 0$ for points where $f(a') < f(a)$, so it will start acting like hill climbing, whereas as $T \to \infty$ we have $P(a, a', T) \to 1$, so it will be close to a random walk.
Simulated annealing ending probability
Statement
Lemma
In the simulated annealing algorithm, under some assumptions which are not specified here, the probability of ending at a point $x \in A$ converges to the Boltzmann distribution

$$\Pr[\text{end at } x] = \frac{e^{f(x)/T}}{\sum_{x' \in A} e^{f(x')/T}},$$

which concentrates on the global optima as $T \to 0$.
In all the above examples, we just updated our next guess based on local properties. However, we might find a couple of points with good fitness and want to combine them to find out what properties they share that provide that good fit. This is the idea behind genetic algorithms: keep a population of points, select the fittest, and generate new points from the survivors by crossover and mutation. We keep doing this for a number of generations and then stop at some point determined by a stopping criterion.
Pseudocode
genetic_algorithm(optimise, selection_function, mutation, crossover, stopping_condition, population_size):
    Input:
        optimise: The function you are looking to optimise.
        selection_function: The function that picks which members of the population to keep.
        mutation: A function that generates small mutations.
        crossover: A function that can make new points from two others.
        stopping_condition: Some condition to stop.
        population_size: The size of the population each iteration (can change on some parameters).
    Output:
        Some a in A that hopefully is an optimum.

    1. Generate a random starting population of size population_size.
    2. While stopping_condition is not met:
        1. Let survivors = selection_function({(a, optimise(a)) for a in population}).
        2. Use mutation and crossover on the survivors to generate a new population of size population_size.
    3. Return argmax of optimise(a) for a in population.
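The loop above can be sketched in Python. The one-max problem, truncation selection (keep the top half), one-point crossover, and single-bit-flip mutation are all illustrative choices for the pluggable parts, not the only options.

```python
import random

def genetic_algorithm(optimise, selection_function, mutation, crossover,
                      stopping_condition, population_size, random_point):
    population = [random_point() for _ in range(population_size)]
    generation = 0
    while not stopping_condition(generation, population):
        # Score the population and keep the fittest...
        survivors = selection_function([(a, optimise(a)) for a in population])
        # ...then refill the population with crossover and mutation.
        population = list(survivors)
        while len(population) < population_size:
            parent_a, parent_b = random.sample(survivors, 2)
            child, _ = crossover(parent_a, parent_b)
            population.append(mutation(child))
        generation += 1
    return max(population, key=optimise)

def top_half(scored):
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [a for a, _ in scored[:len(scored) // 2]]

def one_point(a, b):
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def flip_random_bit(bits):
    i = random.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

random.seed(0)
best = genetic_algorithm(
    optimise=sum,
    selection_function=top_half,
    mutation=flip_random_bit,
    crossover=one_point,
    stopping_condition=lambda gen, pop: gen >= 30,
    population_size=20,
    random_point=lambda: tuple(random.randint(0, 1) for _ in range(10)),
)
print(sum(best))
```

Carrying the survivors over unchanged (elitism) means the best fitness seen never decreases between generations.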
One-point crossover
We could choose an arbitrary point and swap the bits before and after this point. For example, on 8-bit strings with crossover point 4, the first 4 bits of the first output are from the first input and the second 4 bits are from the second input, and vice versa for the second output.
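A minimal sketch of this one-point crossover on 8-bit strings (the example strings here are made up for illustration):

```python
def one_point_crossover(a, b, point):
    """Swap the tails of two equal-length bit strings at `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]

first, second = one_point_crossover("11111111", "00000000", point=4)
print(first, second)  # 11110000 00001111
```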
Uniform crossover
We could instead choose bits at random, either uniformly or weighted by their fitness. For example, the first output might take its 3rd, 5th and 8th bits from the second input and the rest from the first, with the second output doing the reverse.
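A sketch of uniform crossover; the `swap_positions` parameter is my own addition so the example is reproducible rather than random:

```python
import random

def uniform_crossover(a, b, swap_positions=None):
    """For each position, decide which parent supplies the bit.
    If swap_positions is None, choose each position uniformly at random;
    otherwise use the given set of positions for reproducibility."""
    if swap_positions is None:
        swap_positions = {i for i in range(len(a)) if random.random() < 0.5}
    first = "".join(b[i] if i in swap_positions else a[i] for i in range(len(a)))
    second = "".join(a[i] if i in swap_positions else b[i] for i in range(len(a)))
    return first, second

# Taking the 3rd, 5th and 8th bits (1-indexed) from the other parent:
first, second = uniform_crossover("11111111", "00000000", swap_positions={2, 4, 7})
print(first, second)  # 11010110 00101001
```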
Crossover choice
Notice that any choice of crossover function relies heavily on how you represent your data!
Randomised algorithms
All the randomised algorithms here have been fairly simple and don’t use history. There are more complicated ones that combine ideas from different algorithms and try to model the area they are searching.
MIMIC
MIMIC (meta)
Suppose we have some optimisation problem with input space $A$ and fitness function $f: A \to \mathbb{R}$. Then for any $\theta \in \mathbb{R}$ define

$$p^{\theta}(x) = \begin{cases} \dfrac{1}{|\{x' \in A : f(x') \geq \theta\}|} & \text{if } f(x) \geq \theta, \\ 0 & \text{otherwise,} \end{cases}$$

the uniform distribution over points with fitness at least $\theta$. The idea is to repeatedly sample from an estimate of $p^{\theta}$, raise $\theta$, and re-estimate the distribution from the samples that pass the new threshold.
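A sketch of the meta-level loop in Python. The `independent_bits` model here is a deliberately simple stand-in for the `probability_constructor` (treating each bit independently, rather than the dependency trees discussed below), and the one-max problem is an illustrative assumption.

```python
import random

def mimic(optimise, probability_constructor, sample_size, percentile,
          iterations, random_point):
    """Meta-level MIMIC loop: probability_constructor fits a distribution
    to the samples above the current threshold theta and returns a sampler."""
    samples = [random_point() for _ in range(sample_size)]
    for _ in range(iterations):
        # Raise theta to the given percentile of the current samples' fitness.
        scores = sorted(optimise(x) for x in samples)
        theta = scores[int(len(scores) * percentile)]
        survivors = [x for x in samples if optimise(x) >= theta]
        sampler = probability_constructor(survivors)
        samples = [sampler() for _ in range(sample_size)]
    return max(samples, key=optimise)

# Stand-in probability model: sample each bit independently from its
# empirical frequency among the survivors.
def independent_bits(survivors):
    n = len(survivors[0])
    freq = [sum(x[i] for x in survivors) / len(survivors) for i in range(n)]
    return lambda: tuple(int(random.random() < p) for p in freq)

random.seed(0)
best = mimic(
    optimise=sum,
    probability_constructor=independent_bits,
    sample_size=50,
    percentile=0.5,
    iterations=10,
    random_point=lambda: tuple(random.randint(0, 1) for _ in range(10)),
)
print(sum(best))
```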
Estimating the probability distribution using dependency trees
Dependency Trees (Bayesian Network)
Dependency Trees (Bayesian Network)
A Bayesian network $(G, P)$ is a dependency tree if $G$ is a directed tree, that is, every node has at most one parent. Therefore there is a function $\pi$ which maps every vertex to its parent (or to nothing, for the root), which gives

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{\pi(i)}).$$
As always, let's assume we are in the modelling framework and we can break our domain down into $n$ features; we assume the random variable breaks down similarly as $X = (X_1, \ldots, X_n)$.
To build a probability distribution on the domain we will model it using a dependency tree, where $P(X_i = x_i)$ will be the probability that an input has $x_i$ in the $i$th feature and has fitness larger than $\theta_t$ for some step $t$. Similarly for $P(X_i = x_i \mid X_{\pi(i)} = x_{\pi(i)})$; however, here we are assuming that feature $i$ is for some reason dependent on feature $\pi(i)$.
Just like with a Bayesian network we can calculate $P(X_i = x_i)$ and $P(X_i = x_i \mid X_{\pi(i)} = x_{\pi(i)})$ using the samples. However, unlike with a fixed Bayesian network, each time we do this we need to pick the most meaningful relationships to use.
Now for every choice of $\pi$ defining the dependency tree, let's try to minimise the KL-divergence. We are going to assume we have the perfect probability distribution $p$ and we are modelling it using our dependency tree $\hat p_\pi$, where

$$D_{KL}(p \,\|\, \hat p_\pi) = \sum_x p(x) \log \frac{p(x)}{\hat p_\pi(x)} = -h(p) + \sum_i h(X_i \mid X_{\pi(i)}),$$

here we are making the assumption that $h(X_i \mid X_{\pi(i)}) = h(X_i)$ when $i$ is the root (which has no parent). Moreover we are assuming we are using the real distribution $p$ to compute these entropies, where $h(\cdot \mid \cdot)$ is conditional entropy. Then, as $-h(p)$ doesn't depend on $\pi$, when looking for a minimum we can ignore it. Though, as it is mathematically convenient, we will also subtract $\sum_i h(X_i)$, which doesn't depend on $\pi$ either. All together this is

$$J(\pi) = \sum_i h(X_i \mid X_{\pi(i)}) - \sum_i h(X_i) = -\sum_i I(X_i ; X_{\pi(i)}),$$

where $I$ is mutual information. So we just need to calculate $I(X_i; X_j)$ for every pair of features $i$ and $j$. Then we can find an MST in the complete graph on the features, where the edge between $i$ and $j$ is weighted by $-I(X_i; X_j)$: minimising the total weight maximises the total mutual information.
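A sketch of this step in Python: estimate pairwise mutual information from binary samples, then grow the tree with Prim's algorithm on the complete graph weighted by $-I(X_i; X_j)$ (equivalently, a maximum-weight spanning tree on $I$). The toy samples are made up for illustration.

```python
import math
from itertools import product

def mutual_information(samples, i, j):
    """Empirical mutual information I(X_i; X_j) from binary samples."""
    m = len(samples)
    mi = 0.0
    for a, b in product([0, 1], repeat=2):
        p_ab = sum(1 for s in samples if s[i] == a and s[j] == b) / m
        p_a = sum(1 for s in samples if s[i] == a) / m
        p_b = sum(1 for s in samples if s[j] == b) / m
        if p_ab > 0:
            mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi

def dependency_tree(samples):
    """Prim's algorithm maximising total mutual information. Returns the
    parent map pi, with the (arbitrary) root 0 mapped to None."""
    n = len(samples[0])
    pi = {0: None}
    while len(pi) < n:
        # Pick the highest-MI edge leaving the tree built so far.
        best = max(
            ((i, j) for i in pi for j in range(n) if j not in pi),
            key=lambda e: mutual_information(samples, e[0], e[1]),
        )
        pi[best[1]] = best[0]
    return pi

# Toy data: feature 1 copies feature 0, feature 2 is independent noise,
# so the tree should link 1 to 0.
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
pi = dependency_tree(samples)
print(pi[1])  # 0
```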
Once we have the MST we can choose an arbitrary root and direct the edges accordingly!
Now we have the dependency tree structure we can generate the probabilities using our samples.
Bringing this all together we have the algorithm.
MIMIC by dependency trees
This follows on from MIMIC (meta), where we are going to use dependency trees to estimate the distribution. Here we are just implementing a version of the probability_constructor.
We assume the domain of our problem breaks up into $n$ features, and we break the random variable up as $X = (X_1, \ldots, X_n)$ along them as well. To generate the probability distribution we are going to build a dependency tree. To decide the form of this dependency tree, due to the maths above, we will use the provided pass or fail samples to calculate the mutual information $I(X_i; X_j)$ for each pair of features.
To do this we need to calculate $\hat P(X_i = x_i)$ for all $i$ and $x_i$, and $\hat P(X_i = x_i \mid X_j = x_j)$ for each pair of features. If, of the $m$ passing samples, $m(x_i)$ have feature $i$ equal to $x_i$ and $m(x_i, x_j)$ have features $i$ and $j$ equal to $x_i$ and $x_j$ respectively, this is simply

$$\hat P(X_i = x_i) = \frac{m(x_i)}{m}, \qquad \hat P(X_i = x_i \mid X_j = x_j) = \frac{m(x_i, x_j)}{m(x_j)},$$

however there are more advanced techniques you can apply here.
Then construct an MST on this graph and pick one node to be the root. This gives us our dependency tree. We can now use the probabilities calculated earlier to find the probabilities associated with this tree.
Once we have this Bayesian network we can use it to simulate new samples to evaluate on.
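A sketch of how such sampling might look, assuming binary features; the `marginal` and `conditional` probability tables here are hypothetical stand-ins for the values estimated from the samples.

```python
import random

def sample_from_tree(pi, marginal, conditional):
    """Ancestral sampling from a dependency tree over binary features.
    pi maps each feature to its parent (root -> None); marginal[i] is
    P(X_i = 1) for the root; conditional[i][v] is P(X_i = 1 | X_pi(i) = v)."""
    # Visit features so that every parent is sampled before its children.
    order = []
    def visit(i):
        if i in order:
            return
        if pi[i] is not None:
            visit(pi[i])
        order.append(i)
    for i in pi:
        visit(i)
    x = {}
    for i in order:
        p = marginal[i] if pi[i] is None else conditional[i][x[pi[i]]]
        x[i] = int(random.random() < p)
    return tuple(x[i] for i in sorted(x))

# Made-up tables: X0 is the root; X1 copies X0 with probability 0.9.
random.seed(1)
s = sample_from_tree(pi={0: None, 1: 0},
                     marginal={0: 0.5},
                     conditional={1: {0: 0.1, 1: 0.9}})
print(s)
```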