Week 7 - Randomized Optimization

We are now moving on to Unsupervised learning.

Unsupervised learning

In unsupervised learning we are given an input domain $A$ with some training data $D \subseteq A$ which are not labelled (you may or may not be provided with an output codomain $B$). The goal is to make some $f: A \rightarrow B$ (and $B$) that groups items that are “related” to one another given $D$.


The difference from supervised learning is that we don't get given labels - so how we actually learn the function is beyond me!

Optimisation problem

For an optimisation problem we are given an input space $A$ and a fitness function $f: A \rightarrow B$ where $B$ is ordered (usually $B = \mathbb{R}$). We want to find $$a^* = \text{argmax}_{a \in A} f(a).$$


There are lots of examples of this.

  • Process control or optimisation, think chemical factory.
  • Finding a route on a map.
  • Find a root of a function!
  • Neural network weights.
  • Minimising error in another model!

There are a couple of good methods here:

  • Generate and test
    • Good when we have a small input space but a complex function (see the sketch after this list).
  • Calculus
    • We need the function to have a derivative and for the resulting equation to be solvable!
  • Search methods like the Newton-Raphson method
    • Similar problems to calculus.
  • Random optimisation, today's lecture :)
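As a tiny illustration of generate and test, here is a minimal Python sketch (the toy fitness function is an assumption for illustration):

```python
# Generate and test: when the input space A is small enough to enumerate,
# evaluate the fitness function everywhere and keep the best point.
def generate_and_test(optimise, domain):
    return max(domain, key=optimise)

# Toy example: maximise f(x) = -(x - 7)^2 over the integers 0..31.
print(generate_and_test(lambda x: -(x - 7) ** 2, range(32)))  # prints 7
```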

Hill climbing

This is the most naïve random optimisation algorithm. Suppose we have an optimisation problem $f: A \rightarrow \mathbb{R}$ where $A$ has some sense of neighbourhood $N(a) \subseteq A$ for each point $a \in A$.

Start at some random point $a \in A$, then move to the point $a' = \text{argmax}_{n \in N(a)} f(n)$, and keep iterating until $a' = a$.

Pseudocode

hill_climber(optimise, neighbourhood):
	Input:
		optimise: The function you are looking to optimise.
		neighbourhood: A function that returns the neighbourhood of a point, A -> P(A).
	Output: Some a in A that is a local optimum for optimise.
1. Select a random point next_best in A.
2. Let current_best be some point in A different from next_best.
3. While current_best != next_best:
	1. Let current_best = next_best
	2. next_best_score = optimise(current_best)
	3. for n in neighbourhood(current_best):
		1. if optimise(n) > next_best_score:
			1. next_best_score = optimise(n)
			2. next_best = n
4. Return current_best
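As a concrete companion to the pseudocode, here is a minimal Python sketch of the hill climber; the toy problem at the bottom (an integer line where each point's neighbours are one step left or right) is an assumption for illustration:

```python
import random

def hill_climber(optimise, neighbourhood, domain):
    # Start at a random point and greedily move to the best neighbour,
    # stopping once no neighbour improves on the current point.
    current_best = random.choice(domain)
    while True:
        next_best = max(neighbourhood(current_best), key=optimise,
                        default=current_best)
        if optimise(next_best) <= optimise(current_best):
            return current_best  # local optimum reached
        current_best = next_best

# Toy problem: maximise f(x) = -(x - 7)^2 where x's neighbours are x +/- 1.
f = lambda x: -(x - 7) ** 2
neighbours = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 31]
print(hill_climber(f, neighbours, list(range(32))))  # prints 7
```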

Run time

This strongly depends on the sense of neighbourhood. For example, if the neighbourhood of each point were the entire domain, this might be very slow.

Correctness

It will only return a local optimum - there is no guarantee it will hit a global optimum. We need the sense of neighbourhood to roughly align with how the function behaves, otherwise we are just looking at random subsets of points.


Restart Hill climbing

Random restart hill climbing

Suppose we have an optimisation problem $f: A \rightarrow \mathbb{R}$ where $A$ has some sense of neighbourhood $N(a) \subseteq A$ for each point $a \in A$.

Then we want to run Hill climbing but try to get around the issue of only finding a local optimum by restarting in a new place at the end of each run.

There are lots of choices and optimisations to make here; we will need to decide on a stopping condition:

  • Number of iterations.
  • Stop improving on the optimum value.
  • A % of coverage of the points.

We can also make optimisations like remembering where a point led us and choosing not to revisit the point - Memoization.
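A minimal sketch of the restart wrapper, reusing the hill_climber function above and assuming a fixed number of restarts as the stopping condition (memoization could be layered on by wrapping optimise in functools.lru_cache):

```python
def restart_hill_climber(optimise, neighbourhood, domain, restarts=10):
    # Run hill_climber from several random starting points and keep
    # the best local optimum found across all runs.
    best = None
    for _ in range(restarts):
        candidate = hill_climber(optimise, neighbourhood, domain)
        if best is None or optimise(candidate) > optimise(best):
            best = candidate
    return best
```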

Run time

The run time can vary; however, if we use Memoization we can save on the number of times we evaluate the function we are trying to optimise.

Correctness

This has a higher chance of finding a global optimum but no guarantee.


Simulated Annealing

The next class of algorithms is like hill climbing but we allow for a bit more random walking.

Simulated Annealing

Suppose we have an optimisation problem $f: A \rightarrow \mathbb{R}$ where $A$ has some sense of neighbourhood $N(a) \subseteq A$ for each point $a \in A$.

Then we want to emulate Hill climbing but probabilistically allow for random walking. We do this by randomly deciding to take moves from $a$ to $a' \in N(a)$ based on the temperature $T$ of the algorithm at that time. This is given by $$P(a, a', T) = \begin{cases} 1 & \mbox{if } f(a') \geq f(a)\\ e^{\frac{f(a') - f(a)}{T}} & \mbox{otherwise.} \end{cases}$$ Then we just modify what is done in Hill climbing with this, gradually decrease the temperature, and we have simulated annealing.

Pseudocode

simulated_annealing(optimise, neighbourhood, max_iterations, update_temperature, starting_temperature):
	Input:
		optimise: The function you are looking to optimise.
		neighbourhood: A function that returns the neighbourhood of a point, A -> P(A).
		max_iterations: An integer that tells us when to stop.
		update_temperature: A method to update the temperature.
		starting_temperature: The starting value of the temperature.
	Output: Some a in A that hopefully is a local optimum.
1. Pick a random start point current in A.
2. Set temperature = starting_temperature
3. For iteration in 1, ..., max_iterations
	1. Pick a random point to_switch in neighbourhood(current).
	2. Set current = to_switch with probability P(current, to_switch, temperature)
	3. Set temperature = update_temperature(temperature, iteration)
4. Return current
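Here is a minimal Python sketch of the pseudocode, using the acceptance probability $P(a, a', T)$ from above; the geometric cooling schedule and all parameter values are assumptions, one common choice among many:

```python
import math
import random

def acceptance_probability(f_current, f_next, temperature):
    # Always accept improvements; accept worsening moves with
    # probability e^((f(a') - f(a)) / T).
    if f_next >= f_current:
        return 1.0
    return math.exp((f_next - f_current) / temperature)

def simulated_annealing(optimise, neighbourhood, start,
                        max_iterations=10_000,
                        starting_temperature=10.0, cooling=0.999):
    current = start
    temperature = starting_temperature
    for _ in range(max_iterations):
        to_switch = random.choice(neighbourhood(current))
        if random.random() < acceptance_probability(
                optimise(current), optimise(to_switch), temperature):
            current = to_switch
        # Geometric cooling, floored to avoid dividing by zero.
        temperature = max(cooling * temperature, 1e-12)
    return current
```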

Run time

Correctness


Notice with the formula for $P(a, a', T)$ that if $T \rightarrow 0$ then $P(a, a', T) \rightarrow 0$ for points where $f(a') < f(a)$, so it will start acting like Hill climbing, whereas as $T \rightarrow \infty$ we have $P(a, a', T) \rightarrow 1$, so it will be close to a random walk.

Simulated annealing ending probability

Statement

Lemma

In the Simulated Annealing algorithm, with some assumptions which are not specified, we have $$P(\mbox{ending at } a) = \frac{e^{f(a)/T}}{Z_T},$$ the Boltzmann distribution.

Proof


Genetic Algorithms

In all the above examples, we just updated our next guess based on local properties. However, we might find a couple of points with good fitness and want to combine them to find out what properties they have that provide that good fit.

Genetic algorithm (meta)

Suppose we have an optimisation problem $f: A \rightarrow \mathbb{R}$ which has two properties associated to its domain $A$:

  • There is a method to make a small change to any $a \in A$ called a mutation, $M: A \rightarrow P(A)$, where any of $M(a)$ are a small change to $a$.
  • It has some crossover function $C: A \times A \rightarrow A \times A$.

Then for a collection of points $P_t \subseteq A$ called a population, the algorithm assesses the fitness $f(a)$ of each point $a \in P_t$.

It then selects a subset to make the next generation. There are multiple ways to do this:

  • Take the most fit as ranked by fitness, or
  • Use ideas from Simulated Annealing and select points based on some probability distribution weighted by $f$.

Then from our subset we use the mutation and crossover functions to generate a new population.

We keep doing this for a number of generations and then stop at some point determined by a stopping criterion.

Pseudocode

genetic_algorithm(optimise, selection_function, mutation, crossover, stopping_condition, population_size):
	Input:
		optimise: The function you are looking to optimise.
		selection_function: The function that picks which part of the population to keep.
		mutation: A function that generates small mutations.
		crossover: A function that can make new points from two others.
		stopping_condition: Some condition to stop.
		population_size: The size of the population each iteration (can change on some parameters).
	Output: Some a in A that hopefully is an optimum.
1. Generate a random starting population of size population_size.
2. While stopping_condition is not met
	1. Let survivors = selection_function({(a, optimise(a)) for a in population})
	2. Use mutation and crossover on survivors to generate a new population of size population_size.
3. Return argmax optimise(a) for a in population
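A minimal Python sketch of the meta-algorithm using truncation selection (keep the fittest half); for simplicity this version's crossover returns a single child, and every parameter value is an assumption:

```python
import random

def genetic_algorithm(optimise, mutation, crossover, random_point,
                      population_size=50, generations=100):
    population = [random_point() for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the most fit half of the population.
        population.sort(key=optimise, reverse=True)
        survivors = population[: population_size // 2]
        # Reproduction: breed random pairs of survivors, then mutate.
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = random.sample(survivors, 2)
            children.append(mutation(crossover(a, b)))
        population = survivors + children
    return max(population, key=optimise)
```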

Run time

Correctness


For the crossover function let's consider a concrete example.

8-bit strings crossover example

Suppose $A = \{0,1\}^8$, how could we design a crossover function?

Suppose we are merging $x = x_1 x_2 \ldots x_8$ and $y = y_1 y_2 \ldots y_8$.

One point crossover

We could choose an arbitrary point and swap the bits before and after this point. For example $$C(x, y) = (x_1 x_2 x_3 x_4 y_5 y_6 y_7 y_8, \ y_1 y_2 y_3 y_4 x_5 x_6 x_7 x_8)$$ where the first 4 bits of the first output are from the first input and the last 4 bits are from the second input - then vice versa for the second output.

Uniform crossover

We could instead choose bits at random - either uniformly or weighted by their fitness. For example $$C(x, y) = (x_1 x_2 y_3 x_4 y_5 x_6 x_7 y_8, \ y_1 y_2 x_3 y_4 x_5 y_6 y_7 x_8)$$ where for the 3rd, 5th and 8th bits the first output takes the second input's bits (and the rest from the first) - and vice versa for the second output.
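Both crossovers are easy to write down for 8-bit strings. A sketch, representing points as Python strings of '0' and '1':

```python
import random

def one_point_crossover(x, y):
    # Split both strings at a random point and swap the tails.
    point = random.randint(1, len(x) - 1)
    return x[:point] + y[point:], y[:point] + x[point:]

def uniform_crossover(x, y):
    # Choose each bit's parent independently at random; the second
    # output takes the complementary choices.
    picks = [random.random() < 0.5 for _ in x]
    first = "".join(a if p else b for a, b, p in zip(x, y, picks))
    second = "".join(b if p else a for a, b, p in zip(x, y, picks))
    return first, second

print(one_point_crossover("11110000", "00001111"))
print(uniform_crossover("11110000", "00001111"))
```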

Crossover choice

Notice that any choice of crossover function relies heavily on how you represent your data!

Randomised algorithms

All the randomised algorithms here have been fairly simple and don't use history. There are more complicated ones that combine ideas from different algorithms and try to model the area they are searching.

MIMIC

MIMIC (meta)

Suppose we have some optimisation problem with input space $A$ and fitness function $f: A \rightarrow \mathbb{R}$. Then for any $\theta \in \mathbb{R}$ define

$$P^{\theta}(a) = \begin{cases} \frac{1}{Z_{\theta}} & \mbox{if } f(a) \geq \theta\\ 0 & \mbox{otherwise.} \end{cases}$$

(Where $Z_{\theta}$ is some normalisation coefficient to make this a probability distribution.)

Then let $\theta_{min} = \min_{a \in A} f(a)$ and $\theta_{max} = \max_{a \in A} f(a)$; then $P^{\theta_{min}}$ is the uniform distribution over $A$ whereas $P^{\theta_{max}}$ is the uniform distribution over the optima. The goal of MIMIC is to simulate $P^{\theta}$ whilst slowly increasing $\theta$ to find the maximum.

Functionally how we do that is similar to a genetic algorithm. We will suppose we currently have our probability distribution $P^{\theta_t} : A \rightarrow [0,1] \subset \mathbb{R}$; we will obtain samples $S$ from $A$ using $P^{\theta_t}$. Let $\theta_{t+1}$ be some predefined quantile of $f(S)$, for example the median. Then we have positive samples $S' = \{s \in S \vert f(s) \geq \theta_{t+1}\}$ and negative samples $S \backslash S'$ to try and estimate the distribution $P^{\theta_{t+1}}$. The method to estimate $P^{\theta_{t+1}}$ from $S'$ encodes some domain knowledge about the system - this can vary from problem to problem.

Pseudocode

mimic(optimise, number_of_samples, percentile, probability_constructor, stopping_condition):
	Input:
		optimise: The function you are looking to optimise.
		number_of_samples: The number of samples to draw.
		percentile: The percentile to use to cut samples.
		probability_constructor: Some method that uses a probability distribution and +/-ve samples to generate another probability distribution.
		stopping_condition: Some condition to stop.
	Output: Some a in A that hopefully is an optimum.
1. Let P be the uniform distribution.
2. Whilst the stopping_condition is not met
	1. Let S be number_of_samples drawn from P
	2. Let theta be the percentile of {optimise(s) for s in S}
	3. Let S' = {s in S such that optimise(s) >= theta}
	4. Let P = probability_constructor(S', S \ S', P)
3. Draw a sample from P and return it.

Run time

This strongly depends on how we construct our probability distribution. However, normally this takes longer than other randomised algorithms to run one iteration, though it will take fewer iterations altogether. This is useful when optimising a function that is hard to compute.

Correctness

  • MIMIC performs well when your feature space has structure within it that relates to the fitness function.
  • It can get stuck in local optima with the probability distribution, but you can use different sampling methods from probability theory to help you in these situations.
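A minimal Python sketch of the MIMIC loop; sample_from, fit_distribution and uniform_distribution are hypothetical placeholders for the domain-specific pieces (fit_distribution plays the role of probability_constructor), not a fixed API:

```python
def mimic(optimise, sample_from, fit_distribution, uniform_distribution,
          number_of_samples=100, iterations=50):
    distribution = uniform_distribution
    for _ in range(iterations):
        samples = [sample_from(distribution) for _ in range(number_of_samples)]
        # Cut at the median fitness: keep the top half as positive samples.
        scores = sorted(optimise(s) for s in samples)
        theta = scores[len(scores) // 2]
        positives = [s for s in samples if optimise(s) >= theta]
        # Re-fit the distribution so it concentrates on the positives.
        distribution = fit_distribution(positives)
    # Return the best point seen in the final batch of samples.
    return max(samples, key=optimise)
```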

Estimating the probability distribution using dependency trees

Dependency Trees (Bayesian Network)

A Bayesian network $(G, P)$ is a dependency tree if $G$ is a directed tree. That is, every node has at most one parent. Therefore there is a function $\pi$ which defines every vertex's parent or lack of it, which gives $$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \vert X_{\pi(i)}).$$


As always let's assume we are in the modelling framework and we can break down our domain into features $A = A_1 \times A_2 \times \cdots \times A_n$; we assume similarly the random variable breaks down equally, $X = (X_1, X_2, \ldots, X_n)$.

To build a probability distribution on $A$ we will model it using a dependency tree, where $P(X_i = x_i)$ will be the probability that an input uses $x_i$ in the $i$th feature and has a fitness value larger than $\theta_t$ for some step $t$. Similarly for $P(X_i = x_i \vert X_{\pi(i)} = x_{\pi(i)})$, however we are assuming that feature $i$ is for some reason dependent on the feature $\pi(i)$.

Just like with a Bayesian network we can calculate $P(X_i)$ and $P(X_i \vert X_{\pi(i)})$ using the samples - however unlike the Bayesian network, each time we do this we need to pick the most meaningful relationships $\pi$ to use.

Generating your dependency tree

At this point it is best to listen to Week 7 - Information Theory to get the terms required for this proof.

Now for every choice of $\pi$ to define the dependency tree let's try to minimise the KL-divergence. We are going to assume we have the perfect probability distribution $P$ and we are modelling it using our dependency tree $\hat{P}_{\pi}(X) = \prod_{i=1}^{n} P(X_i \vert X_{\pi(i)})$ where

$$D_{KL}(P \Vert \hat{P}_{\pi}) = E_P[\log P(X) - \log \hat{P}_{\pi}(X)] = -H(X) + \sum_{i=1}^{n} H(X_i \vert X_{\pi(i)}),$$

here we are making the assumption that $H(X_i \vert X_{\pi(i)}) = H(X_i)$ when $\pi(i) = \emptyset$, so the root does the right thing. Moreover we are assuming we are using the real distribution $P$ when computing these entropies.

Where $H(X_i \vert X_{\pi(i)})$ is Conditional entropy. Then as $-H(X)$ doesn't depend on $\pi$, when looking for an optimum we can ignore it. Though as it is mathematically convenient we will also subtract $\sum_{i=1}^{n} H(X_i)$, which doesn't depend on $\pi$ either. All together this is

$$J(\pi) = \sum_{i=1}^{n} \left(H(X_i \vert X_{\pi(i)}) - H(X_i)\right) = -\sum_{i=1}^{n} I(X_i ; X_{\pi(i)}),$$

where $I$ is Mutual information, so minimising the KL-divergence amounts to maximising $\sum_{i=1}^{n} I(X_i ; X_{\pi(i)})$. We get a cool result here: before we had $H(X_i \vert X_{\pi(i)})$ which is directional, but $I(X_i ; X_{\pi(i)})$ is not, as Mutual information is symmetric.

So we just need to calculate $I(X_i ; X_j)$ for every pair of $i$ and $j$. Then we can find a maximum-weight spanning tree (an MST on the weights $-I(X_i ; X_j)$) in the complete graph on $\{X_1, \ldots, X_n\}$ where the edges are weighted by $I(X_i ; X_j)$.

Once we have the MST we can choose an arbitrary root and direct the edges accordingly!

Now we have the dependency tree structure we can generate the probabilities using our samples.

Bringing this all together we have the algorithm.

MIMIC by dependency trees

This follows on from MIMIC (meta), where we are going to use dependency trees to calculate the distribution. Here we are just implementing a version of the probability_constructor.

We assume the domain of our problem breaks up into features $A = A_1 \times \cdots \times A_n$, and we break up the random variable on this also, $X = (X_1, \ldots, X_n)$. To generate the probability distribution we are going to build a dependency tree. To decide the form of this dependency tree - due to the maths above - we will use the provided pass or fail samples to calculate Mutual information for each pair of features.

To do this we need to calculate $P(X_i = x_i)$ for all $i$ and $x_i \in A_i$, and $P(X_i = x_i, X_j = x_j)$ for each pair $i \neq j$. If we have samples $S'$ and $S \backslash S'$ of passes and fails respectively, this is simply the empirical frequency among the passes,

$$\hat{P}(X_i = x_i) = \frac{\vert \{s \in S' : s_i = x_i\} \vert}{\vert S' \vert}, \qquad \hat{P}(X_i = x_i, X_j = x_j) = \frac{\vert \{s \in S' : s_i = x_i, s_j = x_j\} \vert}{\vert S' \vert},$$

however there are more advanced techniques you can apply here.

Now we apply the formulas for Mutual information, Information entropy, and conditional entropy to get $I(X_i ; X_j)$.

Then we construct a complete undirected graph on $\{X_1, \ldots, X_n\}$ and weight the edge $(X_i, X_j)$ with $I(X_i ; X_j)$. (Note Mutual information is symmetric so we can just use $I(X_i ; X_j)$ where $i < j$.)

Then construct a maximum-weight spanning tree on this graph and pick one node to be the root, directing edges away from it. This gives us our dependency tree. We can now use the probabilities calculated earlier to find the probabilities associated with this tree.

Once we have this Bayesian network we can use it to simulate new samples to evaluate on.
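Putting the pieces together, here is a hedged sketch of the tree-building step: estimate pairwise mutual information from the positive samples (bit strings here) and run Prim's algorithm for a maximum-weight spanning tree. All helper names are my own, not a fixed API:

```python
import math
from itertools import combinations

def mutual_information(samples, i, j):
    # Plug empirical frequencies from the positive samples into
    # I(X_i; X_j) = sum over (a, b) of p(a,b) log(p(a,b) / (p(a) p(b))).
    n = len(samples)
    p_i, p_j, p_ij = {}, {}, {}
    for s in samples:
        p_i[s[i]] = p_i.get(s[i], 0) + 1 / n
        p_j[s[j]] = p_j.get(s[j], 0) + 1 / n
        p_ij[(s[i], s[j])] = p_ij.get((s[i], s[j]), 0) + 1 / n
    return sum(p * math.log(p / (p_i[a] * p_j[b]))
               for (a, b), p in p_ij.items())

def dependency_tree(samples, n_features):
    # Prim's algorithm for a *maximum* spanning tree on the complete
    # graph weighted by mutual information. Returns parent pointers;
    # feature 0 is (arbitrarily) the root, with parent None.
    mi = {(i, j): mutual_information(samples, i, j)
          for i, j in combinations(range(n_features), 2)}
    weight = lambda i, j: mi[(min(i, j), max(i, j))]
    parent = {0: None}
    remaining = set(range(1, n_features))
    while remaining:
        j, i = max(((j, i) for j in remaining for i in parent),
                   key=lambda edge: weight(edge[0], edge[1]))
        parent[j] = i
        remaining.remove(j)
    return parent

# Features 0 and 1 always agree, feature 2 varies independently.
print(dependency_tree(["000", "001", "110", "111", "000", "111"], 3))
```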

Run time

Correctness
