In unsupervised learning we are given an input domain $X$ with some training data $T \subseteq X$ which are not labelled (you may or may not be provided with an output codomain). The goal is to make some groups (and a function $f$ assigning points to groups) that collect items that are “related” to one another given a notion of distance $d$.
Suppose we are in the clustering problem set-up. Here we are given the number of desired clusters $k$ and the sense of distance $d$.
The idea is to make every element in our training data its own cluster, then repeatedly take the union of the two clusters which are closest. For this linkage clustering we use the minimum distance between points in the two clusters, i.e.

$$d(X, Y) = \min_{x \in X,\, y \in Y} d(x, y)$$

where $d : \mathcal{P}(T) \times \mathcal{P}(T) \rightarrow \mathbb{R}$ and $\mathcal{P}(T)$ is the power set of $T$. We keep iterating this until we are left with $k$ clusters.
Pseudocode
```
SingleLinkage(k, d, T):
  Input:
    k - the number of clusters
    d - a symmetric distance on T
    T - training data
  Output: A function f splitting T into k groups
  1. Set f to be the identity function (every point is its own cluster).
  2. While the codomain of f has size larger than k:
     1. Set C_a, C_b to be empty and min_distance to be infinity
     2. For all pairs of distinct clusters X, Y in the codomain of f:
        1. Calculate dist = min_{x in X, y in Y} d(x, y)
        2. If dist < min_distance:
           1. Set C_a = X, C_b = Y, min_distance = dist
     3. Update f so that every point in C_b now maps to C_a (merging the two clusters)
  3. Return f
```
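The pseudocode above can be sketched in Python. This is a minimal illustration, representing the clusters as lists of points rather than as the function f; the function name and the example data are my own choices, not part of the notes.

```python
import math

def single_linkage(k, d, T):
    """Merge the two closest clusters (by minimum point-to-point
    distance) until only k clusters remain."""
    # Start with every training point in its own cluster.
    clusters = [[t] for t in T]
    while len(clusters) > k:
        best = (math.inf, None, None)
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(d(x, y) for x in clusters[i] for y in clusters[j])
                if dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i].extend(clusters.pop(j))
    return clusters

# Example: two obvious bunches of 1-D points.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
groups = single_linkage(2, lambda x, y: abs(x - y), points)
```

The quadratic scan over all pairs of clusters is what makes the naive implementation slow; it is written for clarity, not speed.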
Run time
A naive implementation runs in $O(n^3)$ time in the worst case (each of the $O(n)$ merges rescans all pairs of points), but realistically you could get it down to $O(n^2)$ by caching pairwise distances and nearest-neighbour information between merges.
Correctness
Whilst it solves the problem, it can lead to counter-intuitive answers: a chain of very closely spaced points will be merged into a single cluster, even when by eye you might split it differently.
Suppose we are in the clustering problem set-up and are provided the number of clusters $k$. Further assume $X$ is a space with some method of averaging $\mathrm{Avg}$ which satisfies, for any finite $S \subseteq X$ and any $y \in X$,

$$\sum_{t \in S} d(\mathrm{Avg}(S), t)^2 \leq \sum_{t \in S} d(y, t)^2.$$
In the Euclidean example we will have $\mathrm{Avg}(S) = \frac{1}{\vert S \vert} \sum_{t \in S} t$.
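We can check the required averaging property numerically in the Euclidean case: the mean of a sample scores no worse than any other candidate centre on the sum of squared distances. The sample values below are made up for illustration.

```python
def sum_sq(point, sample):
    """Sum of squared (Euclidean) distances from `point` to the sample."""
    return sum((point - t) ** 2 for t in sample)

sample = [1.0, 2.0, 4.0, 9.0]
avg = sum(sample) / len(sample)  # Avg(S) in the Euclidean case

# Any other candidate centre does at least as badly as the mean.
for candidate in [0.0, 2.5, avg + 0.1, 10.0]:
    assert sum_sq(avg, sample) <= sum_sq(candidate, sample)
```

Of course a few spot checks are not a proof; the property holds exactly for the Euclidean mean, which is what makes it a valid choice of $\mathrm{Avg}$.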
To start the algorithm we pick $k$ random points in our training data and set these to be the current centres $center_1, \ldots, center_k$ for iteration $0$. For iteration $i$ assume we have centres $center_1, \ldots, center_k$ (they don’t need to be points in $T$). Now we group the training data

$$C_j = \left\{ t \in T \;\Big\vert\; \left( \operatorname{arg\,min}_{1 \leq l \leq k} d(t, center_l) \right) = j \right\}.$$

Then we update our centres $center_j = \mathrm{Avg}(C_j)$. We keep iterating until $C_j$ stops changing for all $j$.
Pseudocode
```
KMeans(k, T, d, Avg):
  Input:
    k - the number of clusters
    T - the training data
    d - the symmetric distance function on X
    Avg - a method of averaging points
  Output: A function f mapping T to k groups.
  1. Pick k random points in T and set these to be center_1, ..., center_k
  2. Set f_last to be empty and f_current(t) = argmin_j d(t, center_j) for all t in T
  3. While f_last != f_current:
     1. Set f_last = f_current
     2. Set center_i = Avg(f_last^{-1}(i)) for all 1 <= i <= k
     3. Set f_current(t) = argmin_j d(t, center_j) for all t in T
  4. Return f_current
```
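The steps above can be sketched in Python, representing the assignment f as a dict from point to group index. This is a minimal illustration; the function name and the guard for empty groups are my own choices.

```python
import random

def k_means(k, T, d, avg):
    """Assign each point to its nearest centre, then move each centre
    to the average of its group, until the assignment stops changing."""
    centres = random.sample(T, k)
    f_current = {t: min(range(k), key=lambda j: d(t, centres[j])) for t in T}
    f_last = None
    while f_last != f_current:
        f_last = f_current
        # Update each centre to the average of its assigned points
        # (keep the old centre if its group went empty).
        for i in range(k):
            group = [t for t in T if f_last[t] == i]
            if group:
                centres[i] = avg(group)
        f_current = {t: min(range(k), key=lambda j: d(t, centres[j])) for t in T}
    return f_current, centres

# Example: two obvious bunches of 1-D points.
points = [0.0, 0.2, 0.4, 8.0, 8.2]
assignment, centres = k_means(2, points, lambda x, y: abs(x - y),
                              lambda g: sum(g) / len(g))
```

For this well-separated data every random initialisation happens to reach the same split; in general the result depends on the starting centres, as the correctness discussion below explains.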
Run time
We will go on to “hand wave” that this algorithm can never return to the same state without stopping. Let $n = \vert T \vert$; then the number of possible iterations is at most $k^n$ (a lazy estimate on the number of partitions of $T$ into $k$ groups). Each iteration takes $O(nk)$ as we need to calculate the distance from each point to each of the centres and then calculate averages for each of the centres.

This gives the run time to be $O(nk \cdot k^n)$.
In practice this algorithm is much faster than the run time lets on.
Correctness
Correctness in this situation will mean that the algorithm terminates eventually. To “hand wave” this we will phrase the problem as a hill-climbing problem.
Let the configurations be a partition $C_1, \ldots, C_k$ together with centres $center_1, \ldots, center_k$. Then we can score each configuration as

$$\mathrm{score} = -\sum_{i=1}^{k} \sum_{t \in C_i} d(t, center_i)^2$$

(negated so that hill climbing, which maximises the score, is minimising the total squared distance).
To define a configuration’s neighbourhood let $center_i' = \mathrm{Avg}(C_i)$ and

$$C_i' = \left\{ t \in T \;\Bigg\vert\; \left( \operatorname{arg\,min}_{1 \leq j \leq k} d(t, center_j) \right) = i \right\}.$$

The neighbours of a configuration are the configuration with the centres replaced by the $center_i'$ and the configuration with the partition replaced by the $C_i'$. Moving to either neighbour never decreases the score, since

$$\sum_{i=1}^{k} \sum_{t \in C_i} d(t, center_i)^2 \;\geq\; \sum_{i=1}^{k} \sum_{t \in C_i} d(t, center_i')^2 \;\geq\; \sum_{i=1}^{k} \sum_{t \in C_i'} d(t, center_i')^2.$$

The first inequality follows from the property of averages we have required. The second one follows as we picked the $C_i'$ to minimise the distance between the points and the centres.
Then our algorithm is identical to the Hill climbing algorithm in this set up.
Double hop
This argument only works when you think of one iteration of our algorithm as 2 steps of hill climbing. First you update the centres (one hop), then you update the partition (two hops).

The state we visit after the first hop will have at most 1 neighbour, as in the previous step we will have picked either the centres or the partition to be optimal.
Therefore we don’t revisit a state?
Might only be true in Euclidean spaces
Given this is the same as hill climbing we can get stuck in a local optimum which is not a global optimum. There are two ways to get around this: restart the algorithm with different random initial centres and keep the best-scoring result, or pick the initial centres more carefully (e.g. spreading them out across the data).
Expectation maximisation (EM) is an algorithm which can be used on problems such as the clustering problem. It is used in situations where you assume your data follows some underlying distribution given by parameters $\theta$; the hypothesis space is the set of possible values of $\theta$.
Suppose we draw samples from $X$ which has an observed component $Y$ and an unobserved component $Z$. Therefore the samples of $X$ are given by $x_i = (y_i, z_i)$.

Once you are provided with the observed values $y_i$ you can treat the values of $z_i$ as random variables. Both $Y$ and $Z$ depend on the unknown parameters $\theta$.
During the EM algorithm we will update our current hypothesis for the parameters, $h$, to a new hypothesis $h'$. We want to update by trying to maximise

$$Q(h' \mid h) = E_{Z \mid Y, h}\left[ \ln P(Y, Z \mid h') \right].$$
Let’s unpack this - we use our current hypothesis $h$ to guess at the values of $Z$. This gives us $Z$ and therefore the full samples $X$. Now we want to find the hypothesis $h'$ (the parameters of the distribution) that maximises the log probability that $X$ would occur. This follows a nice natural two step process.
1. (E-step) Given our current hypothesis $h$, calculate the distribution of the unobserved variables $Z$; this gives us the function $Q(h' \mid h)$, and then
2. (M-step) Using the values of $Y$ and the distribution for $Z$, find a hypothesis $h' = \operatorname{arg\,max}_{h'} Q(h' \mid h)$.
If the function $Q$ is continuous for all $h$ then we will converge to a local maximum; however we don’t know how long this will take us, or whether this local maximum is a global maximum.
Correctness
- We monotonically decrease the maximum-likelihood error of the groupings (equivalently, the likelihood never decreases).
- This is not guaranteed to converge in finitely many steps, though it will not diverge either!
- It can get stuck in local optima.
- So it is best that we restart this algorithm several times and take the best outcome.
A concrete example of this comes from using normal distributions. Assume our classes are given by $k$ normal distributions, each with the same standard deviation $\sigma$ but a different mean $\mu_i$.
Then we are going to iteratively update:

1. the expectation $E[z_{ij}]$ of the event that the point $y_i$ belongs to the class $j$, given the hypothesis that the clusters use means $\mu_1, \ldots, \mu_k$, and
2. the parameters $\mu_j$ using the distribution of the $z_{ij}$.
First we will describe step one, given some current means $\mu_1, \ldots, \mu_k$. We set

$$E[z_{ij}] = \frac{p(y_i \mid \mu_j)}{\sum_{l=1}^{k} p(y_i \mid \mu_l)} = \frac{\exp\left( -\frac{1}{2\sigma^2} (y_i - \mu_j)^2 \right)}{\sum_{l=1}^{k} \exp\left( -\frac{1}{2\sigma^2} (y_i - \mu_l)^2 \right)}$$

where the formula for $p(y_i \mid \mu_j)$ comes from the normal distribution $N(\mu_j, \sigma^2)$.
The second step just calculates the best parameters given these probabilities. In this case the means are

$$\mu_j = \frac{\sum_{i=1}^{n} E[z_{ij}] \, y_i}{\sum_{i=1}^{n} E[z_{ij}]}.$$
This process is not guaranteed to stop.
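The two updates above can be sketched in Python for 1-D data. This is a minimal illustration assuming a known, shared $\sigma$ and uniform mixing weights (as the formulas above implicitly do); the function name, the fixed iteration count, and the sample data are my own choices.

```python
import math

def em_gaussian_means(ys, mus, sigma, iters=50):
    """EM for a mixture of normals with shared, known sigma:
    alternate the E-step and M-step updates above."""
    mus = list(mus)
    k = len(mus)
    for _ in range(iters):
        # E-step: responsibilities E[z_ij] from the softmax-like formula.
        resp = []
        for y in ys:
            weights = [math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: each mean becomes the responsibility-weighted average.
        mus = [sum(r[j] * y for r, y in zip(resp, ys)) / sum(r[j] for r in resp)
               for j in range(k)]
    return mus

# Two well-separated bunches of 1-D samples (made-up data for illustration).
data = [0.0, 0.2, 0.1, 4.0, 4.2, 4.1]
means = em_gaussian_means(data, mus=[0.5, 3.5], sigma=0.5)
```

With well-separated data like this the responsibilities are nearly hard assignments and the means settle on the group averages; with badly chosen starting means it can instead settle in a local optimum, which is why restarting is recommended.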