Week 9 - Clustering

Unsupervised learning

In unsupervised learning we are given an input domain $A$ with some training data $T \subseteq A$ which is not labelled (you may or may not be provided with an output codomain $B$). The goal is to make some $k \in \mathbb{N}$ (and $f : T \to \{1, \ldots, k\}$) that groups items that are “related” to one another given $T$.


Clustering problem

Given training data $T$ and a distance $d$ on it, the clustering problem asks for a number of clusters $k$ and a function $f : T \to \{1, \ldots, k\}$ grouping “related” items together.

Two trivial solutions to clustering problems are:

  • Set $k = 1$ and $f(t) = 1$ for all $t \in T$, or
  • Set $k = |T|$, then $f(t) = t$ for all $t \in T$ (every point is its own cluster).

Single linkage clustering

Suppose we are in the clustering problem set up. Here we are given the number of desired clusters $k$, and the sense of distance $d$.

The idea is to make every element of our training data $T$ its own cluster, then repeatedly take the union of the two clusters which are considered closest. For single linkage clustering the distance between clusters is the minimum distance over pairs of points, i.e.

$$d(X, Y) = \min_{x \in X,\, y \in Y} d(x, y) \quad \text{for } X, Y \in \mathcal{P}(T),$$

where $\mathcal{P}(T)$ is the power set of $T$. We keep iterating this until we are left with $k$ clusters.

Pseudocode

SingleLinkageClustering(k, d, T):
	Input:
		k - the number of clusters
		d - a symmetric distance on T
		T - the training data
	Output:
		A function f splitting T into k groups
1. Set f to be the identity function (every point is its own cluster).
2. While the codomain of f has size larger than k:
	1. Set C_a, C_b to be empty and min_distance to be infinity
	2. For all pairs of distinct clusters X, Y in {f^{-1}(c) : c in codomain(f)}:
		1. Calculate dist = min_{x in X, y in Y} d(x, y)
		2. If dist < min_distance:
			1. Set C_a = X, C_b = Y, min_distance = dist
	3. Update f(t) = f(x) for all t in C_b, where x is any point of C_a
3. Return f
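
To make the pseudocode concrete, here is a minimal Python sketch of the same idea (the function name `single_linkage` and the way points and clusters are represented are my own choices, not from any particular library):

```python
from itertools import combinations

def single_linkage(k, d, T):
    """Start with every point in its own cluster, then repeatedly merge the
    two clusters whose closest pair of points is nearest, until k remain."""
    clusters = [[t] for t in T]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                d(x, y) for x in clusters[pair[0]] for y in clusters[pair[1]]
            ),
        )
        # Merge cluster b into cluster a (b > a, so deleting index b is safe).
        clusters[a].extend(clusters[b])
        del clusters[b]
    # Return f as a mapping from each point to its cluster index.
    return {t: i for i, cluster in enumerate(clusters) for t in cluster}

# Example: Euclidean distance on the real line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
print(single_linkage(2, lambda x, y: abs(x - y), points))
# {0.0: 0, 0.1: 0, 0.2: 0, 5.0: 1, 5.1: 1, 9.0: 1}
```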

Run time

If you implement this naively then in the worst case it is $O(n^3)$, where $n = |T|$, but realistically you could get it down to $O(n^2)$.

Correctness

Whilst it solves the problem, it can lead to counter-intuitive answers: chains of very closely grouped points will cluster together even when, by eye, you might split them differently.


k-means clustering

Suppose we are in the clustering problem set up and we are provided the number of clusters $k$. Further assume $A$ is a space with some method of averaging $\mathrm{Avg}$ which satisfies

$$\sum_{x \in X} d\left(x, \mathrm{Avg}(X)\right)^2 \leq \sum_{x \in X} d(x, c)^2 \quad \text{for all finite } X \subseteq A \text{ and all } c \in A.$$

In the Euclidean example we will have

$$\mathrm{Avg}(X) = \frac{1}{|X|} \sum_{x \in X} x.$$

To start the algorithm we pick $k$ random points in our training data $T$. We set these to be the current centres $\mathrm{center}_i^0$ for $1 \leq i \leq k$. For iteration $n+1$ assume we have centres $\mathrm{center}_i^n$ (they don’t need to be points in $T$). Now we group the training data

$$C_i^n = \left\{ t \in T \;\Bigg\vert\; \left( \operatorname*{arg\,min}_{1 \leq j \leq k} d\left(t, \mathrm{center}_j^n\right) \right) = i \right\}.$$

Then we update our centres $\mathrm{center}_i^{n+1} = \mathrm{Avg}(C_i^n)$. We keep iterating until $C_i^{n+1} = C_i^n$ for all $1 \leq i \leq k$.

Pseudocode

KMeansClustering(k, T, d, Avg):
	Input:
		k - the number of clusters
		T - the training data
		d - the symmetric distance function on A
		Avg - a method of averaging points
	Output:
		A function f mapping T to k groups.
1. Pick k random points in T and set these to be center_1, ..., center_k
2. Set f_last to be empty and f_current(t) = argmin_j d(t, center_j) for all t in T
3. While f_last is empty or f_last != f_current:
	1. Set f_last = f_current
	2. Set center_i = Avg(f_last^{-1}(i)) for all 1 <= i <= k
	3. Set f_current(t) = argmin_j d(t, center_j) for all t in T
4. Return f_current
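
Below is a runnable Python sketch of the same loop for points in the plane; the helper names (`assign`, `euclid`, `mean`) and the `max_iter` safeguard are illustrative choices of mine, not part of the algorithm as stated above:

```python
import random

def k_means(k, T, d, avg, max_iter=100):
    """Alternate between assigning points to their nearest centre and
    moving each centre to the average of its current group."""
    centers = random.sample(T, k)                    # k random points of T
    f_last, f_current = None, assign(T, centers, d)
    it = 0
    while f_last != f_current and it < max_iter:
        f_last = f_current
        # Move each centre to the average of its group (keep the old centre
        # if the group happens to be empty).
        groups = [[t for t in T if f_last[t] == i] for i in range(k)]
        centers = [avg(g) if g else centers[i] for i, g in enumerate(groups)]
        f_current = assign(T, centers, d)
        it += 1
    return f_current

def assign(T, centers, d):
    """f(t) = argmin_j d(t, center_j)."""
    return {t: min(range(len(centers)), key=lambda j: d(t, centers[j])) for t in T}

# Euclidean distance and average for 2-D points stored as tuples.
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

T = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(k_means(2, T, euclid, mean))
# the two well-separated blobs should end up in different groups
```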

Run time

We will go on to “hand wave” that this algorithm can never return to the same state without stopping. Let $n = |T|$; then the number of possible iterations is at most $k^n$ (a lazy estimate of the number of ways to partition $T$ into $k$ groups). Each iteration takes $O(nk)$, as we need to calculate the distance between each point and each of the centres and then calculate the averages of each of the centres.

This gives the run time to be $O(k^n \cdot nk)$.

In practice this algorithm is much faster than the run time lets on.

Correctness

Correctness in this situation will mean that the algorithm terminates eventually. To “handwave” this we will phrase the problem as a Hill climbing problem.

Let the configurations be a partition $\{C_i\}_{i=1}^k$ of $T$ together with centres $\{\mathrm{center}_i\}_{i=1}^k$. Then we can score each configuration as

$$\mathrm{score}\left(\{C_i\}, \{\mathrm{center}_i\}\right) = -\sum_{i=1}^{k} \sum_{t \in C_i} d\left(t, \mathrm{center}_i\right)^2.$$

To define a configuration’s neighbourhood let $\mathrm{center}_i' = \mathrm{Avg}(C_i)$ and

$$C_i' = \left\{ t \in T \;\Bigg\vert\; \left( \operatorname*{arg\,min}_{1 \leq j \leq k} d\left(t, \mathrm{center}_j'\right) \right) = i \right\},$$

so the neighbours of a configuration are obtained by either updating the centres to $\{\mathrm{center}_i'\}$ or updating the partition to $\{C_i'\}$. Note that

$$\mathrm{score}\left(\{C_i\}, \{\mathrm{center}_i\}\right) \leq \mathrm{score}\left(\{C_i\}, \{\mathrm{center}_i'\}\right) \leq \mathrm{score}\left(\{C_i'\}, \{\mathrm{center}_i'\}\right).$$

The first inequality follows from the property of the average we have required. The second one follows as we picked $C_i'$ to minimise the distance between the points and the centres.

Then our algorithm is identical to the Hill climbing algorithm in this set up.

Double hop

This argument only works when you think of one iteration of our algorithm as 2 steps in the Hill climbing. First you update the centres (one hop), then you update the partition (two hop). The state we visit after the first hop will only have at most 1 neighbour, as in the previous step we will have picked either the centres or the partition to be optimal.

Therefore we don’t revisit a state?

Might only be true in Euclidean spaces

Given this is the same as Hill climbing we can get stuck in a local optimum which is not a global optimum. There are two ways to get around this:

  • Do it with random restarts, or
  • Pick the starting states well.

Soft clustering

Expectation Maximisation

Expectation maximisation is an algorithm which can be used on problems such as the clustering problem. It is used in situations where you assume your data follows some underlying distribution which in this instance is given by parameters $\theta$. We assume $\theta$ lies in our hypothesis space $H$.

Suppose we draw samples from $X$, which has an observed component $Y$ and an unobserved component $Z$. Therefore the samples of $X$ are given by $x_i = (y_i, z_i)$.

Once you are provided with $Y$ you can treat the values of $Z$ as random variables, and similarly with $X$. Both $Z$ and $X$ depend on the unknown parameters $\theta$.

During the EM algorithm we will update our current hypothesis for the parameters, $h$, to a new hypothesis $h'$. We want to update by trying to maximise

$$Q(h' \mid h) = \mathbb{E}_Z\!\left[ \ln p(X \mid h') \;\middle\vert\; h, Y \right].$$

Let’s unpack this: we use our current hypothesis $h$ to guess at the values of $Z$. This gives us the $z_i$ and therefore the full samples $x_i = (y_i, z_i)$. Now we want to find the hypothesis $h'$ (or parameters $\theta$ for $p$) that maximises the log probability that $X$ would occur. This follows a nice, natural two step process.

  1. Given our current hypothesis $h$, calculate the expected values of the unobserved variables $Z$; this gives us the function $Q(\cdot \mid h)$, and then
  2. using the values of $Y$ and the expected values for $Z$, find a hypothesis $h'$ maximising $Q(h' \mid h)$.

If the function $Q$ is continuous (in $h$ and $h'$) then we will converge to a local maximum of the likelihood; however, we don’t know how long this will take or whether this local maximum is a global maximum.

Correctness

  • We are monotonically decreasing the Maximum likelihood error of the groupings.
  • Though this is not guaranteed to converge.
  • Though it will not diverge either!
  • It can get stuck in local optima.
    • So it is best that we restart this algorithm (with a different random initialisation) and take the best outcome.

A concrete example of this comes from using normal distributions. Assume our classes are given by $k$ normal distributions $N(\mu_i, \sigma^2)$, where each one has the same standard deviation $\sigma$ but a different mean $\mu_i$.

Then we are going to iteratively update

  1. the expectation $\mathbb{E}[z_{ij}]$ of the event that the point $x_i$ belongs to the class $j$, given the hypothesis that the clusters use the means $\mu_1, \ldots, \mu_k$, and
  2. the parameters $\mu_j$ using the distribution of the $z_{ij}$.

First we will describe step one, given some means $\mu_1, \ldots, \mu_k$. We set

$$\mathbb{E}[z_{ij}] = \frac{p(x_i \mid \mu_j)}{\sum_{l=1}^{k} p(x_i \mid \mu_l)} = \frac{\exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right)}{\sum_{l=1}^{k} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_l)^2\right)},$$

where the formula for $p(x_i \mid \mu_j)$ comes from the normal distribution $N(\mu_j, \sigma^2)$ (the normalising constants cancel).

The second step just calculates the best parameters given these probabilities. In this case the means

$$\mu_j = \frac{\sum_{i} \mathbb{E}[z_{ij}]\, x_i}{\sum_{i} \mathbb{E}[z_{ij}]},$$

where the sums run over all training points $x_i$.

This process is not guaranteed to stop.
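
A toy Python sketch of these two updates for the one-dimensional, shared-$\sigma$ case (the name `em_gaussian`, the fixed iteration count and the random initialisation from the data are my own choices):

```python
import math
import random

def em_gaussian(xs, k, sigma=1.0, iters=50):
    """EM for a mixture of k one-dimensional Gaussians with a shared,
    known standard deviation sigma; only the means are learned."""
    mus = random.sample(xs, k)                      # initial guesses for the means
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        E = []
        for x in xs:
            w = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
            total = sum(w)
            E.append([wj / total for wj in w])
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij].
        mus = [
            sum(E[i][j] * xs[i] for i in range(len(xs))) /
            sum(E[i][j] for i in range(len(xs)))
            for j in range(k)
        ]
    return mus

data = [0.1, 0.3, -0.2, 5.0, 5.2, 4.9]
print(sorted(em_gaussian(data, 2)))   # the means should end up near 0 and 5
```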


Soft clustering

Suppose we are in the clustering problem set up. Soft clustering will use Expectation Maximisation with the Gaussian distribution. Here we fix a standard deviation $\sigma$ and then we have

$$p(x \mid \mu_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu_i)^2}{2\sigma^2}\right).$$

We will use the means $\mu_i$ as the parameters to optimise at each step.

Correctness


Clustering properties

There are 3 properties we would want a clustering algorithm to have.

  • Richness: For any assignment of objects to clusters, there is some distance $d$ such that the clustering algorithm returns that clustering.
  • Scale-invariant clustering: Scaling distances by a positive value does not change the clustering.
  • Consistent clustering: Shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering.

For example, Single linkage clustering as we have defined it (with the number of clusters fixed at $k$):

  • Is not rich, as it cannot put all of the points into a single cluster.
  • Is scale-invariant, as scaling preserves which pair of clusters attains the minimum distance (see the numerical check below).
  • Is consistent; this is a little harder to show, but I believe it.
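
As a quick numerical sanity check of the scale-invariance claim, here is a sketch using SciPy’s single-linkage implementation (assuming `numpy` and `scipy` are available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))

# Single linkage, cut to k = 3 clusters.
labels = fcluster(linkage(points, method="single"), t=3, criterion="maxclust")

# Scaling every point (and hence every pairwise distance) by a positive
# constant should give the same clustering, up to relabelling.
labels_scaled = fcluster(linkage(10 * points, method="single"), t=3, criterion="maxclust")

print(labels)
print(labels_scaled)   # describes the same grouping of the 20 points
```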

Unfortunately we can’t have them all!

Impossibility Theorem

Statement

Lemma

Any clustering algorithm cannot be all of the following:

  • Rich,
  • Scale-invariant, and
  • Consistent.

Proof
