k-means clustering
Suppose we are in the clustering problem setup. Suppose we are provided training data T drawn from some space A, a number of clusters k, a symmetric distance function d on A, and a method Avg of averaging points of A.
In the Euclidean example we will have A = R^m with d the Euclidean distance and Avg the coordinate-wise mean.
To start the algorithm we pick k random points of T to be our initial centres and assign each point of T to its nearest centre.
Then we update our centres to the averages of their groups, reassign each point to its nearest centre, and repeat until the assignment stops changing.
Pseudocode
KMeans(k, T, d, Avg):
Input:
k - the number of clusters
T - the training data
d - the symmetric distance function on A
Avg - a method of averaging points
Output:
A function f mapping T to k groups.
1. Pick k random points in T and set these to be center_1, ..., center_k
2. Set f_current(t) = argmin_j d(t, center_j) for all t in T, and set f_last to be empty
3. While f_last is empty or f_last != f_current:
    1. Set f_last = f_current
    2. Set center_i = Avg(f_last^{-1}(i)) for all 1 <= i <= k
    3. Set f_current(t) = argmin_j d(t, center_j) for all t in T
4. Return f_current
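To make the pseudocode concrete, here is a minimal sketch in Python for the Euclidean example; the names and the handling of empty groups are my own choices, not fixed by the notes. It uses the squared Euclidean distance for d, since the coordinate-wise mean then minimises a group's total distance to its centre, which is the averaging property the correctness argument below relies on.

```python
import random

def k_means(k, T, d, avg):
    """Sketch of the pseudocode: maps each index of T to a group in 0..k-1."""
    # Step 1: pick k random points of T as the initial centres.
    centers = random.sample(T, k)
    # Step 2: assign every point to its nearest centre; f_last starts empty.
    f_current = {i: min(range(k), key=lambda j: d(t, centers[j]))
                 for i, t in enumerate(T)}
    f_last = None
    # Step 3: alternate centre updates and reassignments until stable.
    while f_last != f_current:
        f_last = f_current
        # 3.2: move each centre to the average of its group (skip empty groups).
        for j in range(k):
            group = [T[i] for i, g in f_last.items() if g == j]
            if group:
                centers[j] = avg(group)
        # 3.3: reassign each point to its nearest centre.
        f_current = {i: min(range(k), key=lambda j: d(t, centers[j]))
                     for i, t in enumerate(T)}
    # Step 4.
    return f_current

def sq_dist(p, q):
    # Squared Euclidean distance: symmetric, as required.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    # Coordinate-wise mean of a non-empty list of tuples.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(k_means(2, points, sq_dist, mean))  # e.g. {0: 1, 1: 1, 2: 0, 3: 0}
```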
Run time
We will go on to “hand wave” that this algorithm can never return to the same state without stopping. Let n = |T|. Each state of the loop is determined by the assignment f_current, and there are at most k^n functions from T to the k groups, so the loop runs at most k^n times.
This gives the run time to be O(nk * k^n) = O(nk^{n+1}) in the worst case, as each iteration takes O(nk) work: nk distance evaluations for the reassignment and O(n) work for the averages (assuming d and Avg cost constant time per point).
In practice this algorithm is much faster than the run time lets on, converging after far fewer iterations than the exponential bound suggests.
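As a worked version of this bound (my own accounting, under the assumption above that each evaluation of d costs constant time and each Avg costs time linear in its group):

$$ \underbrace{k^{n}}_{\text{possible assignments } f : T \to \{1,\dots,k\}} \cdot \underbrace{O(nk)}_{\text{work per iteration}} = O\!\left(n k^{n+1}\right) $$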
Correctness
Correctness in this situation will mean that the algorithm terminates eventually. To “handwave” this we will phrase this problem as a Hill climbing problem.
Let the configurations be the partitions of T into k groups together with the centres, i.e. pairs (f, (center_1, ..., center_k)).
To define a configuration's value we use the cost Phi(f, centres) = sum_{t in T} d(t, center_{f(t)}), which neither step of the algorithm ever increases:
Phi(f_last, new centres) <= Phi(f_last, old centres), and Phi(f_current, new centres) <= Phi(f_last, new centres).
The first inequality follows from the property of average we have required, namely that replacing a group's centre with Avg of the group does not increase the group's total distance to its centre. The second one follows as we picked f_current(t) = argmin_j d(t, center_j), so no point's distance to its centre can increase.
Then our algorithm is identical to the Hill climbing algorithm in this setup.
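As a quick sanity check of this monotonicity (an illustration, not a proof), the sketch below repeats the loop of the earlier Python sketch while asserting that the cost Phi never increases between iterations; it assumes the sq_dist and mean helpers defined there.

```python
import random

def phi(T, f, centers, d):
    # Cost of a configuration: total distance of each point to its centre.
    return sum(d(t, centers[f[i]]) for i, t in enumerate(T))

def k_means_checked(k, T, d, avg):
    """k-means as above, asserting Phi is non-increasing across iterations."""
    centers = random.sample(T, k)
    f_current = {i: min(range(k), key=lambda j: d(t, centers[j]))
                 for i, t in enumerate(T)}
    f_last, cost = None, phi(T, f_current, centers, d)
    while f_last != f_current:
        f_last = f_current
        for j in range(k):
            group = [T[i] for i, g in f_last.items() if g == j]
            if group:
                centers[j] = avg(group)
        f_current = {i: min(range(k), key=lambda j: d(t, centers[j]))
                     for i, t in enumerate(T)}
        new_cost = phi(T, f_current, centers, d)
        # Both hops should only ever lower the cost (up to float rounding).
        assert new_cost <= cost + 1e-9, "Phi increased between iterations"
        cost = new_cost
    return f_current

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (2.5, 2.5)]
print(k_means_checked(2, points, sq_dist, mean))  # sq_dist, mean from above
```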
Double hop
This argument only works when you think of one iteration of our algorithm as 2 steps in the Hill climbing: first you update the centres (hop one), then you update the partition (hop two). The state we visit after the first hop will have at most 1 improving neighbour, as in the previous step we will have picked either the centres or the partition to be optimal.
Therefore we don’t revisit a state?
Might only be true in Euclidean spaces
Given this is the same as Hill climbing, we can get stuck in a local optimum which is not a global optimum. There are two ways to get around this:
- Do it with random restarts (a sketch follows below), or
- Pick the starting states well.
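A minimal sketch of the first idea, reusing the hypothetical k_means, sq_dist, and mean helpers from the sketches above (the restart count of 10 is an arbitrary choice): run the algorithm several times from fresh random centres and keep the clustering with the lowest total cost.

```python
def k_means_restarts(k, T, d, avg, restarts=10):
    """Run k_means several times and keep the lowest-cost clustering."""
    best_f, best_cost = None, float("inf")
    for _ in range(restarts):
        f = k_means(k, T, d, avg)
        # Recompute the centres and the cost Phi for this clustering.
        centers = {j: avg([T[i] for i, g in f.items() if g == j])
                   for j in set(f.values())}
        cost = sum(d(t, centers[f[i]]) for i, t in enumerate(T))
        if cost < best_cost:
            best_f, best_cost = f, cost
    return best_f

print(k_means_restarts(2, points, sq_dist, mean))
```

For the second idea, a common choice of initialisation scheme is k-means++, which spreads the initial centres out before the main loop runs.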