k-nearest neighbour

This algorithm is an Instance-based learning model: it uses a point's closest neighbours to determine what its value should be.

In the modelling framework, suppose we are given:

  • training data $T \subseteq A \times B$,
  • a metric on $A$ called $d$,
  • $k$, the number of points to consider, and
  • an averaging mechanism $V$ that takes a weighted set of values in $B$ and returns another value in $B$ - like the weighted geometric mean.

Then, to model our data: given a new input point $a \in A$, we find the $k$-nearest neighbours of $a$ in our training data and take the average $V$ of their associated values in $B$.

This model is very simple - however, a lot of the complexity comes in defining $d$ and $V$.
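
Concretely, writing $N_k(a)$ for the set of $k$-nearest neighbours of $a$ in $T$ under $d$ (the notation $N_k$ and $\hat{b}$ is introduced here for illustration, not taken from the definitions above), the prediction can be written as

	$\hat{b}(a) = V(\{(b', d(a', a)) : (a', b') \in N_k(a)\})$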

Pseudocode

KNN(T, d, k, V, a):
	Input:
		T our training data, a set of pairs in A × B
		d metric on the space of A
		k the number of points to average
		V an averaging mechanism that picks a value in B from the weighted neighbours
		a in A
	Output:
		b in B, the model's prediction
1. Set N = get_k_nearest_neighbours(T, d, k, a)
2. Set b = V({(b',d(a',a)) for (a',b') in N})
3. Return b
 
get_k_nearest_neighbours(T, d, k, a):
	Input:
		T our training data, a set of pairs in A × B
		d metric on the space of A
		k the number of neighbours to find
		a in A
	Output:
		N subset of T, the k-nearest neighbours to a
1. Let X be an empty list
2. For each (a',b') in T:
	2.1. Append ((a',b'), d(a',a)) to X
3. Sort X in ascending order by its second component (the distance).
4. Let N = [t for the first k elements (t, d') in X]
     # You could also let N include any elements at the same distance
     # as the k'th element.
5. Return N
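
Below is a minimal Python sketch of the pseudocode above. The choices of Euclidean distance for d and an inverse-distance weighted mean for V are illustrative assumptions - the algorithm itself fixes neither.

import math

def euclidean(a, a2):
	# Example metric d: Euclidean distance between equal-length tuples.
	return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, a2)))

def inverse_distance_mean(weighted_values):
	# Example averaging mechanism V: weight each neighbour's value by
	# 1/distance, with a small epsilon to guard against zero distances.
	eps = 1e-9
	pairs = [(b, 1.0 / (dist + eps)) for b, dist in weighted_values]
	total = sum(w for _, w in pairs)
	return sum(b * w for b, w in pairs) / total

def get_k_nearest_neighbours(T, d, k, a):
	# Steps 1-3: pair every training example with its distance to a, then sort.
	X = sorted(((t, d(t[0], a)) for t in T), key=lambda pair: pair[1])
	# Step 4: keep the first k examples (this version drops ties at the
	# k'th distance rather than including them).
	return [t for t, _ in X[:k]]

def knn(T, d, k, V, a):
	# Find the k nearest neighbours, then let V average their values.
	N = get_k_nearest_neighbours(T, d, k, a)
	return V([(b2, d(a2, a)) for a2, b2 in N])

For example, with four one-dimensional points:

T = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 8.0)]
print(knn(T, euclidean, 2, inverse_distance_mean, (1.2,)))  # ≈ 2.4

Note there is no separate training step - the "model" is just the stored data T, which is the laziness discussed in the next section.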

Run time

This algorithm is a lazy learner and puts off compute until it is executed. This means learning is $O(1)$ - the model just stores the training data - while execution can take a while: a naive implementation measures the distance from the query point to every training point, so each prediction costs $O(|T|)$ distance evaluations plus the sort.

Correctness

This model has some Bias (modelling) in the representations it prefers:

  • Near points are similar,
  • The function it models is smooth with respect to the distance,
  • All features equally matter (see the sketch after this list),
  • All points equally matter, and
  • New input points are well represented by the training data.
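
To illustrate the "all features equally matter" assumption, here is a small sketch (with made-up numbers) of how an unscaled feature can dominate a Euclidean metric:

import math

def euclidean(a, a2):
	return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, a2)))

# Feature 1 in metres, feature 2 in grams - illustrative values only.
p = (1.0, 5000.0)
q = (1.1, 4000.0)  # almost the same first feature as p
r = (9.0, 5000.0)  # wildly different first feature from p
print(euclidean(p, q))  # ~1000.0: the gram-scale feature dominates
print(euclidean(p, r))  # 8.0: r is deemed far "closer" to p than q is

Rescaling the features (for example, standardising each to unit variance) is the usual way to restore this assumption.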