k-nearest neighbour
This algorithm is an Instance-based learning model: it uses a query point's closest neighbours in the training data to determine what its value should be.
In the modelling framework, suppose we are given:
- training data $T \subseteq A \times B$,
- a metric on $A$ called $d$,
- $k \in \mathbb{N}$, to be the number of points to consider, and
- an averaging mechanism $V$ that takes a weighted set of values in $B$ and returns another value in $B$ - like the weighted geometric mean.

Then to model our data when given a new input point $a \in A$: we find the $k$-nearest neighbours in our training data, then take the average of their associated values in $B$.
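In symbols (a sketch of the notation implied above, writing $N_k(a)$ for the set of $k$ training pairs nearest to $a$):

$$\hat{b}(a) = V\big(\{(b',\, d(a', a)) : (a', b') \in N_k(a)\}\big)$$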
This model is very simple - however a lot of the complexity comes in defining the metric $d$ and the averaging mechanism $V$.
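For instance, here is a minimal sketch of two common choices (Python, with names of my own choosing): the Euclidean metric on $A = \mathbb{R}^n$ for $d$, and an inverse-distance weighted mean for $V$, which weights nearer neighbours more heavily:

```python
import math

def euclidean(a1, a2):
    """A common choice of metric d on A = R^n."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def inverse_distance_mean(weighted_values, eps=1e-9):
    """A common choice of V for regression: `weighted_values` is a
    collection of (value, distance) pairs; smaller distances get
    larger weights. `eps` guards against division by zero."""
    pairs = [(b, 1.0 / (dist + eps)) for b, dist in weighted_values]
    total = sum(w for _, w in pairs)
    return sum(b * w for b, w in pairs) / total
```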
Pseudocode
KNN(T, d, k, V, a):
Input:
T our training data, which is pairs of points in A and B
d a metric on the space A
k the number of points to average
V an averaging mechanism that combines the neighbours' values, weighted by their distances
a in A, the query point
Output:
b in B, the model's prediction
1. Set N = get_k_nearest_neighbours(T, d, k, a)
2. Set b = V({(b',d(a',a)) for (a',b') in N})
3. Return b
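Putting this together, a minimal Python sketch of the main routine (names are mine, following the pseudocode; `get_k_nearest_neighbours` is sketched after the next block):

```python
def knn(T, d, k, V, a):
    """Predict a value in B for the query point `a` in A."""
    N = get_k_nearest_neighbours(T, d, k, a)
    # Pass each neighbour's value together with its distance to the query,
    # so V can weight nearer neighbours more heavily if it chooses.
    # A list is used instead of a set to avoid collapsing duplicate pairs.
    return V([(b_prime, d(a_prime, a)) for (a_prime, b_prime) in N])
```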
get_k_nearest_neighbours(T, d, k, a):
Input:
T our training data, which is pairs of points in A and B
d a metric on the space A
k the number of neighbours to return
a in A, the query point
Output:
N subset of T, the k-nearest neighbours to a
1. Let X be a list
2. For each (a', b') in T:
2.1. Append ((a', b'), d(a', a)) to X
3. Sort X by its second component, in increasing order.
4. Let N = {t for the first k elements (t, d') in X}
# You can also let N include elements of equal distance as the k'th
# element
5. Return N
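A direct Python translation of this routine might look as follows (again a sketch; like the pseudocode it sorts everything, and it cuts ties off at the k'th element rather than including them):

```python
def get_k_nearest_neighbours(T, d, k, a):
    """Return the k training pairs (a', b') whose inputs are closest to `a`."""
    X = [((a_prime, b_prime), d(a_prime, a)) for (a_prime, b_prime) in T]
    X.sort(key=lambda pair: pair[1])  # sort by distance, nearest first
    return [t for (t, _dist) in X[:k]]
```

With the earlier sketches of `euclidean` and `inverse_distance_mean`, a hypothetical end-to-end use (data invented for illustration):

```python
T = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((5.0, 5.0), 10.0)]
print(knn(T, euclidean, 2, inverse_distance_mean, (0.5, 0.0)))  # ~1.5
```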
Run time
This algorithm is a lazy learner and puts off compute until it is executed. This means learning is $O(1)$ - we simply store the training data - while each prediction costs $|T|$ evaluations of $d$ plus a sort, $O(|T| \log |T|)$ in the pseudocode above.
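If only the k nearest points are needed, the full sort can be avoided. A sketch using Python's standard heapq module brings the per-query selection cost down to $O(|T| \log k)$:

```python
import heapq

def get_k_nearest_neighbours_heap(T, d, k, a):
    """Same result as the sorted version above, but keeps a bounded heap,
    costing O(|T| log k) instead of O(|T| log |T|)."""
    X = (((a_prime, b_prime), d(a_prime, a)) for (a_prime, b_prime) in T)
    return [t for (t, _dist) in heapq.nsmallest(k, X, key=lambda p: p[1])]
```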
Correctness
This model has some Bias (modelling) in the representations it prefers:
- Near points have similar values,
- The function being modelled is smooth with respect to the distance $d$,
- All features matter equally (see the scaling sketch below),
- All points matter equally, and
- Input points are well represented by the training data.
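To see why "all features matter equally" is an assumption rather than a given, consider this small illustration (data invented, reusing the `euclidean` sketch from above): when one feature has a much larger scale, it dominates the distance, so the other features barely influence which neighbours are chosen.

```python
# Feature 1 ranges over [0, 1]; feature 2 ranges over [0, 1000].
p = (0.0, 500.0)
q_same_f1 = (0.0, 510.0)  # identical feature 1, 10 apart in feature 2
q_same_f2 = (1.0, 500.0)  # maximally far in feature 1, identical feature 2

print(euclidean(p, q_same_f1))  # 10.0 - the feature-2 gap dominates
print(euclidean(p, q_same_f2))  # 1.0  - the feature-1 gap looks tiny
# Without rescaling (e.g. normalising each feature to [0, 1] first),
# feature 2 effectively decides the neighbourhood on its own.
```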