Instance:
a collection (dataset) of datapoints from \(\mathbf{X}\)
a classification system \(C = \{c_1, c_2, \dots c_r\}\)
Solution: classification function \(\gamma: \mathbf{X} \rightarrow C\)
Measure: misclassification
\(r=2\): positive and negative.
Misclassification is described by the confusion matrix, which scores the result of classification against labeled examples.
Often one class is of more interest than the other: better measures than overall accuracy are needed.
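A minimal sketch, assuming binary labels encoded as strings, of how such a matrix is tallied and why overall accuracy can mislead when one class matters more (the label encoding and the counts below are illustrative, not from the source):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, positive="pos"):
    """Tally TP, FP, FN, TN for a binary classification result."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts

# Illustrative example: 95% accuracy, yet half of the positives are missed.
y_true = ["neg"] * 90 + ["pos"] * 10
y_pred = ["neg"] * 95 + ["pos"] * 5
print(confusion_matrix(y_true, y_pred))  # TN: 90, FN: 5, TP: 5, no false positives
```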
With just two numerical dimensions, datapoint similarity can be interpreted as simple Euclidean distance.
Being very close \(\Longleftrightarrow\) being very similar.
Are 4 and 6 more similar to each other than -1 and +1?
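A quick numerical check, as a minimal sketch: under plain Euclidean distance the two one-dimensional pairs above are equally far apart, and hence equally similar by this measure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean([4], [6]))    # 2.0
print(euclidean([-1], [1]))   # 2.0 -> equally similar under this measure
```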
Assumption: small changes in the values do not alter the classification; points that are close will receive the same classification.
if the distance is small then assign the same class
Take a set of labeled points (the examples); all other points are blank for the moment.
Give each blank point the same label as its nearest labeled example.
This is the nearest-neighbor (NN, or 1-NN) algorithm.
\[\gamma(\mathbf{x}) = y_{i}, \qquad i = \operatorname{argmin}_{j} \,\lVert \mathbf{x}_{j} - \mathbf{x} \rVert\]
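A minimal Python sketch of this rule; the training points and labels below are invented for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def gamma_1nn(x, examples):
    """1-NN: return the label y_i of the training example closest to x."""
    point, label = min(examples, key=lambda ex: euclidean(ex[0], x))
    return label

examples = [((0.0, 0.0), "neg"), ((1.0, 1.0), "neg"), ((4.0, 4.0), "pos")]
print(gamma_1nn((3.5, 3.0), examples))  # 'pos' -- the nearest example is (4, 4)
```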
Consider the k nearest neighbors
Assign the class that is the most common among them
Variation: weight each neighbor's vote by its distance, so that closer neighbors count more (a sketch of both variants follows below).
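A minimal sketch of both variants, plain majority vote and inverse-distance weighting; the function names and data are illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def gamma_knn(x, examples, k=3, weighted=False):
    """k-NN: majority vote among the k nearest examples,
    optionally weighting each vote by the inverse of the distance."""
    neighbors = sorted(examples, key=lambda ex: euclidean(ex[0], x))[:k]
    votes = Counter()
    for point, label in neighbors:
        d = euclidean(point, x)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]

examples = [((0, 0), "neg"), ((1, 0), "neg"), ((0, 1), "neg"),
            ((4, 4), "pos"), ((5, 4), "pos")]
print(gamma_knn((3, 3), examples, k=3))                 # plain majority vote
print(gamma_knn((3, 3), examples, k=3, weighted=True))  # distance-weighted vote
```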
Given the labeled examples, k-NN determines the regions around them that are assigned to each class.
In effect, k-NN learns a decision region or surface and applies it in classification (see the grid sketch below).
A larger k does not always mean a better classification
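A rough, self-contained sketch of this idea: evaluating a nearest-neighbor classifier on a coarse grid prints the regions of the plane that it implicitly assigns to each class (the labeled points are made up):

```python
import math

examples = [((1, 1), "-"), ((2, 5), "-"), ((5, 2), "+"), ((6, 6), "+")]

def nearest_label(x):
    """Label of the training example closest to x (1-NN decision rule)."""
    return min(examples, key=lambda ex: math.dist(ex[0], x))[1]

# Print the induced decision regions on an 8x8 grid, top row first.
for row in range(7, -1, -1):
    print("".join(nearest_label((col, row)) for col in range(8)))
```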
k-NN
introduces us to voting systems
is effective when the two classes are balanced, i.e., not skewed, in the dataset
there is no standard way to choose k, yet it may greatly influence the outcome:
on large training datasets, the error rate of even 1-NN approaches at most twice the irreducible (Bayes) error rate.
Sometimes improving accuracy on the training data does not translate into improved accuracy when testing against unseen data:
1-NN is perfect on the training data but only 0.7 on the test data.
Higher values of k do not improve the results much, and overfitting creeps in.
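A rough sketch of that gap on synthetic two-dimensional data; the data, the split, and the chosen values of k are assumptions made for illustration. With k = 1 the training accuracy is trivially perfect, since each training point is its own nearest neighbor, while test accuracy need not follow.

```python
import math
import random
from collections import Counter

random.seed(0)

def make_point(label):
    """Sample a 2-D point from one of two overlapping Gaussian blobs."""
    cx, cy = (0.0, 0.0) if label == "neg" else (2.0, 2.0)
    return ((random.gauss(cx, 1.5), random.gauss(cy, 1.5)), label)

data = [make_point("neg") for _ in range(200)] + [make_point("pos") for _ in range(200)]
random.shuffle(data)
train, test = data[:300], data[300:]

def gamma_knn(x, examples, k):
    """Majority vote among the k nearest training examples."""
    neighbors = sorted(examples, key=lambda ex: math.dist(ex[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(dataset, k):
    return sum(gamma_knn(x, train, k) == y for x, y in dataset) / len(dataset)

print("k  train  test")
for k in (1, 3, 5, 15):
    print(k, round(accuracy(train, k), 2), round(accuracy(test, k), 2))
```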