K-Nearest Neighbors
Definition
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
How it works
The goal of the k-nearest neighbors algorithm is to identify the nearest neighbors of a given query point, so that a class label can be assigned to that point. To determine which data points are closest to a given query point, the distance between the query point and the other data points must be calculated. The distance measures used can vary depending on the data set or implementation and help inform decision boundaries, which partition query points into different regions. Then, by defining the k-value (the number of neighbors to be checked to determine the classification of a specific query point), the data point can be assigned its class label.
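As a rough illustration, the neighbor-search step can be sketched in a few lines of NumPy. The toy arrays, the query point, and the choice of Euclidean distance below are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical toy training data: five 2-D points with binary class labels
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])

query = np.array([2.5, 3.0])   # the point we want to classify
k = 3                          # number of neighbors to consult

# Euclidean distance from the query point to every training point
distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))

# Indices of the k smallest distances, i.e. the k nearest neighbors
nearest = np.argsort(distances)[:k]
print(nearest, y_train[nearest])
```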
For classification problems, a class label is assigned on the basis of a majority vote, i.e. the label that is most frequently represented among a given point's neighbors is used (the term "majority vote" is commonly used in the literature; however, the technique is more technically "plurality voting"). Regression problems use a similar concept to classification problems, but in this case the average of the k nearest neighbors' target values is taken to make a prediction. The main distinction is that classification is used for discrete values, whereas regression is used with continuous ones.
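The voting and averaging steps can be sketched as follows; the neighbor labels and target values below are hypothetical stand-ins for the k nearest neighbors found in the previous step.

```python
from collections import Counter
import numpy as np

neighbor_labels = np.array([0, 0, 1])          # labels of the k nearest neighbors (classification)
neighbor_values = np.array([4.2, 3.9, 10.5])   # target values of the neighbors (regression)

# Classification: the most frequently occurring label wins (plurality vote)
predicted_class = Counter(neighbor_labels.tolist()).most_common(1)[0][0]

# Regression: average the neighbors' target values
predicted_value = neighbor_values.mean()

print(predicted_class, predicted_value)
```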
Unlike algorithms that explicitly model the problem, such as linear regression, KNN is instance-based. This means that the algorithm does not explicitly learn a model; instead, it memorizes the training instances and uses them as "knowledge" during the prediction phase. It is also worth noting that KNN is part of a family of "lazy learning" models, meaning that it only stores the training dataset rather than undergoing a training stage.
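For instance, with scikit-learn (assuming it is installed), "fitting" a KNN classifier essentially amounts to storing the training set; the distance computations happen only when predictions are requested. The iris data set is used here purely as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # lazy learning: no explicit model is built here
print(knn.score(X_test, y_test))   # neighbors are searched only at query time
```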
Considerations
Scaling: Scaling is a concern because KNN is a lazy algorithm: it retains the entire training set, so it takes up more memory and storage than other classification methods as the data grows.
Implementation and Hyperparameters: As KNN only requires a k-value and a distance metric, it is often easy to implement and can adjust well to new training data.
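A minimal sketch of tuning these two hyperparameters with a grid search is shown below; the data set and parameter grid are illustrative assumptions, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the two main hyperparameters: k and the distance metric
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```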
Key Test Considerations
Supervised Learning:
- Cross Validation: Cross-validation methods such as k-fold, leave-one-out, and stratified cross-validation can help validate model performance. However, nuances such as the pessimistic bias of k-fold cross-validation or the high variability of leave-one-out cross-validation may need consideration.
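A minimal sketch of these cross-validation schemes applied to a KNN classifier, assuming scikit-learn and the iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Compare the three splitters mentioned above on the same estimator
splitters = [
    ("5-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("stratified 5-fold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
    ("leave-one-out", LeaveOneOut()),
]

for name, cv in splitters:
    scores = cross_val_score(knn, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```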
Classification:
ROC Curve: A standard technique used to summarize classifier performance over a range of tradeoffs between true positive and false positive rates is the Receiver Operating Characteristic (ROC) curve.
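A minimal sketch of computing a ROC curve for a KNN classifier, assuming scikit-learn and its bundled binary breast-cancer data set, with the classifier's predicted probabilities used as scores:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scores = knn.predict_proba(X_test)[:, 1]           # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # the true/false positive tradeoff curve
print("AUC:", roc_auc_score(y_test, scores))
```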
Data Imbalance: For imbalanced data sets, where one class significantly outnumbers the others, oversampling techniques like SMOTE may be beneficial for generating synthetic samples of the minority classes.
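A minimal sketch of applying SMOTE before fitting a classifier, assuming the third-party imbalanced-learn package is installed; the synthetic imbalanced data set is only illustrative.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative 90/10 imbalanced binary data set
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```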
K-Nearest Neighbors:
Choice of K: The number of neighbors, K, affects the decision boundary. A smaller K can lead to a noisy decision boundary, while a large K can smooth it out, but may also blur class distinctions.
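A minimal sketch of this tradeoff, assuming scikit-learn and the iris data set; a perfect training score at K=1 alongside a lower cross-validated score is a typical sign of an overly noisy boundary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare training accuracy with cross-validated accuracy for several K values
for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    cv_score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k:>2}  train accuracy={knn.score(X, y):.3f}  cv accuracy={cv_score:.3f}")
```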
K-d Tree: Exact nearest neighbor search on large datasets can be computationally costly and inefficient. Space-partitioning structures such as the k-d tree, or approximate nearest neighbor algorithms, can make the search tractable.
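A minimal sketch of tree-based neighbor search, assuming scikit-learn and SciPy; the random low-dimensional data is an illustrative assumption, and k-d trees are most effective at low to moderate dimensionality.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = (X[:, 0] > 0).astype(int)

# scikit-learn: same estimator, tree-based search under the hood
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)
print(knn.predict(rng.normal(size=(1, 3))))

# SciPy: query the tree directly for distances and indices of the 5 nearest points
tree = cKDTree(X)
dist, idx = tree.query(rng.normal(size=(1, 3)), k=5)
print(dist, idx)
```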
Dimensionality: KNN does not perform well with high-dimensional data and can be sensitive to irrelevant features, which can lead to overfitting.
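One common mitigation is to scale the features and reduce dimensionality before the neighbor search. A minimal sketch assuming scikit-learn and its 64-feature digits data set, with an illustrative number of retained components:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scale, project to 20 principal components, then run KNN on the reduced space
pipeline = make_pipeline(StandardScaler(), PCA(n_components=20),
                         KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(pipeline, X, y, cv=5).mean())
```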
Distance Metric: Choosing an appropriate distance metric (Euclidean, Manhattan, Minkowski, Hamming, etc.) is essential and depends on the nature of the data.
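A minimal sketch of these metrics computed for a single pair of points with SciPy; the example vectors are arbitrary, and Hamming distance is shown on binary vectors, where it is most natural.

```python
from scipy.spatial import distance

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))              # square root of the sum of squared differences
print(distance.cityblock(a, b))              # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))         # Minkowski distance of order p
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # fraction of differing positions
```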
References
- IBM. K-Nearest Neighbors Algorithm. Link.
- Muja, M., & Lowe, D. G. (2014). Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence. Link.
- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357. Link.
- Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence. Link.