Classification
Definition
Classification uses an algorithm to accurately assign test data to specific categories.
How it works
Classification recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in more detail below.
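As an illustration of this workflow, the following minimal sketch (not from the source) fits two of the named algorithms, a decision tree and k-nearest neighbor, to synthetic data; the dataset sizes and parameters are illustrative assumptions.

# Minimal sketch (assumed, not from the source): fit two of the named
# classifiers to synthetic data and report held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset; sizes are illustrative assumptions.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)                               # learn label boundaries
    print(type(model).__name__, model.score(X_test, y_test))  # test accuracy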
Considerations:
There are many different types of classification algorithms for classification predictive modeling problems.
There is no single theory on how to map algorithms onto problem types; instead, it is generally recommended that a practitioner use controlled experiments to discover which algorithm and algorithm configuration results in the best performance for a given classification task.
Key Test Considerations
Machine Learning:
Verify the dataset quality: Check the data to make sure it is free of errors. Quantify the degree of missing values, outliers, and noise in the data collection. If the data quality is low, it may be difficult or impossible to create models and systems with the desired performance.
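One way to quantify missing values and outliers is sketched below with pandas; the file name dataset.csv and the 1.5 * IQR outlier rule are assumptions for illustration.

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# Fraction of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# Flag values outside 1.5 * IQR as candidate outliers (assumed rule of thumb).
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outliers.mean())           # fraction of flagged values per column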
Verify development datasets are representative: Confirm that datasets reflect the expected operational environment and data collection means. Compare distributions of dataset features and labels with exploratory data analysis, and assess the difference between tests on training data and tests on evaluation data (where the evaluation data must be drawn from a representative dataset).
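A simple distribution comparison between training and evaluation data can be sketched with a two-sample Kolmogorov-Smirnov test per feature; the synthetic arrays and the 0.05 significance level below are assumptions.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 3))  # stand-in for training features
evaln = rng.normal(0.1, 1.0, size=(500, 3))   # stand-in for evaluation features

# Test each feature for a distributional shift between the two datasets.
for j in range(train.shape[1]):
    stat, p = ks_2samp(train[:, j], evaln[:, j])
    flag = "possible shift" if p < 0.05 else "ok"
    print(f"feature {j}: KS={stat:.3f} p={p:.3f} {flag}")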
Use software libraries: Use libraries and tools built for ML where possible, so that the underlying code is verified by prior use.
Diagnose model errors with domain SMEs: Have problem domain SMEs investigate model errors for conditions for which the model may underperform and suggest refinements.
Classification:
Use Standard Classification Performance Measures: Not all of the following may be necessary, but each should be considered for use in both verification (developmental test) and operational test stages; a sketch computing several of these measures follows the list:
Accuracy: The fraction of predictions that were correct.
Precision: The proportion of positive identifications that were correct.
Recall: The proportion of actual positive cases identified correctly.
F-Measure: Combines the precision and recall into a single score. It is the harmonic mean of the precision and recall.
Receiver Operating Characteristic (ROC) Curve: A ROC curve shows the performance of a classification model at all classification thresholds. It plots the True Positive Rate against the False Positive Rate.
Area Under the ROC Curve (AUC): This measures the two-dimensional area under the ROC Curve. AUC is scale-invariant and classification-threshold invariant.
ROC TP vs FP points: In addition to a single AUC score, examine performance at specific true positive vs. false positive operating points on the ROC curve, since operational requirements may favor particular trade-offs between the two rates.
Confusion Matrix: A confusion matrix is a table layout that allows the visualization of the performance of an algorithm. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa. It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table).
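The sketch below computes the measures above with scikit-learn; the y_true and y_score vectors are small made-up examples used only for illustration.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score,
                             roc_curve)

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # actual labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])  # model scores
y_pred  = (y_score >= 0.5).astype(int)   # predictions at one chosen threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # ROC operating points
print(confusion_matrix(y_true, y_pred))            # rows: actual, columns: predicted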
Prediction Bias: The difference between the average of the predicted labels and the average of the labels in the data set. One should check for prediction bias when evaluating the classifier's results; a sketch of this check follows the list of causes below. Causes of bias can include:
Noisy data set: Errors in the original data can introduce bias, as the collection method may have an underlying bias.
Processing bug: Errors in the data pipeline can introduce bias.
Biased training sample (unbalanced samples): Model parameters may be skewed towards majority classes.
Overly strong regularization: The model may be underfitting and too simple.
Proxy variables: Model features may be highly correlated.
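The prediction-bias check referenced above reduces to comparing the mean predicted probability with the mean observed label, as in this sketch (the arrays are made up for illustration):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1])                   # observed labels
p_pred = np.array([0.9, 0.3, 0.6, 0.7, 0.4, 0.2, 0.5, 0.8])   # predicted probabilities

# Prediction bias as defined above; a value near zero is desirable.
bias = p_pred.mean() - y_true.mean()
print(f"mean predicted={p_pred.mean():.3f} mean actual={y_true.mean():.3f} bias={bias:+.3f}")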
Overfitting and Underfitting: Overfitting occurs when the model built corresponds too closely or exactly to a particular set of data, and thus may fail to predict additional data reliably. An overfitted model is a mathematical model that contains more parameters than can be justified by the data. Underfitting occurs when the model built does not adequately capture the patterns in the data. As an example, a linear model will underfit a non-linear dataset.
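One common way to probe for over- and underfitting is to compare training and validation scores across model capacities: low scores on both suggest underfitting, while a large train/validation gap suggests overfitting. The sketch below varies decision-tree depth on synthetic data; all parameter choices are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):   # None = unconstrained tree, prone to overfit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f} "
          f"validation={tree.score(X_val, y_val):.2f}")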
Platforms, Tools, or Libraries:
Python:
scikit-learn: A free, open-source machine learning library for Python that includes features for classification.
TensorFlow: An end-to-end open-source machine learning platform.
Keras: An open-source library that provides a Python API designed to enable fast experimentation with deep neural networks.
PyTorch: A machine learning framework based on the Torch library.
R:
caret: The Classification And REgression Training package contains functions to streamline model training for complex regression and classification problems.
randomForest: An implementation of classification and regression based on a forest of trees.