Binary classification is the task of assigning an item to one of two groups based on specified measures or variables. Having previously discussed methods for determining the values of logic gates with neural networks (Part 1 and Part 2), we will begin a series on clustering algorithms that can be performed in Matlab, including k-means clustering and Gaussian mixture models. Before doing so, we will discuss how classification is evaluated, primarily through sensitivity and specificity, and how to calculate these values in Matlab.
To properly evaluate a classifier, its output is compared against a gold standard. High accuracy requires correctly classifying members of both groups. As an example, consider the following simulated samples, where two clusters are generated from each person's height and weight. The goal is to determine the gender of each individual.
The dots represent the classifier's prediction for each individual: male (red) or female (blue). Circles mark the individuals that were misclassified relative to the gold standard. From this classification, we can build a contingency table that covers all combinations of correct and incorrect classifications.
The terms true positive (TP) and true negative (TN) refer to the correct identification of females and males, respectively (or of any test outcome). The terms false negative (FN) and false positive (FP) refer to their incorrect classification. TP, TN, FP and FN can be computed with a simple Matlab function such as the following.
function [sens, spec, ppv, npv] = contingency_table(gold, test_outcome)
% gold is the gold standard classification of all individuals, with
% 1s and 2s representing females and males.
% test_outcome is the predicted classification of females and males,
% again as 1s and 2s.
% sens, spec, ppv and npv stand for sensitivity, specificity,
% positive predictive value, and negative predictive value.

% True positives
TP = test_outcome(gold == 1);
TP = length(TP(TP == 1));

% False positives
FP = test_outcome(gold == 2);
FP = length(FP(FP == 1));

% False negatives
FN = test_outcome(gold == 1);
FN = length(FN(FN == 2));

% True negatives
TN = test_outcome(gold == 2);
TN = length(TN(TN == 2));

% Sensitivity
sens = TP/(TP + FN);

% Specificity
spec = TN/(FP + TN);

% Positive predictive value
ppv = TP/(TP + FP);

% Negative predictive value
npv = TN/(FN + TN);
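To see the function in action, we can construct hypothetical gold-standard and predicted labels. The vectors below are made up for illustration, but are chosen to be consistent with the counts quoted in this tutorial (41 of 50 females and 46 of 50 males classified correctly).

% Hypothetical data: 50 females (1) followed by 50 males (2)
gold = [ones(1, 50), 2*ones(1, 50)];

% Start from a perfect prediction, then introduce errors
test_outcome = gold;
test_outcome(1:9)   = 2;   % 9 females misclassified as male (false negatives)
test_outcome(51:54) = 1;   % 4 males misclassified as female (false positives)

[sens, spec, ppv, npv] = contingency_table(gold, test_outcome);
% sens = 41/50 = 0.82, spec = 46/50 = 0.92

Here ppv and npv come out to 41/45 and 46/55, respectively, since 45 individuals were predicted female and 55 predicted male.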
Sensitivity and specificity tell us how well we classify females and males, respectively. In this example, we have 82% (41/50) sensitivity and 92% (46/50) specificity. While the overall goal of any classification scheme is high accuracy, as you will see in future tutorials, modifying the classification threshold trades sensitivity against specificity: raising one typically lowers the other. This trade-off is often visualized with the receiver operating characteristic (ROC) curve.
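The threshold trade-off can be sketched in a few lines of Matlab. The scores below are a hypothetical scalar measure (e.g., some combination of height and weight), invented here purely to illustrate how sweeping a cut-off traces out the ROC curve; the clustering tutorials will supply real classifier outputs.

% Hypothetical scores: females (1) tend low, males (2) tend high
score = [randn(1, 50) - 1, randn(1, 50) + 1];
gold  = [ones(1, 50), 2*ones(1, 50)];

% Sweep the decision threshold and record sensitivity/specificity
thresholds = linspace(min(score), max(score), 100);
sens = zeros(size(thresholds));
spec = zeros(size(thresholds));
for i = 1:numel(thresholds)
    pred = (score >= thresholds(i)) + 1;   % below cut-off -> 1 (female), above -> 2 (male)
    sens(i) = sum(pred == 1 & gold == 1) / sum(gold == 1);
    spec(i) = sum(pred == 2 & gold == 2) / sum(gold == 2);
end

% ROC curve: false positive rate (1 - specificity) vs sensitivity
plot(1 - spec, sens);
xlabel('1 - specificity');
ylabel('sensitivity');

As the threshold moves, every gain in sensitivity is paid for with a loss in specificity, which is exactly what the ROC curve displays.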
The next tutorial will demonstrate how to use clustering techniques to classify individuals into two distinct categories, and we will evaluate their performance using the concepts and terms described here.