Confusion Matrix and its 25 offspring: or the link between machine learning and epidemiology

Photo by Markus Spiske on Unsplash.

Previous topics

If you know how to create a confusion matrix, you are good to go. If you would like a refresher, have a look at the post on multiple logistic regression with interactions.

Why do we need a confusion matrix?

  • it evaluates the predictive quality (performance) of a logistic regression, of any other classifier, or of a medical test,
  • it lets you compare several models or medical tests in order to choose the best one,
  • it shows how and where exactly your model or test is wrong by specifying the types of errors (Type I or Type II error), which gives you the opportunity to improve the model.

How to get a confusion matrix

Here is the procedure for a model:

  • split the data into two datasets, a training dataset (ca. 80% of the data) and a testing dataset (ca. 20%),
  • use only the training dataset to build (to train) a model,
  • apply this model to the testing dataset, from which the column of interest (the outcome) has been removed. In other words, use the trained model to predict the removed column in the testing dataset,
  • compare the known vs. predicted values via a 2x2 contingency table - the confusion matrix.
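The procedure above can be sketched in a few lines of base R. This is only an illustration on simulated data: the variables `x` and `y`, the 0.5 cut-off and the seed are assumptions and are not the data used in this post.

```r
set.seed(42)                                  # reproducible simulated data
n <- 500
x <- rnorm(n)                                 # one hypothetical predictor
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))      # hypothetical binary outcome
dat <- data.frame(x = x, y = y)

# 1. split into training (ca. 80%) and testing (ca. 20%) datasets
train_id <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_id, ]
test  <- dat[-train_id, ]

# 2. train the model on the training dataset only
fit <- glm(y ~ x, data = train, family = binomial)

# 3. predict the outcome in the testing dataset (the 0.5 cut-off is an assumption)
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "positive", "negative")

# 4. cross-tabulate predicted vs. known values: the confusion matrix
lev <- c("positive", "negative")              # positives first
cm <- table(
  Predicted = factor(pred, levels = lev),
  Observed  = factor(ifelse(test$y == 1, "positive", "negative"), levels = lev)
)
cm
```

Fixing the factor levels keeps the table 2x2 (with positives in the first row and column) even if the model never predicts one of the classes.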

In epidemiology we sometimes want to produce a new, better test for some disease. Whether it is better can only be judged against an existing “gold standard” test. If you imagine two columns of test results, one with the results of the “gold standard” and one with the results of the challenger, you simply cross-tabulate these two columns to produce a confusion matrix.

Comparing 1) the real (observed/known/gold standard) values with 2) the values predicted by the model or a medical test via a simple 2x2 contingency table yields only 4 numbers: true positives, false positives, true negatives and false negatives. The presence of false positives and false negatives shows that some values were confused by the model predictions. That is how a simple 2x2 table got its fancy name - confusion matrix. Interestingly, with only 4 numbers, a confusion matrix is able to produce at least 25 useful indicators of predictive power!!! Quite informative, right?! Thus, in this post we’ll try to extract as much information from a confusion matrix as possible and describe the terminology.

The anatomy of a confusion matrix

Despite the hype around the confusion matrix in machine learning, the concept is not new. The confusion matrix has been used successfully in epidemiology for a long time. For instance, imagine a study evaluating a new test that screens people for a disease. The test outcome can be positive (the person is sick) or negative (the person is healthy). The test result for each subject may or may not match the subject’s actual status, which results in 4 possible outcomes:

  • True positive (TP): Sick people correctly identified as sick.
  • False positive (FP): Healthy people incorrectly identified as sick. Type I error.
  • True negative (TN): Healthy people correctly identified as healthy.
  • False negative (FN): Sick people incorrectly identified as healthy. Type II error.

The name of each outcome has two parts: whether the test was right (true) or wrong (false), and what the test predicted (positive, i.e. sick, or negative, i.e. healthy). Here is an example of a confusion matrix (CM):

##           Outcome +    Outcome -      Total
## Test +           76           19         95
## Test -            2            3          5
## Total            78           22        100

The rows of the confusion matrix hold the values predicted by a test or a model: the upper row contains the positive predictions and the lower row the negative ones. The columns hold the observed (true) values: the left column contains the positive outcomes and the right column the negative ones. The diagonal from the upper-left corner to the lower-right corner contains the correctly identified values. The diagonal from the upper-right corner to the lower-left corner contains the incorrectly identified (misclassified) values.
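The same matrix can be typed into R directly, which makes the four cells available by name. This is a minimal base R sketch; the object name `cm` and the dimension labels are my own choice.

```r
# rows = test result (prediction), columns = true outcome (reality)
cm <- matrix(c(76, 2, 19, 3), nrow = 2,
             dimnames = list(Test = c("+", "-"), Outcome = c("+", "-")))

TP <- cm["+", "+"]   # sick, test positive
FP <- cm["+", "-"]   # healthy, test positive (Type I error)
FN <- cm["-", "+"]   # sick, test negative (Type II error)
TN <- cm["-", "-"]   # healthy, test negative

addmargins(cm)       # adds the row and column totals
```

Note that `matrix()` fills column by column, so the values are entered as TP, FN, FP, TN.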

Offspring of a confusion matrix

A Wikipedia article on the confusion matrix nicely summarizes the most important definitions in the image above. Let’s use the confusion matrix above to calculate the important metrics “manually”, learn what they mean, and then recompute our results with statistical software at the end of this blog post. You can skip the explanations and jump directly to the computation part, if you wish.

The “offspring” of a confusion matrix is diverse, thus we’ll split them into 4 categories:

  • the true metrics, which are below the “True condition” columns on the picture above
  • the predicted metrics are to the right of the “Predicted condition” rows on the picture above
  • the likelihood metrics are diagonally right below and
  • the accuracy metrics, such as accuracy, prevalence and others, most of which are not even on the picture

True metrics

Sensitivity and specificity are, in my opinion, the most useful statistics we can get from the confusion matrix.

  1. Sensitivity is the percentage of sick people who were correctly identified. Sensitivity is sometimes called the true positive rate (TPR), recall, the probability of detection, or Power. High sensitivity indicates a small number of FN. Only the left column of the confusion matrix is needed to calculate sensitivity:

\[ Sensitivity = TPR = \frac{TP}{TP + FN} = \frac{76}{76 + 2} = 0.97 = 97\%\]

  1. Specificity is the percentage of healthy patients who were correctly identified. Specificity is also called the true negative rate (TNR) or Selectivity. High specificity indicates a small number of FP. Only the right column of the confusion matrix is needed to calculate specificity:

\[ Specificity = TNR = \frac{TN}{TN + FP} = \frac{3}{3 + 19} = 0.136 = 14\%\]

  1. False positive rate (FPR) is the complement of the true negative rate (TNR), i.e. of Specificity. It is sometimes called the Type I error rate, the probability of false alarm, or Fall-out. FPR is the percentage of healthy people who were incorrectly identified as sick. Only the right column of the confusion matrix is needed to calculate FPR:

\[ FPR = 1 - Specificity = \frac{FP}{FP + TN} = \frac{19}{19 + 3} = 0.86 = 86\%\]

Artwork by Allison Horst (twitter: @allison_horst)

  1. False negative rate (FNR) is the complement of the true positive rate (TPR), i.e. of Sensitivity. It is sometimes called the Type II error rate or Miss-rate. FNR is the percentage of sick people who were incorrectly identified as healthy. Only the left column of the confusion matrix is needed to calculate FNR:

\[ FNR = 1 - Sensitivity = \frac{FN}{FN + TP} = \frac{2}{2 + 76} = 0.03 = 3\%\]

Artwork by Allison Horst (twitter: @allison_horst)

In medicine, a Type II Error is often worse than a Type I Error! Imagine you were diagnosed with cancer, but after three additional tests the diagnosis was disproved and you celebrated that you are absolutely healthy. That would be the Type I Error. Scary, but not fatal. Then imagine that your first test was negative, you celebrate, but after some time you suddenly feel bad and three new tests are all cancer-positive. Moreover, due to an erroneous negative first test your cancer progressed irreversibly and you are absolutely going to die. That would be the Type II Error. Even more scary and, literally, a fatal error.

So, what test do we need? A highly sensitive test rarely overlooks an actual positive, or rarely makes a Type II Error but can make a Type I error; a highly specific test rarely registers a positive classification for anything that is not the target of testing, or rarely makes a Type I Error, but can make a Type II error. The ideal test is both highly sensitive and highly specific at the same time, however the sensitivity might be a bit more important.
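The four “true” metrics can be verified with a few lines of base R. This is a minimal sketch that simply re-types the four cells of the example matrix; the variable names are my own.

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3   # cells of the example confusion matrix

sensitivity <- TP / (TP + FN)   # TPR, uses the left column only
specificity <- TN / (TN + FP)   # TNR, uses the right column only
FPR <- FP / (FP + TN)           # equals 1 - specificity
FNR <- FN / (FN + TP)           # equals 1 - sensitivity

round(c(sensitivity = sensitivity, specificity = specificity,
        FPR = FPR, FNR = FNR), 3)
```

The rounded values are 0.974, 0.136, 0.864 and 0.026, and each rate and its complement sum to 1.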

Predictive metrics

  1. Precision, or Positive predictive value (PPV), is the proportion of positive test results that are truly positive, i.e. the probability that a person with a positive test is actually sick. Only the upper row of the confusion matrix is needed to calculate PPV:

\[Precision = PPV = 1 - FDR = \frac{TP}{TP + FP} = \frac{76}{76 + 19} = 0.8 = 80\%\]

PPV can also be expressed in terms of sensitivity, specificity and prevalence:

\[PPV = \frac{Sensitivity × Prevalence}{Sensitivity × Prevalence + (1 − Specificity) × (1 − Prevalence)}\]

  1. Negative predictive value (NPV) is the proportion of negative test results that are truly negative, i.e. the probability that a person with a negative test is actually healthy. Only the lower row of the confusion matrix is needed to calculate NPV:

\[ NPV = 1 - FOR = \frac{TN}{TN + FN} = \frac{3}{3 + 2} = 0.6 = 60\%\]

  1. False discovery rate (FDR) is the proportion of positive test results that belong to truly healthy subjects, i.e. that were incorrectly identified as sick. Only the upper row of the confusion matrix is needed to calculate FDR:

\[ FDR = 1 - Precision = \frac{FP}{TP + FP} = \frac{19}{76 + 19} = 0.2 = 20\%\]

  1. False omission rate (FOR) is the proportion of negative test results that belong to truly sick subjects, i.e. that were incorrectly identified as healthy. Only the lower row of the confusion matrix is needed to calculate FOR:

\[ FOR = 1 - NPV = \frac{FN}{TN + FN} = \frac{2}{3 + 2} = 0.4 = 40\%\]
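The predictive metrics can be checked the same way. The sketch below also recovers the PPV from sensitivity, specificity and prevalence via Bayes' theorem; the variable names are my own.

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3   # cells of the example confusion matrix

PPV <- TP / (TP + FP)   # precision, uses the upper row only
NPV <- TN / (TN + FN)   # uses the lower row only
FDR <- FP / (TP + FP)   # equals 1 - PPV
FOR <- FN / (TN + FN)   # equals 1 - NPV

# PPV recovered from sensitivity, specificity and prevalence (Bayes' theorem)
sens <- TP / (TP + FN)
spec <- TN / (TN + FP)
prev <- (TP + FN) / (TP + FP + FN + TN)
PPV_bayes <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))

c(PPV = PPV, NPV = NPV, FDR = FDR, FOR = FOR, PPV_bayes = PPV_bayes)
```

Because prevalence enters the Bayes version, the same test would have a different PPV in a population where the disease is rarer.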

Likelihood metrics

  1. Positive likelihood ratio (LR+) is the probability of a person who has the disease testing positive divided by the probability of a person who does not have the disease testing positive. Or simply true positive rate divided by the false positive rate:

\[LR+ = \frac{Sensitivity}{1 - Specificity} = \frac{TPR}{FPR} = \frac{0.97}{0.86} = 1.13\]

Positive likelihood ratios are best understood as a unit ratio (a ratio with a denominator of 1). For instance, an LR+ of 3 means that a positive result is 3 times as likely in a sick person as in a healthy one. The greater the LR+ of a particular test, the more a positive result increases the odds that the person is truly sick. An LR+ of 1 means that a positive result carries no information, and an LR+ < 1 would imply that a positive result actually lowers the odds of disease.

  1. Negative likelihood ratio (LR−) is the probability of a person who has the disease testing negative divided by the probability of a person who does not have the disease testing negative. Or simply false negative rate divided by the true negative rate:

\[ LR− = \frac{1 - Sensitivity}{Specificity} = \frac{FNR}{TNR} = \frac{0.026}{0.136} = 0.19\]

  1. Diagnostic odds ratio (DOR) is the ratio of the odds of the test being positive if the subject has a disease relative to the odds of the test being positive if the subject does not have the disease. Or simply positive likelihood ratio divided by the negative likelihood ratio:

\[DOR = \frac{TP/FP}{FN/TN} = \frac{LR+}{LR-} = \frac{1.128}{0.188} = 6.0\]

The rationale for the diagnostic odds ratio is that it is a single indicator of test performance (like accuracy and Youden’s J index, explained below) which is independent of prevalence (unlike accuracy) and is presented as an odds ratio, which is familiar to epidemiologists.

Similarly to a usual odds ratio, the diagnostic odds ratio ranges from zero to infinity, where a DOR greater than one is already good, and the higher the DOR goes, the better the test performs. A DOR of less than one indicates that the test performs poorly, or even gives wrong information. And finally, a DOR of 1 means that the test gives no information. Besides the above formula, the diagnostic odds ratio may be expressed in terms of the sensitivity and specificity:

\[{\displaystyle {\text{DOR}}={\frac {{\text{sensitivity}}\times {\text{specificity}}}{\left(1-{\text{sensitivity}}\right)\times \left(1-{\text{specificity}}\right)}}}\]

It may also be expressed in terms of the Positive predictive value (PPV) and Negative predictive value (NPV):

\[{\displaystyle {\text{DOR}}={\frac {{\text{PPV}}\times {\text{NPV}}}{\left(1-{\text{PPV}}\right)\times \left(1-{\text{NPV}}\right)}}}\]

  1. The F1 score can be used as a single measure of performance of the test for the positive class. The F1-score is the harmonic mean of precision (PPV) and sensitivity:

\[{\displaystyle \mathrm {F} _{1}=2\cdot {\frac {\mathrm {PPV} \cdot \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}} = \frac{2 * 76}{2 * 76 + 19 + 2} = 2* \frac{0.8 * 0.97}{0.8 + 0.97} = 0.88 = 88\%\]
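A short base R sketch for the likelihood metrics and the F1 score, again starting from the four cells (variable names are my own):

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3   # cells of the example confusion matrix

sens <- TP / (TP + FN)
spec <- TN / (TN + FP)

LR_pos <- sens / (1 - spec)            # positive likelihood ratio
LR_neg <- (1 - sens) / spec            # negative likelihood ratio
DOR    <- LR_pos / LR_neg              # same as (TP * TN) / (FP * FN)
F1     <- 2 * TP / (2 * TP + FP + FN)  # harmonic mean of PPV and sensitivity

round(c(LR_pos = LR_pos, LR_neg = LR_neg, DOR = DOR, F1 = F1), 3)
```

With unrounded inputs the diagnostic odds ratio comes out at 6, which agrees with the `diag.or` value reported by epiR further below.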

Accuracy metrics

  1. (Diagnostic/Overall) Accuracy shows how often the test or classifier is correct:

\[ Accuracy = \frac{TP+TN}{total} = \frac{76+3}{100} = 0.79 = 79\%\]

Accuracy sounds great, but it is not always the most desirable metric. For instance, if only 1 out of 1000 patients is cancer positive, your predictive model achieves 99.9% accuracy by predicting all samples to be negative [1].
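This accuracy paradox is easy to reproduce. The sketch below uses hypothetical counts for 1000 screened patients with a single true case and a “model” that calls everyone healthy:

```r
# hypothetical counts: 1 sick patient out of 1000, everybody predicted healthy
TP <- 0; FN <- 1; FP <- 0; TN <- 999

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)          # the single sick patient is missed
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

c(accuracy = accuracy, sensitivity = sensitivity,
  balanced_accuracy = balanced_accuracy)
```

Accuracy is 0.999 although the sensitivity is 0, while the balanced accuracy of 0.5 exposes the model as no better than guessing.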

  1. Balanced accuracy (BA) is useful for imbalanced testing sets, where the number of observations differs strongly between the categories:

\[BA = \frac{TPR+TNR}{2} = \frac{0.97+0.14}{2} = 0.56 = 56\%\]

  1. Misclassification Rate or Error Rate shows how often the classifier or test is wrong. It is the complement of accuracy:

\[ Misclassification \ Rate = 1- Accuracy = \frac{FP+FN}{total} = \frac{19+2}{100} = 0.21 = 21\%\]

While accuracy sounds more useful than it is, the error rate sounds less useful than it is. The goal of any medical test or classification model is to minimize the error rate. Thus, this metric is an important performance indicator!

  1. True prevalence shows the percentage of sick individuals (outcome positives). Incidence is the number of new cases that develop during a specified time period. Prevalence is a useful parameter when talking about long lasting diseases, such as HIV, but incidence is more useful when talking about diseases of short duration, such as chickenpox.

\[Prevalence = \frac{TP + FN}{total} = \frac{76 + 2}{100} = 0.78 = 78\% \]

  1. Apparent (Detection) prevalence (AP) is the percentage of test positives.

\[ AP = \frac{TP + FP}{total} = \frac{76 + 19}{100} = 0.95 = 95\% \]

  1. Detection Rate is the percentage of true positives among all subjects. It uses only the upper left corner of the confusion matrix and the total:

\[ Detection \ Rate = \frac{TP}{total} = \frac{76}{100} = 0.76 = 76\%\]

  1. Threat score (TS) or Critical Success Index (CSI) ranges from 0 to 1, where 1 represents a perfect prediction. CSI reminds me of \(R^2\) in a linear model, so for me it is kind of a pseudo \(R^2\) for a classification problem:

\[ CSI = \frac{TP}{TP + FN + FP} = \frac{ 76 }{ 76 + 2 + 19} = 0.78 = 78\%\]

Despite being a very balanced score, CSI works poorly for rare events. A related score, the Equitable Threat Score (ETS) is then preferable:

\[ ETS = \frac{TP - TP_{ebc} }{TP + FP + FN - TP_{ebc}} \]

where \(TP_{ebc}\) is the True Positives expected by chance:

\[TP_{ebc} = \frac{(TP+FP)*(TP+FN)}{n}\]

The ETS goes from -1/3 to 1, where negative values mean that the model predictions are worse than random guessing.
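The accuracy metrics so far, including the ETS, which was not evaluated by hand above, can be computed in one go. This is a base R sketch with variable names of my own choosing:

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3   # cells of the example confusion matrix
n  <- TP + FP + FN + TN

accuracy       <- (TP + TN) / n
BA             <- (TP / (TP + FN) + TN / (TN + FP)) / 2   # balanced accuracy
error_rate     <- (FP + FN) / n
prevalence     <- (TP + FN) / n
AP             <- (TP + FP) / n                           # apparent prevalence
detection_rate <- TP / n
CSI            <- TP / (TP + FN + FP)

TP_ebc <- (TP + FP) * (TP + FN) / n                       # TP expected by chance
ETS    <- (TP - TP_ebc) / (TP + FP + FN - TP_ebc)

round(c(accuracy = accuracy, BA = BA, error_rate = error_rate,
        prevalence = prevalence, AP = AP, detection_rate = detection_rate,
        CSI = CSI, ETS = ETS), 3)
```

For this example 74.1 true positives are expected by chance alone, so the ETS is only about 0.083 even though the CSI is 0.78.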

  1. Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975:

\[ MCC = \frac{TP * TN - FP*FN}{\sqrt{(TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)}} = \frac{76 * 3 - 19*2}{\sqrt{(76+19)*(76+2)*(3+19)*(3+2)}} = 0.21 = 21\%\]

Although the MCC is equivalent to Karl Pearson’s \(\phi\) coefficient, which was developed decades earlier, the term MCC is widely used in the field of bioinformatics. Its absolute value can be calculated from a simple Chi-Square statistic \(\chi^2\):

\[ |MCC| = |\phi| = \sqrt{\frac{\chi^2}{n}}\]

The code below shows that, for our example, MCC and Pearson’s \(\phi\) coefficient are identical:

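# mat is the 2x2 confusion matrix from above (rows = test, columns = outcome)
`chi^2` = chisq.test(mat, simulate.p.value = TRUE)$statistic # no continuity correction is applied when simulating
n       = sum(mat)
MCC     = sqrt(`chi^2`/n)
MCC * 100 # in percent, compare with the 21% calculated by hand
## X-squared 
##  21.04496
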
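A self-contained variant of this check, as a sketch in base R: it computes the MCC directly from the four cells and compares it with \(\phi\) obtained from the uncorrected Pearson \(\chi^2\). Here `correct = FALSE` is used instead of simulation to switch off the continuity correction, and the warning about small expected counts is suppressed.

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3
cm <- matrix(c(TP, FN, FP, TN), nrow = 2)   # rows = test, columns = outcome

# MCC straight from its definition
mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# phi from the Pearson chi-square statistic without continuity correction
chi2 <- unname(suppressWarnings(chisq.test(cm, correct = FALSE))$statistic)
phi  <- sqrt(chi2 / sum(cm))

c(MCC = mcc, phi = phi)
```

Both values are about 0.2104. The square root discards the sign, so this route only recovers the absolute value of the MCC.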
  1. Youden’s J index or Informedness or Bookmaker Informedness (BM) is a single statistic that captures the performance of a dichotomous diagnostic test (or a binary classifier) by accounting for both false-positive and false-negative rates at the same time, or by summarizing the magnitudes of both types of errors. Youden’s J index estimates the probability of an informed decision and ranges from -1 to +1, where +1 implies that all predictions will be correct, while -1 implies that all predictions will be wrong:

\[ BM = sensitivity + specificity-1 = 0.97 + 0.14 - 1 = 0.11 = 11\%\]

  1. Markedness (MK) shows how trustworthy the predictions are. MK also ranges from -1 to 1 and somehow reminds me of a correlation coefficient, with the difference that MK can be applied to a classification problem:

\[MK = PPV + NPV - 1 = 0.8 + 0.6 - 1 = 0.4 = 40\%\]
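Informedness and Markedness take one line each in base R. The sketch below also checks a known relationship: the squared MCC equals the product of the two (variable names are my own).

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3   # cells of the example confusion matrix

BM <- TP / (TP + FN) + TN / (TN + FP) - 1   # informedness = sensitivity + specificity - 1
MK <- TP / (TP + FP) + TN / (TN + FN) - 1   # markedness   = PPV + NPV - 1

mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

round(c(BM = BM, MK = MK, sqrt_BM_MK = sqrt(BM * MK), MCC = mcc), 4)
```

Here BM is about 0.111 and MK is 0.4, and their geometric mean reproduces the MCC of about 0.21.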

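  1. Null Error Rate: This is how often you would be wrong if you always predicted the majority class. In our example the majority class is “sick” (78 of 100 subjects), so the null error rate is 22/100 = 22%. This can be a useful baseline metric to compare your classifier against.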

  2. Cohen’s Kappa measures how well the classifier performed as compared to how well it would have performed simply by chance (randomly). Cohen’s Kappa ranges from –1 to +1. The closer Kappa is to 1 the better, the closer to 0 the worse. If Kappa is negative, then the classifier results are worse than random guessing. 😅 It’s funny that something worse than random guessing actually exists.

Besides classification models, Kappa is often used in epidemiology (or medicine in general) to evaluate the agreement between different raters. In this case Kappa is the ratio of the proportion of times that the raters (or predictions) agree, corrected for chance agreement, to the maximum proportion of times that the raters (or predictions) could agree, also corrected for chance agreement. Thus, Kappa measures the degree of agreement of the nominal or ordinal ratings made by multiple raters evaluating the same samples.

Cohen’s unweighted Kappa is an index of inter-rater agreement between 2 raters on categorical (or ordinal) data. The higher the value of kappa, the stronger the agreement (or the better the predictions), particularly:

  • Kappa = 1, means perfect agreement exists.
  • Kappa = 0, means agreement by chance.
  • Kappa < 0, means that agreement is weaker than expected by chance.

There are several interpretations of Kappa values. One of the most used (but not the best, because it is not scientifically justified) is the interpretation by Landis & Koch (1977):

  • < 0: No agreement
  • 0–0.20: Slight
  • 0.21–0.40: Fair
  • 0.41–0.60: Moderate
  • 0.61–0.80: Substantial
  • 0.81–1: Almost perfect agreement
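If many Kappas need a label, the scale can be wrapped in a small helper. `landis_koch()` is a hypothetical function written for this sketch, not part of any package, and its last label follows Landis & Koch's wording “almost perfect”.

```r
# hypothetical helper: map a single Kappa value to the Landis & Koch (1977) label
landis_koch <- function(kappa) {
  if (kappa < 0) return("No agreement")
  labels <- c("Slight", "Fair", "Moderate", "Substantial", "Almost perfect")
  # intervals: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1]
  as.character(cut(kappa, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                   labels = labels, include.lowest = TRUE))
}

landis_koch(0.1532258)   # the Kappa of the example confusion matrix
```

For the Kappa of 0.153 from our example this returns "Slight".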

The magnitude of Kappa does not tell us whether the Kappa is significant. Thus, there is a Kappa-test, which provides the standard error, the z-value and of course the P-value for the Kappa statistic. The Null Hypothesis (\(H_0\)) in this test is: the agreement is due to chance. If the P-value ≤ 0.05, the agreement is not due to chance (reject \(H_0\)). However, the P-values for Kappa are rarely reported, because even low Kappas can be significantly different from zero.

For ordinal ratings on a scale of 1–5, Kendall’s coefficients, which account for ordering, are preferred.

Let’s first understand how Cohen’s Kappa can be calculated manually and then compare it to a computed Kappa:

\(p_{Yes}\) is the expected probability that both would say yes at random:

\[{\displaystyle p_{\text{Yes}}={\frac {TP+FP}{total}}\cdot {\frac {TP+FN}{total}}} = \frac{76+19}{100} \cdot \frac{76+2}{100} = 0.95 \cdot 0.78 = 0.741\]

\(p_{No}\) is the expected probability that both would say no at random:

\[{\displaystyle p_{\text{No}}={\frac {FN+TN}{total}}\cdot {\frac {FP+TN}{total}}} = \frac{2+3}{100} \cdot \frac{19+3}{100} = 0.05 \cdot 0.22 = 0.011\]

Overall random agreement probability \(p_e\) is the probability that raters agreed on either Yes or No, i.e.:

\[p_e = p_{Yes} + p_{No} = 0.741 + 0.011 = 0.752\]

Now, we finally can calculate the Cohen’s Kappa itself:

\[Kappa = \frac{accuracy - p_e}{1 - p_e} = \frac{0.79 - 0.752}{1 - 0.752} = 0.1532258 \]
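The manual Kappa calculation translates directly into base R. In this sketch the result is stored as `kappa_hat` (my own name) so that base R's `kappa()` function is not masked.

```r
TP <- 76; FP <- 19; FN <- 2; TN <- 3   # cells of the example confusion matrix
n  <- TP + FP + FN + TN

p_o   <- (TP + TN) / n                       # observed agreement (accuracy)
p_yes <- (TP + FP) / n * (TP + FN) / n       # both say "yes" by chance
p_no  <- (FN + TN) / n * (FP + TN) / n       # both say "no" by chance
p_e   <- p_yes + p_no                        # overall chance agreement

kappa_hat <- (p_o - p_e) / (1 - p_e)
kappa_hat
```

The result, about 0.1532, matches the Kappa reported by the caret and fmsb packages further below.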

The only problem with Cohen’s Kappa is that it only measures the agreement between 2 raters, while Fleiss’ Kappa handles more than 2 raters. Thus…

  1. …when the agreement between more than two raters (for example, several classification models or medical tests applied to the same subjects) is of interest, Fleiss’ Kappa can be used. Note that the number of outcome categories is not the issue here: Cohen’s Kappa already handles more than two categories, e.g. the classes of a multinomial logistic regression. Fleiss’ Kappa is an index of inter-rater agreement between more than two raters on categorical or ordinal data. The calculations of Fleiss’ Kappa are even more complex than those of Cohen’s Kappa, so here we will only compute it, including the exact Fleiss’ Kappa, with the kappam.fleiss() function from the irr package. Additionally, category-wise Kappas can also be computed. The function takes the following arguments:

  • ratings: an n*m matrix or dataframe, n subjects and m raters.
  • exact: a logical indicating whether the exact Kappa (Conger, 1980) or the Kappa described by Fleiss (1971) should be computed.
  • detail: a logical indicating whether category-wise Kappas should be computed. They show which ratings (categories) the raters agree upon the most. For instance, if the movies are rated on a scale from 0 to 5, 5 being the best, the results of the test show that raters agree the most (highest Kappa of 0.16) about the best movies:

kappam.fleiss(video)              # Fleiss' Kappa
##  Fleiss' Kappa for m Raters
##  Subjects = 20 
##    Raters = 4 
##     Kappa = 0.0357 
##         z = 0.531 
##   p-value = 0.596
kappam.fleiss(video, exact =TRUE) # Exact Kappa
##  Fleiss' Kappa for m Raters (exact value)
##  Subjects = 20 
##    Raters = 4 
##     Kappa = 0.0951
kappam.fleiss(video, detail=TRUE) # Fleiss' and category-wise Kappa
##  Fleiss' Kappa for m Raters
##  Subjects = 20 
##    Raters = 4 
##     Kappa = 0.0357 
##         z = 0.531 
##   p-value = 0.596 
##    Kappa      z p.value
## 2 -0.026 -0.281   0.779
## 3 -0.010 -0.113   0.910
## 4  0.031  0.345   0.730
## 5  0.159  1.744   0.081
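To see what happens under the hood, here is a sketch of the classic Fleiss (1971) formula in base R. `fleiss_kappa()` and the tiny `toy` ratings matrix are hypothetical and made up for illustration; the function should correspond to the default (non-exact) variant, but it has not been checked against the irr package here.

```r
# hypothetical helper: Fleiss' Kappa for an n x m matrix (n subjects, m raters)
fleiss_kappa <- function(ratings) {
  ratings <- as.matrix(ratings)
  N <- nrow(ratings)                       # number of subjects
  m <- ncol(ratings)                       # number of raters per subject
  cats <- sort(unique(as.vector(ratings))) # observed categories (at least 2 needed)
  # counts[i, j] = number of raters who put subject i into category j
  counts <- t(apply(ratings, 1, function(r) table(factor(r, levels = cats))))
  P_i   <- (rowSums(counts^2) - m) / (m * (m - 1))  # agreement per subject
  p_j   <- colSums(counts) / (N * m)                # share of each category
  P_bar <- mean(P_i)                                # mean observed agreement
  P_e   <- sum(p_j^2)                               # agreement expected by chance
  (P_bar - P_e) / (1 - P_e)
}

# made-up ratings: 4 subjects (rows) rated by 3 raters (columns)
toy <- rbind(c("a", "a", "a"),
             c("a", "a", "b"),
             c("b", "b", "b"),
             c("a", "b", "b"))
fleiss_kappa(toy)
```

For the toy data the mean observed agreement is 2/3 and the chance agreement is 1/2, so Kappa is 1/3.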

How to compute Confusion Matrix and its 25 offspring

Although most of the metrics are simple to calculate, we sometimes need to compute them many times, e.g. for different tests or for different classifiers, which is error-prone. Thus, using statistical software is often a faster and safer choice. Several R packages can carry out the confusion matrix analysis; the ones used in this post are loaded below.

Load all needed packages at once to avoid interruptions.

library(tidyverse)   # data wrangling and visualization
library(knitr)       # beautifying tables
library(ISLR)        # for Wage dataset 
library(caret)       # machine learning library, for CM in this post
library(readxl)      # for reading excel files
library(epiR)        # for confusion matrix-analysis
library(mltools)     # for calculating Matthews correlation coefficient (MCC)
library(fmsb)        # for Kappa Test
library(irr)         # for Fleiss Kappa

Now, take two columns, where the first column contains the model predictions and the second column contains the known values (or the gold standard of some test), and produce your own confusion matrix with the code below:

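# rows = predictions (test), columns = known values; positives must come first
mat <- table(d$first_column, d$second_column)
bla <- epi.tests(mat) # epiR: point estimates with confidence intervals
bla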
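If you want to follow along without your own data, the data frame `d` can be rebuilt from the four counts. This is a base R sketch; the column names follow the code above, while the labels "positive" and "negative" are my assumption.

```r
lev <- c("positive", "negative")   # positives first, as in the tables of this post

d <- data.frame(
  # model predictions (or results of the new test)
  first_column  = factor(rep(c("positive", "positive", "negative", "negative"),
                             times = c(76, 19, 2, 3)), levels = lev),
  # known values (or results of the gold standard)
  second_column = factor(rep(c("positive", "negative", "positive", "negative"),
                             times = c(76, 19, 2, 3)), levels = lev)
)

mat <- table(d$first_column, d$second_column)
mat
```

Setting the factor levels explicitly matters, because by default R orders them alphabetically and "negative" would end up in the first row and column.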

My artificially created data, with the confusion matrix already familiar to you, produce the following results:

##           Outcome +    Outcome -      Total
## Test +           76           19         95
## Test -            2            3          5
## Total            78           22        100
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence                    0.95 (0.89, 0.98)
## True prevalence                        0.78 (0.69, 0.86)
## Sensitivity                            0.97 (0.91, 1.00)
## Specificity                            0.14 (0.03, 0.35)
## Positive predictive value              0.80 (0.71, 0.88)
## Negative predictive value              0.60 (0.15, 0.95)
## Positive likelihood ratio              1.13 (0.95, 1.34)
## Negative likelihood ratio              0.19 (0.03, 1.06)
## ---------------------------------------------------------

Moreover, a summary of the object created by the amazing epiR package provides most of the metrics we have learned in this post at once! Particularly:

  • aprev - apparent prevalence.
  • tprev - true prevalence.
  • se - test sensitivity.
  • sp - test specificity.
  • diag.acc - diagnostic accuracy.
  • diag.or - diagnostic odds ratio.
  • nnd - number needed to diagnose.
  • youden - Youden’s index.
  • ppv - positive predictive value.
  • npv - negative predictive value.
  • plr - likelihood ratio of a positive test.
  • nlr - likelihood ratio of a negative test.
  • pro - the proportion of subjects with the outcome ruled out.
  • pri - the proportion of subjects with the outcome ruled in.
  • pfp - of all the subjects that are truly outcome negative, the proportion that are incorrectly classified as positive (the proportion of false positives).
  • pfn - of all the subjects that are truly outcome positive, the proportion that are incorrectly classified as negative (the proportion of false negative).
##                 est         lower       upper
## aprev    0.95000000   0.887165089  0.98356812
## tprev    0.78000000   0.686080346  0.85669642
## se       0.97435897   0.910426673  0.99687953
## sp       0.13636364   0.029055851  0.34912210
## diag.acc 0.79000000   0.697084621  0.86505630
## diag.or  6.00000000   0.935457772 38.48383227
## nnd      9.03157895 -16.524152457  2.89015984
## youden   0.11072261  -0.060517476  0.34600162
## ppv      0.80000000   0.705428645  0.87507901
## npv      0.60000000   0.146632800  0.94725505
## plr      1.12820513   0.951921299  1.33713450
## nlr      0.18803419   0.033485837  1.05587492
## pro      0.05000000   0.016431879  0.11283491
## pri      0.95000000   0.887165089  0.98356812
## pfp      0.86363636   0.650877903  0.97094415
## pfn      0.02564103   0.003120472  0.08957333

The caret package by Max Kuhn gives us the rest:

## Confusion Matrix and Statistics
##              1. positiv 2. negativ
##   1. positiv         76         19
##   2. negativ          2          3
##                Accuracy : 0.79            
##                  95% CI : (0.6971, 0.8651)
##     No Information Rate : 0.78            
##     P-Value [Acc > NIR] : 0.4608927       
##                   Kappa : 0.1532          
##  Mcnemar's Test P-Value : 0.0004803       
##             Sensitivity : 0.9744          
##             Specificity : 0.1364          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.7800          
##          Detection Rate : 0.7600          
##    Detection Prevalence : 0.9500          
##       Balanced Accuracy : 0.5554          
##        'Positive' Class : 1. positiv      

The mltools package helps to compute the Matthews correlation coefficient and the fmsb package conducts the Kappa-test and gives us both confidence intervals and P-value for the Kappa statistics:

mltools::mcc(TP = 76, FP = 19, FN = 2, TN = 3)
## [1] 0.2104496
Kappa.test(mat) # Kappa-test from the fmsb package (base R's kappa() is a different function)
## $Result
##  Estimate Cohen's kappa statistics and test the null hypothesis that
##  the extent of agreement is same as random (kappa=0)
## data:  mat
## Z = 0.87993, p-value = 0.1894
## 95 percent confidence interval:
##  -0.1686732  0.4751248
## sample estimates:
## [1] 0.1532258
## $Judgement
## [1] "Slight agreement"

Conclusion, or which performance score is better?

So many indicators of model performance can be overwhelming and confusing! Are they all good? Of course not. Are they all needed? Of course yes. They simply answer different questions. However, some of them can still be compared in terms of their performance.

As mentioned above, the overall accuracy is not that reliable. The F1 score can be even worse, because it ignores the true negatives. Informedness (Youden’s J index) and Markedness are supposed to be better (this is only my opinion!), since they account for both types of errors. According to Davide Chicco and Giuseppe Jurman, the most informative metric to evaluate a confusion matrix is the Matthews correlation coefficient (MCC) [2]. Because its formula considers all four cells of the confusion matrix, its score is high only if your classifier is doing well on both the negative and the positive elements. In our example, the value of the MCC is 0.21, which is only slightly better than random guessing (MCC = 0). However, a low MCC is also very useful, since it warns us that our model performs poorly. Thus, MCC seems to be the best metric for evaluating a confusion matrix [3]. However, when it comes to measuring the performance of a classification model (or a medical test), the absolute “King” metric is the AUROC - Area Under the Receiver Operating Characteristics, which is such a big topic that it deserves a dedicated blog post. Thus…

What’s next

  • watch the two amazing videos by Josh Starmer on the confusion matrix (linked below),
  • have a look at my lecture, which connects the confusion matrix with the ROC curve … by the way, the lecture was created with the cool xaringan package by Yihui Xie, and
  • definitely read my next post on the ROC curve.

If you think I missed something, please comment on it, and I’ll improve this tutorial.

Thank you for learning!

Further readings and references

  1. Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.



Yury Zablotski
Data Scientist at LMU Munich, Faculty of Veterinary Medicine

Passion for applying Biostatistics and Machine Learning to Life Science Data

