Citation update

We are happy to report that, two years after publication, our PLoS One article (link) has been cited 58 times according to Google Scholar (link). The citing articles cover a variety of topics, including non-bioinformatics ones.

Some bioinformatics articles citing us:

Reference standards for next-generation sequencing

The authors say: “For data sets with unbalanced classes, ROC curves should be interpreted cautiously because a small change in the false-positive rate can lead to a large change in the absolute number of erroneous predictions [Saito and Rehmsmeier 2015; MR]. A precision-recall curve can be more useful for resolving diagnostic thresholds in next-generation sequencing tests, in which the true-negative class is usually far larger than the true-positive class” (my annotations indicated by MR)

The authors say: “Although the differences between the methods as measured by the AUROC are not very large (even though in several cases the differences are statistically significant according to the DeLong test […]), it is well-known that with imbalanced data the AUPRC is more informative than the AUROC [Saito and Rehmsmeier 2015, Davis and Goadrich 2006; MR]” (my annotations indicated by MR)
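The point about absolute error counts can be illustrated with a quick back-of-the-envelope calculation (the numbers below are hypothetical, chosen only for demonstration, and are not taken from the cited article): when the negative class is large, even a tiny shift in the false-positive rate translates into a large absolute number of false positives, which is exactly what precision is sensitive to and the ROC curve is not.

```python
# Hypothetical numbers: with a large negative class, a small change in the
# false-positive rate (FPR) produces many additional false positives.
negatives = 1_000_000  # assumed size of the true-negative class

for fpr in (0.001, 0.002):
    false_positives = round(fpr * negatives)
    print(f"FPR = {fpr:.3f} -> {false_positives:,} false positives")
# Doubling the FPR from 0.1% to 0.2% adds 1,000 false positives here,
# even though the ROC curve barely moves.
```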

Epigenomics-Based Identification of Major Cell Identity Regulators within Heterogeneous Cell Populations

The authors say: “Importantly, precision (and AUPR) is the most reliable performance measure when there is a highly skewed distribution of class (Ackermann et al., 2012 ;  Saito and Rehmsmeier, 2015).”


Some non-bioinformatics articles:

Research Dilemmas with Behavioral Big Data

The authors say: “Because BBD studies often involve imbalanced binary outcomes (modeling differences between a small minority and a majority), the choice of metrics should be suitable for such situations. Saito and Rehmsmeier60 show that ROC curves are insensitive to the imbalance ratio, while precision-recall charts do changes with the imbalance ratio.”

The authors say: “Indeed, as “non-outbreaks” are by far more frequent than outbreaks, researchers must deal with what are called imbalanced datasets. In these situations, it has been shown that precision-recall plots, based precisely on VPP [= PPV; VPP seems to be the French version; MR] and sensitivity, give accurate information about classification performance and are more informative than the traditional ROC curves, based on sensitivity and specificity [Saito and Rehmsmeier 2015; MR].” (my annotations indicated by MR)

Development and validation of an electronic medical record-based alert score for detection of inpatient deterioration outside the ICU

This article uses the “work-up to detection ratio (W:D)”, which is the inverse of the positive predictive value (PPV).

The authors say: “However, using these measures [the common ones, not PPV; MR] presents a number of problems when one tries to predict extremely rare outcomes [Saito and Rehmsmeier 2015; MR]. The most important of these is that a classifier that tries to maximize the accuracy of its classification rule when predicting a rare outcome may obtain an accuracy of 99% just by classifying all observations as non-events, and model improvements (e.g. as quantified by increases in the c statistic) are overshadowed by the large true negative rate.” (my annotations indicated by MR)
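The accuracy pitfall the authors describe can be reproduced in a few lines (the counts below are hypothetical, chosen only to make the arithmetic obvious): a classifier that predicts no events at all still achieves 99% accuracy when only 1% of cases are events.

```python
# Hypothetical counts: 10 true positives ("events") among 1,000 cases.
# A classifier that labels everything as a non-event still scores 99%.
positives, negatives = 10, 990

tp, fp = 0, 0              # "always negative" predicts no events at all
tn, fn = negatives, positives

accuracy = (tp + tn) / (positives + negatives)
recall = tp / (tp + fn)    # sensitivity: fraction of events detected
print(f"accuracy = {accuracy:.0%}, recall = {recall:.0%}")
# -> accuracy = 99%, recall = 0%
```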

Bank Insolvency Risk and Z-Score Measures: Caveats and Best Practice

The authors say: “When positive outcomes are sparse, it becomes easier to predict negative outcomes and, consequently, true negatives are much more numerous than false positives (i.e. the false positive rate is generally weak). Therefore, if we care more about the positive outcomes, the AUROC curve can overstate the overall performance of the classifier. The area under the precision-recall (AUPR) curve can be considered as a more appropriate evaluation metric in such situations (Saito and Rehmsmeier (2015)).”
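The mechanism behind this last quote can be sketched with hypothetical numbers: hold the classifier's recall and false-positive rate fixed (so its ROC curve is identical in every scenario) and grow the negative class, and precision collapses.

```python
# Hypothetical numbers: recall and FPR are held fixed while the negative
# class grows, so the ROC operating point never changes -- but precision
# (PPV) collapses as true negatives swamp the positives.
tp = 90          # 90 of 100 actual positives found (recall = 0.9)
fpr = 0.01       # fixed false-positive rate

for negatives in (100, 10_000, 1_000_000):
    fp = fpr * negatives
    precision = tp / (tp + fp)
    print(f"{negatives:>9,} negatives -> precision = {precision:.3f}")
```

The same recall/FPR pair thus maps to very different precision values depending on the class balance, which is why the AUROC can look flattering on imbalanced data while the AUPR does not.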



Our article has been recommended on Faculty of 1000

Our PLoS One article (link) has been recommended on Faculty of 1000 (link) by Michael Barnes and David Watson from Queen Mary University of London, London, UK. From the recommendation:

“The authors make a compelling case that when data is heavily skewed, as it is in most bioinformatic contexts, the widely used receiver operating characteristic (ROC) curve and its associated metric AUC (area under the ROC curve) can be severely misleading.”


Our first blog entry

Hooray! We have launched our new website! The site consists of 13 main pages that explain important aspects of performance evaluation for binary classifiers. Below are the links to the main pages.

We hope you find our website useful and informative. Please use our Contact page for any questions. We’d love to hear your feedback on our site.