A large number of bioinformatics studies are based on classification models. For instance, a classification model can be used to detect potential cancer patients from their blood samples. Performance evaluation of such a model is critical for selecting the most effective parameters and for comparing multiple models with the same functionality. The Receiver Operating Characteristic (ROC) plot has been routinely used to evaluate such classification models.
The ROC plot was originally developed by electrical engineers during World War II to detect enemy objects from their radar signals. Since then, it has been used in a wide range of fields, including the life sciences. Because of its popularity, the characteristics of the plot have been well studied. For instance, one of its potential disadvantages is that it can be misleading when applied to strongly imbalanced datasets.
Many datasets in the life sciences are naturally imbalanced. Nonetheless, the ROC plot has remained the most widely used evaluation measure even when the dataset is strongly imbalanced. Here, we reveal several potential issues related to imbalanced datasets and also show the advantages of the precision-recall plot, an alternative evaluation measure, over the ROC plot.
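This contrast can be illustrated with a small simulation (a sketch of our own, assuming scikit-learn and NumPy; the sample sizes, score distributions, and the weak classifier itself are invented for illustration, not taken from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 9900                       # 1:99 imbalance
y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]

# A mediocre classifier: positives score only slightly higher on average.
scores = np.r_[rng.normal(1.0, 1.0, n_pos),    # scores for positives
               rng.normal(0.0, 1.0, n_neg)]    # scores for negatives

auc = roc_auc_score(y_true, scores)            # looks respectable despite imbalance
ap = average_precision_score(y_true, scores)   # far lower: many false positives
print(f"ROC AUC: {auc:.2f}, average precision: {ap:.2f}")
```

On this kind of data, the ROC AUC can look reassuring while the precision-recall summary reveals that most of the top-scoring instances are actually negatives.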
Our motivation and 3 goals
Our main motivation is that we would like to share information about the appropriate and effective use of the ROC and precision-recall plots. Specifically, our main message is that the precision-recall plot is more informative than the ROC plot when applied to imbalanced datasets.
To achieve the main goal, we have set three sub-goals, which are to:
- publish a peer-reviewed article with several analyses to clarify the benefit of the precision-recall plot over the ROC plot,
- create a website that provides an easy introduction and the summary of our analyses, and
- develop an open source program that can be used to generate the ROC and the precision-recall plots accurately.
We have achieved the first goal by publishing an article in PLOS ONE.
[Saito2015] Takaya Saito and Marc Rehmsmeier (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 10(3):e0118432
Subsequently, this website was created to achieve our second goal. Although most of the site's content is based on our paper, we have created over 80 additional images to make the content more accessible. Our aim is to make every page on this site easy to follow, even for readers without a bioinformatics background.
To achieve the third goal, we have developed an R package called Precrec, available on CRAN, that calculates ROC and precision-recall curves quickly and accurately, and we have published an article about it in Bioinformatics.
[Saito2016] Takaya Saito and Marc Rehmsmeier (2016) Precrec: fast and accurate precision-recall and ROC curve calculations in R. Bioinformatics.
Scope of our project
The scope of our project for model evaluation spans several different areas, but we mainly focus on binary classifiers and imbalanced datasets.
A classifier is a model that outputs classes for given input data. Classes are groups whose members share certain similarities. For instance, assume there are three classes: A, B, and C. The members of class A should share similar characteristics with one another but differ from the members of B and C.
Classification models, or classifiers, automatically detect or predict a class for a given data instance. A classification model is usually developed with computational and statistical approaches, such as machine learning techniques.
A binary classifier is the most common type of classification model. It produces an output with only two classes, which can be any two distinct values or labels. The class of interest is usually denoted as “positive” and the other as “negative”.
An input dataset usually consists of multiple feature vectors, where each feature vector corresponds to one class value. A feature vector contains multiple feature values, which are the input variables, also called explanatory variables.
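As a toy illustration (the feature values below are our own invention), a dataset can be thought of as a matrix of feature vectors paired with one class label per row:

```python
import numpy as np

# Each row is one feature vector with two feature values
# (explanatory variables) per instance.
X = np.array([
    [5.1, 0.3],   # instance 1
    [4.8, 0.7],   # instance 2
    [6.2, 0.1],   # instance 3
])

# One class label per feature vector: "positive" or "negative".
y = np.array(["positive", "negative", "positive"])

assert X.shape[0] == y.shape[0]  # one label per feature vector
```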
We assume that a dataset is imbalanced when it meets the following three criteria.
- The number of negatives outweighs the number of positives
- Negatives outnumber positives at least 2-fold
- Large or medium sample size (>200 instances/observations)
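The three criteria above can be sketched as a small check (a hypothetical helper of our own; the function name and default thresholds simply mirror the list):

```python
def is_imbalanced(n_pos, n_neg, min_size=200, min_ratio=2.0):
    """Return True if a binary dataset meets our imbalance criteria:
    negatives outnumber positives, at least a 2-fold excess of
    negatives, and a large or medium sample size (>200 instances)."""
    if n_pos + n_neg <= min_size:      # small datasets are excluded
        return False
    if n_neg <= n_pos:                 # negatives must outnumber positives
        return False
    return n_neg >= min_ratio * n_pos  # at least a 2-fold excess

print(is_imbalanced(50, 500))   # True: 10-fold negatives, 550 instances
print(is_imbalanced(100, 150))  # False: less than 2-fold negatives
```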
We exclude datasets with small sample sizes because evaluating classifiers on small datasets usually requires other, more specialized approaches.
Summary of the project scope
Throughout this website, we mainly consider binary classifiers with imbalanced datasets, in which the number of negatives significantly outweighs the number of positives.
The site contains 13 main pages together with some additional pages. We have tried to make all pages independent of each other, so you can understand the content even if you start reading from any page you like.
- About this site
- Introduction to evaluation measures
- Simulation analysis with imbalanced data
- Literature analysis on classifier evaluation
- Re-analysis of a previous study with imbalanced data
- Tools for ROC and precision-recall
Our top 3 recommended pages
We recommend the following three pages for those who are already familiar with ROC analysis.
- Introduction to the precision-recall plot
- ROC and precision-recall with imbalanced datasets
- Re-analysis of a previous study with imbalanced data
We currently have two additional pages. We may add more pages if necessary.
Our Blog posts
Whenever we add a new blog post, it will appear on the BLOG page. We write a new post for major updates or when we find something relevant and important to our project.
We would love to hear your feedback. We have set up a simple contact form so that you can send us a message.