A large number of bioinformatics studies are based on classification models. For instance, a classification model can be used to detect potential cancer patients from their blood samples. Performance evaluation of such a model is critical for selecting the most effective parameters and for comparing multiple models with the same functionality. The Receiver Operating Characteristic (ROC) plot has been routinely used to evaluate such classification models.
The ROC plot was originally developed by electrical engineers during World War II to detect enemy objects from their radar signals. Since then, it has been used in a wide range of fields, including the life sciences. Because of its popularity, the characteristics of the plot have been well studied. For instance, one of its potential disadvantages is that it can be misleading when applied to strongly imbalanced datasets.
Many datasets in the life sciences are naturally imbalanced. Nonetheless, the ROC plot has remained the most widely used evaluation measure even when the dataset is strongly imbalanced. Here, we reveal several potential issues related to imbalanced datasets and also show the advantages of the precision-recall plot, an alternative evaluation measure, over the ROC plot.
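This contrast can be illustrated with a small simulation (a sketch of our own, assuming scikit-learn and NumPy; the sample sizes, score distributions, and the weak classifier itself are invented for illustration, not taken from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 9900                       # 1:99 imbalance
y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]

# A mediocre classifier: positives score only slightly higher on average.
scores = np.r_[rng.normal(1.0, 1.0, n_pos),    # scores for positives
               rng.normal(0.0, 1.0, n_neg)]    # scores for negatives

auc = roc_auc_score(y_true, scores)            # looks respectable despite imbalance
ap = average_precision_score(y_true, scores)   # far lower: many false positives
print(f"ROC AUC: {auc:.2f}, average precision: {ap:.2f}")
```

On this kind of data, the ROC AUC can look reassuring while the precision-recall summary reveals that most of the top-scoring instances are actually negatives.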
Our motivation and 3 goals
Our main motivation is that we would like to share information about the appropriate and effective use of the ROC and precision-recall plots. Specifically, our main message is that the precision-recall plot is more informative than the ROC plot when applied to imbalanced datasets.
To achieve the main goal, we have set three sub-goals, which are to:
- publish a peer-reviewed article with several analyses to clarify the benefit of the precision-recall plot over the ROC plot,
- create a website that provides an easy introduction and the summary of our analyses, and
- develop an open source program that can be used to generate the ROC and the precision-recall plots accurately.
We have achieved the first goal by publishing an article in PLOS ONE.
[Saito2015] Takaya Saito and Marc Rehmsmeier (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 10(3):e0118432
Subsequently, this website was created to achieve our second goal. Although most of the site's content is based on our paper, we have created over 80 additional images to make the content more accessible. Our aim is to make every page on this site easy to follow, even for readers without a bioinformatics background.
To achieve the third goal, we have developed an R package called Precrec, available on CRAN, that calculates ROC and precision-recall curves quickly and accurately, and we have published an article about it in Bioinformatics.
[Saito2016] Takaya Saito and Marc Rehmsmeier (2016) Precrec: fast and accurate precision-recall and ROC curve calculations in R. Bioinformatics.
Scope of our project
The scope of our project for model evaluation spans several different areas, but we mainly focus on binary classifiers and imbalanced datasets.
A classifier is a model that outputs classes for given input data. Classes are groups whose members share certain similarities. For instance, assume there are three classes: A, B, and C. The members of class A should share similar characteristics with one another but differ from the members of B and C.
Classification models, or classifiers, automatically detect or predict a class for a given data instance. A classification model is usually developed with computational and statistical approaches, such as machine learning techniques.
A binary classifier is the most common type of classification model. It produces an output with only two classes, which can be any two distinct values or labels. The class of interest is usually denoted as “positive” and the other as “negative”.
An input dataset usually consists of multiple feature vectors, where each feature vector corresponds to one class value. A feature vector contains multiple feature values, which are the input variables, also called explanatory variables.
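As a toy illustration (the feature values below are our own invention), a dataset can be thought of as a matrix of feature vectors paired with one class label per row:

```python
import numpy as np

# Each row is one feature vector with two feature values
# (explanatory variables) per instance.
X = np.array([
    [5.1, 0.3],   # instance 1
    [4.8, 0.7],   # instance 2
    [6.2, 0.1],   # instance 3
])

# One class label per feature vector: "positive" or "negative".
y = np.array(["positive", "negative", "positive"])

assert X.shape[0] == y.shape[0]  # one label per feature vector
```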
We assume that a dataset is imbalanced when it meets the following three criteria.
- The number of negatives outweighs the number of positives
- Negatives outnumber positives at least 2-fold
- Large or medium sample size (>200 instances/observations)
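The three criteria above can be sketched as a small check (a hypothetical helper of our own; the function name and default thresholds simply mirror the list):

```python
def is_imbalanced(n_pos, n_neg, min_size=200, min_ratio=2.0):
    """Return True if a binary dataset meets our imbalance criteria:
    negatives outnumber positives, at least a 2-fold excess of
    negatives, and a large or medium sample size (>200 instances)."""
    if n_pos + n_neg <= min_size:      # small datasets are excluded
        return False
    if n_neg <= n_pos:                 # negatives must outnumber positives
        return False
    return n_neg >= min_ratio * n_pos  # at least a 2-fold excess

print(is_imbalanced(50, 500))   # True: 10-fold negatives, 550 instances
print(is_imbalanced(100, 150))  # False: less than 2-fold negatives
```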
We exclude datasets with small sample sizes because evaluating classifiers on small datasets usually requires other, more specialized approaches.
Summary of the project scope
Throughout this website, we mainly consider binary classifiers with imbalanced datasets, in which the number of negatives significantly outweighs the number of positives.
The site contains 13 main pages together with some additional pages. We have tried to make all pages independent of each other, so you can understand the content even if you start reading from any page you like.
- About this site
- Introduction to evaluation measures
- Simulation analysis with imbalanced data
- Literature analysis on classifier evaluation
- Re-analysis of a previous study with imbalanced data
- Tools for ROC and precision-recall
Our top 3 recommended pages
We recommend the following three pages for those who are already familiar with ROC analysis.
- Introduction to the precision-recall plot
- ROC and precision-recall with imbalanced datasets
- Re-analysis of a previous study with imbalanced data
We currently have two additional pages. We may add more pages if necessary.
Our Blog posts
Whenever we add a new blog post, it will appear on the BLOG page. We write a new post for major updates or when we find something relevant and important to our project.
We would love to hear your feedback. We have set up a simple contact form so that you can send us a message.