Simulation analysis with imbalanced data

The ROC plot is a popular evaluation measure for binary classifiers, but it can be misleading when applied to imbalanced datasets. In our simulation analysis, we aim to reveal the differences between ROC and precision-recall plots and the potential issues caused by misinterpreting them.

If you find the thoughts presented here convincing, please consider citing our PLOS ONE article in your next publication (link). For some interesting citations, see this blog post: link.

Contents

This section contains two pages.

Method section

We introduce the method of our simulation analysis. We explain how to randomly generate samples at several different performance levels for both balanced and imbalanced scenarios.
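As a flavor of what such a simulation involves, here is a minimal sketch in Python. It is not the article's actual procedure, only an illustration under simple assumptions: classifier scores for positives and negatives are drawn from two Gaussians, a hypothetical `separation` knob sets the performance level, and ROC AUC is computed via the rank-sum (Mann-Whitney U) identity. A precision calculation at a fixed threshold hints at why imbalance matters.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scores(n_pos, n_neg, separation=1.0):
    """Draw classifier scores: positives ~ N(separation, 1), negatives ~ N(0, 1).
    'separation' is a hypothetical knob controlling the performance level."""
    scores = np.concatenate([rng.normal(separation, 1.0, n_pos),
                             rng.normal(0.0, 1.0, n_neg)])
    labels = np.concatenate([np.ones(n_pos, dtype=int),
                             np.zeros(n_neg, dtype=int)])
    return scores, labels

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_at(scores, labels, threshold):
    """Fraction of predicted positives (score >= threshold) that are truly positive."""
    pred = scores >= threshold
    return labels[pred].mean() if pred.any() else 0.0

# Balanced (1:1) vs imbalanced (1:10) class ratios, same underlying classifier.
s_bal, y_bal = simulate_scores(1000, 1000)
s_imb, y_imb = simulate_scores(1000, 10000)

print("ROC AUC  balanced:", round(roc_auc(s_bal, y_bal), 3),
      " imbalanced:", round(roc_auc(s_imb, y_imb), 3))
print("Precision@1.0 balanced:", round(precision_at(s_bal, y_bal, 1.0), 3),
      " imbalanced:", round(precision_at(s_imb, y_imb, 1.0), 3))
```

With these settings, ROC AUC stays roughly the same under both class ratios, while precision drops sharply in the imbalanced case, which is the kind of contrast the simulation analysis is designed to expose.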

For those who are interested in how to generate the samples, the method section is a good introduction to random sampling.

For those who are mainly interested in the difference between ROC and precision-recall under balanced and imbalanced scenarios, we recommend skipping the method section and starting directly from the results section.

Results section

We summarize the results of the simulations and discuss the difference between ROC and precision-recall in the results section.
