 # ROC and precision-recall with imbalanced datasets

The ROC plot can be misleading when applied to strongly imbalanced datasets, yet it remains widely used for evaluating binary classifiers despite this potential disadvantage.

The goal of this analysis is to clarify the difference between ROC and precision-recall plots through simulations under various conditions. Our simulation results clearly suggest that the precision-recall plot is more informative than the ROC plot when applied to imbalanced datasets.

## Preparation for the simulation

We briefly explain the simulation method here. Please see the Method of the simulation page for more details.

First, we defined the following five performance levels by combining two probability distributions – one for positive and one for negative scores.

- Random
- Poor early retrieval
- Good early retrieval
- Excellent
- Perfect

Second, we generated datasets randomly by sampling scores from the score distributions of a specific performance level. A dataset is regarded as “balanced” when it consists of 1000 positives and 1000 negatives, and as “imbalanced” when it consists of 1000 positives and 10 000 negatives.

| Dataset    | # of positives | # of negatives |
|------------|----------------|----------------|
| Balanced   | 1000           | 1000           |
| Imbalanced | 1000           | 10 000         |

In total, we made 10 distinct sample types: 5 performance levels under both balanced and imbalanced scenarios.

Finally, we repeated the random sample creation 1000 times and calculated evaluation measures for each iteration. We then created evaluation graphs by plotting the average curves over all iterations.
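The procedure above can be sketched in a few lines of Python. The score distributions below (unit-variance Gaussians separated by one standard deviation) are illustrative stand-ins for the distributions defined on the Method page, and the sample sizes and iteration count are scaled down for speed; the AUC is computed from the rank statistic P(positive score > negative score).

```python
import random

def roc_auc(pos, neg):
    # AUC equals P(positive score > negative score), ties counted as 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def simulate_mean_auc(n_pos, n_neg, iterations=20, seed=0):
    # Repeat random sampling and average the per-iteration AUC scores
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        pos = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]  # positive scores
        neg = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]  # negative scores
        total += roc_auc(pos, neg)
    return total / iterations

balanced = simulate_mean_auc(100, 100)      # 1:1 positive-to-negative ratio
imbalanced = simulate_mean_auc(100, 1000)   # 1:10 ratio
```

Because the scores for both ratios come from the same distributions, the two mean AUC values are nearly identical – a preview of the invariance shown in the next section.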

## Analysis of the simulation

We used three different plots – ROC, Concentrated ROC (CROC) (Swamidass2010), and precision-recall – for our simulation analysis. CROC is useful for analysing the early retrieval area of the ROC plot, which is the region with high specificity.

### ROC plots are unchanged between balanced and imbalanced datasets

The ROC curves appear to be identical under the balanced and imbalanced cases. The two ROC plots show the same curves despite the different positive-to-negative ratios. Both plots have five curves, one for each performance level.

#### Comparison of AUC scores between balanced and imbalanced

Accordingly, the AUC (area under the ROC curve) scores are also unchanged. The scores also indicate the same performance (both 0.8) for the poor early retrieval and the good early retrieval levels.

| Performance level     | Balanced | Imbalanced |
|-----------------------|----------|------------|
| Random                | 0.5      | 0.5        |
| Poor early retrieval  | 0.8      | 0.8        |
| Good early retrieval  | 0.8      | 0.8        |
| Excellent             | 0.98     | 0.98       |
| Perfect               | 1        | 1          |
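The invariance of the AUC can be checked directly: because the AUC only depends on the ranking of positives against negatives, replicating every negative score has no effect on it. The toy scores below are illustrative, not the article's simulated scores.

```python
def roc_auc(pos, neg):
    # AUC equals P(positive score > negative score), ties counted as 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

pos = [0.9, 0.8, 0.7, 0.4]        # classifier scores for positives
neg = [0.85, 0.6, 0.3, 0.2]       # classifier scores for negatives

auc_balanced = roc_auc(pos, neg)
auc_imbalanced = roc_auc(pos, neg * 10)  # replicate each negative tenfold
```

Both the number of winning pairs and the total number of pairs scale by the same factor, so the ratio – the AUC – is unchanged.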

#### Comparison of ROC points between balanced and imbalanced

We use two ROC points to show how the interpretation of the same point differs between the balanced and imbalanced cases. We selected the point (0.16, 0.5) on the curve of the poor early retrieval level for the comparison.

##### Balanced case

The point for the balanced case represents specificity 0.84 and sensitivity 0.5. It can also be expressed as 500 true positives (1000 * 0.5) and 160 false positives (1000 * 0.16). This classifier would likely be considered good if this point were used for evaluation.

*Figure: ROC curve (blue) of a classifier with the poor early retrieval level for the balanced case; the point selected for comparison is marked with a red circle.*

##### Imbalanced case

The same point for the imbalanced case also represents specificity 0.84 and sensitivity 0.5. Nonetheless, it now corresponds to 500 true positives (1000 * 0.5) and 1600 false positives (10 000 * 0.16). This classifier would likely be considered poor if this point were used for evaluation, but it is difficult to reach this conclusion from directly analysing the ROC curve and the AUC score.

*Figure: ROC curve (blue) of a classifier with the poor early retrieval level for the imbalanced case; the point selected for comparison is marked with a red circle.*
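The arithmetic behind the two interpretations can be written out as a small helper. The precision column is our own addition here, as a preview of why the precision-recall plot exposes the difference that the ROC point hides.

```python
def counts_at_roc_point(fpr, tpr, n_pos, n_neg):
    # Convert an ROC point into raw counts for a given class ratio
    tp = tpr * n_pos                 # true positives recovered
    fp = fpr * n_neg                 # false positives incurred
    return tp, fp, tp / (tp + fp)    # third value is the precision

tp_b, fp_b, prec_b = counts_at_roc_point(0.16, 0.5, 1000, 1000)    # balanced
tp_i, fp_i, prec_i = counts_at_roc_point(0.16, 0.5, 1000, 10_000)  # imbalanced
```

The same ROC point (0.16, 0.5) yields 160 false positives in the balanced case but 1600 in the imbalanced case, so the precision drops from roughly 0.76 to roughly 0.24.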

#### Early retrieval area of ROC

The early retrieval area of the ROC plot is often used as an additional evaluation approach, since it is useful for evaluating high-ranked instances. For instance, the ROC curve of the good early retrieval level shows better performance than that of the poor early retrieval level in this area.

*Figure: the early retrieval area (red rectangle) is the region with high specificity in the ROC space; the ROC curve of the good early retrieval level (green) outperforms that of the poor early retrieval level (blue) in this area.*

#### Summary of the ROC simulation

The ROC curves fail to explicitly show the difference between the balanced and imbalanced cases. Moreover, the AUC (ROC) scores are inadequate for evaluating early retrieval performance, especially when curves cross each other. In summary, ROC requires special caution when applied to imbalanced datasets, and the early retrieval performance needs to be checked separately.

### The CROC plots are also unchanged between balanced and imbalanced datasets

The concentrated ROC (CROC) developed by Swamidass et al. (Swamidass2010) is useful for evaluating higher-ranked instances. It is specialized for analysing the early retrieval area by applying a magnifier function to the x-axis of a ROC plot. The x-axis of the CROC plot is a transformed false positive rate f(FPR) instead of FPR or (1 – specificity), and the y-axis is sensitivity.

#### Magnifier function: exponential function with α = 7

We used an exponential function f(x) = (1 – exp(-αx)) / (1 – exp(-α)) with α = 7 for our simulations. This function expands the x-axis of a ROC plot: for instance, 0.1 is transformed to 0.504, 0.2 to 0.754, and 0.3 to 0.878.

*Figure: the exponential magnifier function expands false positive rates (FPR) to f(FPR) values; the function used in our simulations is f(x) = (1 – exp(-αx)) / (1 – exp(-α)) with α = 7.*
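The magnifier function is straightforward to implement and check. The sketch below reproduces the transformed values quoted above; the function name is ours.

```python
import math

def croc_magnify(fpr, alpha=7.0):
    """Exponential magnifier f(x) = (1 - exp(-alpha*x)) / (1 - exp(-alpha))."""
    return (1 - math.exp(-alpha * fpr)) / (1 - math.exp(-alpha))

# With alpha = 7, small FPR values occupy most of the transformed axis:
# croc_magnify(0.1) ≈ 0.504, croc_magnify(0.2) ≈ 0.754, croc_magnify(0.3) ≈ 0.878
```

Larger α values magnify an even narrower region near FPR = 0; for example, with α = 14 the value 0.05 is already mapped to about 0.503.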

#### Comparison of CROC curves between balanced and imbalanced

The CROC curves are also unchanged under the balanced and imbalanced scenarios. Nonetheless, the curve for the good early retrieval level outperforms that of the poor early retrieval level over a wide range of f(FPR).

*Figure: two CROC plots show the same curves despite the different positive-to-negative ratios; both plots have five curves, one for each performance level, under the balanced and imbalanced scenarios.*

#### Comparison of AUC scores between balanced and imbalanced

Accordingly, all AUC (CROC) scores are unchanged between the balanced and imbalanced datasets. Nonetheless, the AUC of the good early retrieval level (0.56) is better than that of the poor early retrieval level (0.39).

| Performance level     | Balanced | Imbalanced |
|-----------------------|----------|------------|
| Random                | 0.14     | 0.14       |
| Poor early retrieval  | 0.39     | 0.39       |
| Good early retrieval  | 0.56     | 0.56       |
| Excellent             | 0.92     | 0.92       |
| Perfect               | 1        | 1          |

#### Comparison of CROC points between balanced and imbalanced

We use two CROC points to clarify how the interpretation of the same point differs between the balanced and imbalanced cases. We selected the point (0.67, 0.5) on the curve of the poor early retrieval level for the comparison. The false positive rate is approximately 0.16 when f(FPR) is 0.67.
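The mapping from a CROC x-value back to the underlying FPR follows from inverting the magnifier, x = -ln(1 - y(1 - exp(-α))) / α. The helper below is our own sketch of that inversion.

```python
import math

def croc_inverse(y, alpha=7.0):
    # Inverse of f(x) = (1 - exp(-alpha*x)) / (1 - exp(-alpha)):
    # x = -ln(1 - y * (1 - exp(-alpha))) / alpha
    return -math.log(1 - y * (1 - math.exp(-alpha))) / alpha

fpr = croc_inverse(0.67)  # the selected CROC x-value maps back to FPR ≈ 0.16
```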

##### Balanced case

The point for the balanced case represents 500 true positives (1000 * 0.5) and 160 false positives (1000 * 0.16). This classifier would likely be considered good if this point were used for evaluation.

*Figure: CROC curve (blue) of a classifier with the poor early retrieval level for the balanced case; the point selected for comparison is marked with a red circle.*

##### Imbalanced case

The same point for the imbalanced case also represents specificity 0.84 and sensitivity 0.5. Nonetheless, it now corresponds to 500 true positives (1000 * 0.5) and 1600 false positives (10 000 * 0.16). This classifier would likely be considered poor if this point were used for evaluation, but, again, it is difficult to reach this conclusion from directly analysing the CROC curve and the AUC score.

*Figure: CROC curve (blue) of a classifier with the poor early retrieval level for the imbalanced case; the point selected for comparison is marked with a red circle.*

#### Optimized magnifier function

The original CROC paper shows several examples of magnifier functions and parameters. For instance, the parameter of the exponential function can be 7, 14, or 80. Nevertheless, the optimal function and parameter are usually unknown.

| α  | FPR    | Approximate f(FPR) |
|----|--------|--------------------|
| 7  | 0.1    | 0.5                |
| 14 | 0.05   | 0.5                |
| 80 | 0.0085 | 0.5                |

#### Summary of the CROC simulation

The CROC plot is useful for evaluating performance in the early retrieval area, which it expands with a magnifier function. Nonetheless, it has the same interpretation issues as the ROC plot when applied to imbalanced datasets. Moreover, the optimal magnifier function and its parameter are usually unknown.

### The precision-recall plots are different between balanced and imbalanced datasets

In contrast to the ROC and CROC plots, the precision-recall plots appear to be different between the balanced and imbalanced datasets. Moreover, the curve for the good early retrieval level outperforms that of the poor early retrieval level over a wide range of recall values.

*Figure: two precision-recall plots show different curves for the different positive-to-negative ratios; both plots have five curves, one for each performance level, under the balanced and imbalanced scenarios.*

#### Comparison of AUC scores between balanced and imbalanced

Accordingly, all AUC (precision-recall) scores are different between the balanced and imbalanced datasets. Moreover, the AUC scores of the good early retrieval level are better than those of the poor early retrieval level for both the balanced (0.84 vs. 0.74) and imbalanced (0.51 vs. 0.23) scenarios.

| Performance level     | Balanced | Imbalanced |
|-----------------------|----------|------------|
| Random                | 0.5      | 0.09       |
| Poor early retrieval  | 0.74     | 0.23       |
| Good early retrieval  | 0.84     | 0.51       |
| Excellent             | 0.98     | 0.9        |
| Perfect               | 1        | 1          |
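Unlike the ROC AUC, a precision-recall summary score does change when negatives are replicated, because precision has the number of negatives in its denominator. The sketch below uses average precision, one common estimator of the area under the precision-recall curve (the article's exact integration method may differ), on toy scores of our own choosing.

```python
def average_precision(pos, neg):
    # Rank all scores in decreasing order and average the precision
    # measured at the rank of each positive instance.
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                    key=lambda pair: -pair[0])
    tp, precisions = 0, []
    for rank, (score, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

pos = [0.9, 0.8, 0.7]   # toy scores for the positives
neg = [0.85, 0.6]       # toy scores for the negatives

ap_balanced = average_precision(pos, neg)
ap_imbalanced = average_precision(pos, neg * 3)  # triple the negatives
```

With the same ranking quality, the score drops as soon as the negatives are replicated – the behaviour that makes the precision-recall plot sensitive to the class ratio.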

#### Comparison of precision-recall points between balanced and imbalanced

We use two precision-recall points to clarify how the interpretation differs between the balanced and imbalanced cases. We selected the two points (0.75, 0.5) and (0.25, 0.5) on the curves of the poor early retrieval level for this comparison.

##### Balanced case

The point for the balanced case represents recall (sensitivity) 0.5 and precision 0.75. It can also be expressed as 500 true positives (1000 * 0.5) with 75% of the positive predictions being correct. This classifier would likely be considered good if this point were used for evaluation.

*Figure: precision-recall curve (blue) of a classifier with the poor early retrieval level for the balanced case; the point selected for comparison is marked with a red circle.*

##### Imbalanced case

The precision-recall curve changes under the imbalanced scenario. The point with the same recall 0.5 now represents precision 0.25. It can also be expressed as 500 true positives (1000 * 0.5) with only 25% of the positive predictions being correct. This classifier would likely be considered poor if this point were used for evaluation, and this matches the interpretation obtained from directly analysing the precision-recall curve and the AUC score.

*Figure: precision-recall curve (blue) of a classifier with the poor early retrieval level for the imbalanced case; the point selected for comparison is marked with a red circle.*
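The drop in precision follows directly from the definition precision = TP / (TP + FP); the false-positive count implied by the balanced point is a derivation of ours, not a figure from the article.

```python
def precision(tp, fp):
    # Fraction of positive predictions that are actually positive
    return tp / (tp + fp)

# Imbalanced case: 500 true positives alongside 1500 false positives.
prec_imbalanced = precision(500, 1500)   # 0.25

# Balanced case: precision 0.75 at the same 500 true positives implies
# roughly 167 false positives, via fp = tp * (1 - p) / p.
fp_balanced = 500 * (1 - 0.75) / 0.75
```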

#### Summary of the precision-recall simulation

The precision-recall plot is able to show the performance difference between the balanced and imbalanced cases. It is also useful for revealing the performance on high-ranked instances.

## Conclusion

The ROC plot is a popular and powerful tool for evaluating binary classifiers. Nonetheless, it has some limitations when applied to imbalanced datasets. Other plots, such as CROC and precision-recall, are less frequently used than ROC. Our simulation analysis indicates that only the precision-recall plot changes with the ratio of positives to negatives, and that it is also more informative than the ROC plot when applied to imbalanced datasets.

## Relevant analyses

We have two more analyses regarding the difference between the ROC and precision-recall plots. Please see the following pages.
