Re-analysis of a previous study with imbalanced data

On this page, we show how the interpretation of results can differ between ROC and precision-recall by re-analysing a previously published study. The study we selected is a microRNA gene identification study that applies a binary classifier to imbalanced datasets (Huang2007). We refer to it as the miRFinder study, after the name of its algorithm. The study uses the ROC plot as its main performance evaluation measure to compare miRFinder with several other classifiers.

The main goal of this re-analysis is to reveal whether the conclusions of the study change when the classifiers are evaluated with precision-recall instead of ROC.

microRNA gene discovery

MicroRNAs (miRNAs) are a class of small RNAs with important regulatory roles in plants and animals. Although many bioinformatics algorithms have been proposed to identify miRNA genes from genome-wide sequences, it is still challenging to predict correct miRNA genes with high accuracy.

MicroRNAs are known to form a distinct hairpin structure after transcription. Thousands of such hairpin structures are typically found in a genome-wide analysis, but only a small fraction of them are actual miRNA genes. Hence, the datasets used in miRNA gene discovery studies tend to be strongly imbalanced, as the number of positives (true miRNA genes) is much smaller than the number of negatives (miRNA candidates or pseudo-miRNA genes).

Hairpin structure of hsa-mir-181a.
A part of the hairpin structure of the human miRNA gene hsa-mir-181a. The original output image of RNAfold (Hofacker1994) was modified to show this partial structure.

5 algorithms for miRNA gene discovery

We compared miRFinder with four other algorithms. Three of them – miPred, RNAmicro, and ProMiR – are algorithms specialized for identifying miRNA genes. The remaining one is RNAfold, a tool that predicts RNA secondary structure by minimizing thermodynamic free energy. It can be used to predict a hairpin structure, and the majority of miRNA gene discovery algorithms integrate RNAfold to calculate the minimum free energy (MFE).

All of these tools produce scores, and therefore it is possible to create both ROC and precision-recall curves for each of them.
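
As a minimal sketch of how such curves can be computed from classifier scores (with synthetic labels and scores as placeholders, not the actual tool outputs), scikit-learn can be used as follows.

    # Sketch only: y_true and scores are synthetic placeholders, not the T1/T2 data.
    import numpy as np
    from sklearn.metrics import roc_curve, precision_recall_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)             # 1 = real miRNA, 0 = pseudo hairpin
    scores = y_true + rng.normal(0.0, 1.0, size=1000)  # hypothetical classifier scores

    fpr, tpr, _ = roc_curve(y_true, scores)                # ROC: FPR vs TPR
    prec, rec, _ = precision_recall_curve(y_true, scores)  # PR: recall vs precision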

Test datasets T1 and T2

We used two test datasets to evaluate the five algorithms with ROC and precision-recall and labelled these datasets as T1 and T2.

T1: miRBase miRNA genes and a SW algorithm

In the miRFinder study (Huang2007), the positive records are retrieved from miRBase, the central database of known miRNA genes. For the negative records, a Smith-Waterman-like (SW) algorithm is used to find hairpin structures in evolutionarily conserved regions between human and mouse.

We downloaded the dataset provided by the miRFinder study and filtered it to create a non-redundant dataset. The resulting dataset, named T1, contains 819 positives and 11 060 negatives.

T1 test dataset with 819 positives and 11 060 negatives.
The T1 dataset contains 819 positives and 11 060 negatives. The positives are actual miRNA genes from the miRBase database. The negatives are generated by the SW algorithm.

T2: Hairpin structures in the C. elegans genome

We added one more test dataset to evaluate the same algorithms under a different condition. We used the method described in the RNAmicro study (Hertel2006) to create a dataset for miRNA genes of C. elegans.

First, we used a tool called RNAz to identify conserved sites that form a hairpin structure in C. elegans. We then downloaded C. elegans miRNA genes from miRBase and defined the positives as the hairpins found by both RNAz and miRBase. The negatives were defined as the regions found by RNAz but not listed in miRBase.

T2 test dataset with 111 positives and 13 444 negatives.
The T2 dataset contains 111 positives and 13 444 negatives. The positives are candidate genes found by both RNAz and miRBase. The negatives are candidate genes found only by RNAz.
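
Note that these class ratios determine the baseline of each precision-recall plot, i.e. the precision expected from a random classifier. A quick sketch of that calculation from the counts given above:

    # Baseline precision of a precision-recall plot = positives / (positives + negatives).
    def pr_baseline(positives, negatives):
        return positives / (positives + negatives)

    print(round(pr_baseline(819, 11060), 3))   # T1: ~0.069
    print(round(pr_baseline(111, 13444), 4))   # T2: ~0.0082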

3 different scenarios of re-analysis

We prepared three different cases to re-analyse the study and denoted them as follows.

  • RA1: The method used in the miRFinder study (a ROC curve and ROC points) on T1
  • RA2: ROC and precision-recall on T1
  • RA3: ROC and precision-recall on T2

RA1: The evaluation method of the original study

We used the same evaluation method as the miRFinder study to discuss the potential issues related to this evaluation approach. The method is based on one ROC curve and several ROC points. It is a common method used in many studies, but the comparison is usually invalid because the curve and the points are generated from different datasets.

A ROC curve on T1 and ROC points

With this evaluation method, only one ROC curve was created, for miRFinder, and three ROC points were added for the other algorithms – miPred, RNAmicro, and ProMiR.

A ROC curve and three ROC points for four different algorithms.
A ROC curve (red) for miRFinder and three ROC points for miPred (blue), RNAmicro (green), and ProMiR (purple).

For miPred and ProMiR, the specificity and sensitivity values were retrieved from their original studies. For RNAmicro, the values were estimated from a figure in the miRFinder study.

Tool        Specificity    1 – Specificity    Sensitivity
miPred      0.98           0.02               0.95
RNAmicro    0.90           0.10               0.90
ProMiR      0.96           0.04               0.73
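
A ROC point is simply the pair (1 − specificity, sensitivity). The sketch below shows one way such points could be overlaid on a ROC curve with matplotlib; the values are taken from the table above, and the commented-out curve would come from roc_curve() on T1 scores, which we do not reproduce here.

    import matplotlib.pyplot as plt

    # ROC points are plotted as (1 - specificity, sensitivity).
    points = {
        "miPred":   (1 - 0.98, 0.95),
        "RNAmicro": (1 - 0.90, 0.90),
        "ProMiR":   (1 - 0.96, 0.73),
    }
    for name, (x, y) in points.items():
        plt.scatter(x, y, label=name)

    # plt.plot(fpr, tpr, label="miRFinder")  # curve from roc_curve() on the T1 scores
    plt.xlabel("1 - Specificity (FPR)")
    plt.ylabel("Sensitivity (TPR)")
    plt.legend()
    plt.show()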

Mismatches between ROC points and the corresponding ROC curves on T1

There can be several reasons to use ROC points instead of ROC curves, for instance when tools are not freely available or when classifiers produce no scores. Nevertheless, for a meaningful comparison, the curve and the points should be created from the same dataset.

In this analysis, however, the three ROC points were not generated from T1. We therefore added three ROC curves generated from T1 for comparison. The result clearly indicates that none of the three ROC points lies on the corresponding ROC curve of the same algorithm.

Three ROC points and the corresponding ROC curves on T1.
Three ROC points for miPred (blue), RNAmicro (green), and ProMiR (purple). The corresponding ROC curves were created for the same algorithms on T1.
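
One simple way to check such a mismatch is to interpolate the curve's sensitivity at the reported point's false positive rate and compare it with the point's sensitivity. The sketch below assumes fpr and tpr arrays as returned by roc_curve() on T1; the variable names are hypothetical and not taken from the study's code.

    import numpy as np

    def tpr_on_curve(fpr, tpr, point_fpr):
        # fpr is sorted in ascending order, as returned by sklearn's roc_curve().
        return float(np.interp(point_fpr, fpr, tpr))

    # Example check for miPred's reported point (FPR = 0.02, TPR = 0.95):
    # on_curve = abs(tpr_on_curve(fpr_mipred, tpr_mipred, 0.02) - 0.95) < 0.01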

Summary of the original evaluation method

The method is based on one ROC curve and several ROC points. They were all generated from different datasets, and the comparison was therefore simply invalid.

RA2: Re-analysis of classifiers with T1

We tested the five algorithms on T1 and created the ROC and the precision-recall plots. The result shows that the interpretation appears to be different between ROC and precision-recall.

ROC indicates good performance for all algorithms

The ROC curves and the AUC scores show that all classifiers have a very good to excellent overall performance level. The top performing classifiers, miRFinder and miPred, have very similar curves, although the curve of miPred appears slightly better than that of miRFinder. Conversely, the AUC score of miRFinder (0.992) is better than that of miPred (0.991), but the difference is very small.

ROC plot for five different tools - miRFinder, miPred, RNAmicro, ProMiR, and RNAfold - on T1.
ROC curves for five different tools – miRFinder (red), miPred (blue), RNAmicro (green), ProMiR (purple), and RNAfold (orange) – tested on T1. AUC scores are shown next to the tool names in the legend.
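
The AUC values shown in the legend summarise each ROC curve as a single number. A sketch of how such scores can be computed with scikit-learn (synthetic labels and scores, not the T1 data):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=1000)                  # placeholder labels
    scores_by_tool = {                                      # hypothetical score vectors
        "tool_a": y_true + rng.normal(0.0, 0.8, size=1000),
        "tool_b": y_true + rng.normal(0.0, 1.5, size=1000),
    }
    roc_aucs = {name: roc_auc_score(y_true, s) for name, s in scores_by_tool.items()}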

Precision-recall reveals poor performance of some algorithms

The precision-recall curves and the AUC scores also show that all classifiers have a good to excellent performance level. Unlike ROC, however, the plot clearly shows a difference between the two top performing classifiers, miRFinder and miPred. Accordingly, the AUC score of miPred (0.976) is better than that of miRFinder (0.945).

Nonetheless, some algorithms clearly show declining precision towards higher recall values. This deterioration of precision is strong for RNAmicro and, to some extent, also noticeable for RNAfold and ProMiR.

Precision-Recall plot for five different tools - miRFinder, miPred, RNAmicro, ProMiR, and RNAfold - on T1.
Precision-recall plot for five different tools – miRFinder (red), miPred (blue), RNAmicro (green), ProMiR (purple), and RNAfold (orange) – tested on T1. AUC scores are shown next to the tool names in the legend.
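
For the precision-recall plot, the area under the curve can be computed from the curve itself or approximated with average precision; the exact interpolation used for the scores in the legend may differ slightly. A sketch with synthetic data:

    import numpy as np
    from sklearn.metrics import auc, average_precision_score, precision_recall_curve

    rng = np.random.default_rng(2)
    y_true = rng.integers(0, 2, size=1000)              # placeholder labels, not T1
    scores = y_true + rng.normal(0.0, 1.0, size=1000)   # hypothetical classifier scores

    prec, rec, _ = precision_recall_curve(y_true, scores)
    pr_auc = auc(rec, prec)                             # trapezoidal area under the PR curve
    ap = average_precision_score(y_true, scores)        # step-wise alternative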

Summary of the re-analysis with T1

The interpretation of the ROC plot under T1 leads to the following conclusions.

  • All classifiers perform well.
  • The two top performing classifiers are miRFinder and miPred.

In contrast, the interpretation of the precision-recall plot under T1 leads to the following conclusions.

  • All classifiers perform well.
  • Some algorithms show declining precision towards higher recall values.
  • The two top performing classifiers are miRFinder and miPred.
  • miPred performs better than miRFinder.

RA3: Re-analysis of classifiers with T2

Similar to T1, we tested the five algorithms on T2 and analysed the difference between ROC and precision-recall. Both T1 and T2 are imbalanced datasets, but T2 is more strongly imbalanced than T1.

ROC indicates good performance for all algorithms

The ROC plot on T2 shows very different curves from those on T1. This difference indicates that T2 is harder to predict correctly than T1. The curves and AUC scores indicate that all classifiers have a good performance level.

The plot shows that RNAmicro outperforms the other classifiers over a wide range of specificity values, whereas miRFinder has the best performance in the early retrieval area.

ROC plot for five different tools - miRFinder, miPred, RNAmicro, ProMiR, and RNAfold - on T2.
ROC curves for five different tools – miRFinder (red), miPred (blue), RNAmicro (green), ProMiR (purple), and RNAfold (orange) – tested on T2. AUC scores are shown next to the tool names in the legend.

Precision-recall reveals poor performance of all algorithms

The precision-recall plot shows that classifier performance generally deteriorates strongly under T2. Over the whole range of recall values, all algorithms except miRFinder have very low precision values that are close to the baseline.
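
This collapse towards the baseline follows directly from how precision depends on the class ratio: with P positives and N negatives, precision = TPR·P / (TPR·P + FPR·N), so even a small false positive rate is heavily penalised when N is much larger than P. A sketch using T2's class counts (the TPR/FPR values are illustrative, not measured):

    def precision_from_rates(tpr, fpr, n_pos, n_neg):
        # Precision implied by an operating point (TPR, FPR) and the class counts.
        tp = tpr * n_pos
        fp = fpr * n_neg
        return tp / (tp + fp)

    # On T2 (111 positives, 13 444 negatives), TPR = 0.9 with FPR = 0.05 already gives:
    print(round(precision_from_rates(0.9, 0.05, 111, 13444), 2))   # ~0.13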

Precision-Recall plot for five different tools - miRFinder, miPred, RNAmicro, ProMiR, and RNAfold - on T2.
Precision-recall plot for five different tools – miRFinder (red), miPred (blue), RNAmicro (green), ProMiR (purple), and RNAfold (orange) – tested on T2. AUC scores are shown next to the tool names in the legend.

Summary of the re-analysis with T2

The ROC plot indicates that all methods have a good performance level, whereas the precision-recall plot indicates that all methods except miRFinder have a performance level close to that of a random classifier.

The interpretation of the ROC plot under T2 leads to the following conclusions.

  • All classifiers perform well.
  • RNAmicro has the best overall performance.
  • miRFinder has the best early retrieval performance.

In contrast, the interpretation of the precision-recall plot under T2 leads to the following conclusions.

  • All classifiers have a very poor performance level under T2.
  • All classifiers except miRFinder have a performance level close to that of a random classifier.

Conclusion

The results of the re-analysis clearly reveal the advantages of precision-recall over ROC, especially when a dataset is imbalanced. Moreover, precision-recall appears to be more intuitive than ROC when tested on both T1 and T2.

Relevant analyses

We have two more analyses regarding the difference between the ROC and the precision-recall plots. Please see the following pages.

References

[Huang2007] Ting-Hua Huang, Bin Fan, Max F Rothschild, Zhi-Liang Hu, Kui Li and Shu-Hong Zhao (2007) MiRFinder: an improved approach and software implementation for genome-wide fast microRNA precursor scans. BMC Bioinformatics. 8:341
[Jiang2007] Peng Jiang, Haonan Wu, Wenkai Wang, Wei Ma, Xiao Sun, and Zuhong Lu (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 35(Web Server issue): W339–W344.
[Hertel2006] Jana Hertel and Peter F. Stadler (2006) Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics. 22(14):e197-202
[Nam2005] Jin-Wu Nam, Ki-Roo Shin, Jinju Han, Yoontae Lee, V. Narry Kim and Byoung-Tak Zhang (2005) Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res. 33(11): 3570–3581.
[Hofacker1994] Ivo L. Hofacker, Walter Fontana, Peter F. Stadler, L. Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster (1994) Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 125: 167-188