Hello,
I am running an experiment comparing the performance of various models using the precision-recall area under the curve (PR AUC) score. Other than comparing the raw scores against each other, are there any ways to compare them in a more statistically rigorous manner? Some candidate tests I read about are the Wilcoxon signed-rank test and the Mann-Whitney U test, but I am not sure how to apply either to the results of my analysis.
As noted in the list of Frequently Asked-for Statistics (FASTats, see the Important Links section of the Statistical Procedures Community page), the precision-recall curve and the area under it can be displayed either with PROC LOGISTIC in SAS Viya or with the PRcurve macro in SAS 9. Unlike the ROC curve, however, no test is available to compare the areas under PR curves from competing models. Areas under ROC curves, on the other hand, can be compared using the ROCCONTRAST statement in PROC LOGISTIC in SAS 9 or SAS Viya; see the example in the PROC LOGISTIC documentation.
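For reference, a minimal sketch of that ROCCONTRAST comparison. The dataset name (work.mydata), response (y), and predictor sets are hypothetical placeholders; substitute your own variables.

```sas
/* Hypothetical data set and variable names, for illustration only.   */
/* Each ROC statement defines a competing model; ROCCONTRAST tests    */
/* the differences in the areas under their ROC curves.               */
proc logistic data=work.mydata plots=roc;
   model y(event='1') = x1 x2 x3 / nofit;   /* nofit: fit only the ROC models below */
   roc 'Model A' x1 x2;
   roc 'Model B' x1 x3;
   roccontrast reference('Model A') / estimate e;
run;
```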
Unless you have multiple PR AUC estimates per model, or you can aggregate some models together, you probably won't be able to use the nonparametric tests you list. The same would be true of any parametric tests that come to mind. One way or another, you need a measure of variability within each model/condition in order to do any testing or to construct confidence bounds.
SteveDenham
Hmmm... so what I did was train k models for the control and k models for each 'treatment' condition using k-fold cross-validation. I calculated the PR AUC score on every test fold, then the mean PR AUC across all folds, for every condition (control, and each treatment separately). I want to compare the mean PR AUC of each 'treatment' condition against the control (to see whether there are real differences in the results) and to compare the 'treatment' conditions against each other. Does the setup I described not work for either the nonparametric or the parametric comparison methods?
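If the control and treatment models are scored on the same k test folds, the per-fold PR AUC values can be treated as paired observations, so a Wilcoxon signed-rank test on the per-fold differences is one option. A minimal sketch, where the dataset name (fold_auc), variable names, and the fold values themselves are made up purely for illustration:

```sas
/* Hypothetical per-fold PR AUC scores; the numbers and names below   */
/* are made up for illustration only.                                 */
data fold_auc;
   input fold control treatment;
   diff = treatment - control;   /* paired difference on each fold */
   datalines;
1 0.71 0.74
2 0.68 0.73
3 0.70 0.69
4 0.73 0.77
5 0.69 0.72
;
run;

/* Among its "Tests for Location: Mu0=0" results, PROC UNIVARIATE     */
/* reports the Wilcoxon signed-rank statistic (S) and its p-value.    */
proc univariate data=fold_auc;
   var diff;
run;
```

The Mann-Whitney U (Wilcoxon rank-sum) version would instead come from PROC NPAR1WAY with the WILCOXON option, but since each treatment is evaluated on the same folds as the control, the paired signed-rank form seems the more natural fit here.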