GuyTreepwood
Obsidian | Level 7

Hello,

 

I am running an experiment comparing the performance of several models using the area under the precision-recall curve (PR AUC). Beyond comparing the raw scores against each other, is there a more statistically rigorous way to compare them? Two candidate tests I read about were the Wilcoxon signed-rank and Mann-Whitney U tests, but I am not sure how to apply either to the results of my analysis.
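For concreteness, a small Python sketch of the metric in question (the thread itself is about SAS; the data here are randomly generated, and `scikit-learn` is just one common way to compute it outside SAS):

```python
# Hypothetical example: two ways to estimate a PR AUC in Python.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)            # binary labels
y_score = y_true * 0.3 + rng.random(200) * 0.7   # noisy scores correlated with labels

# Average precision: a step-wise summary of the PR curve.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the PR curve (a slightly different estimate).
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)
```

Both quantities land in [0, 1]; the two estimates differ slightly because average precision does not interpolate between points on the curve.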

4 REPLIES
StatDave
SAS Super FREQ

As noted in the list of Frequently Asked-for Statistics (FASTats, see the Important Links section of the Statistical Procedures Community page), the precision-recall curve and the area under it can be displayed either using PROC LOGISTIC in SAS Viya or using the PRcurve macro in SAS 9. Unlike the ROC curve, however, no test is available to compare the areas under PR curves from competing models. Areas under ROC curves, on the other hand, can be compared using the ROCCONTRAST statement in PROC LOGISTIC in SAS 9 or SAS Viya; see the example in the PROC LOGISTIC documentation.

SteveDenham
Jade | Level 19

Unless you have multiple PR AUC estimates per model, or you can aggregate some models together, you probably won't be able to use the nonparametric tests you list. The same is true of the parametric tests that come to mind. One way or another, you need a measure of variability within each grouping variable/model before you can do any testing or construct confidence bounds.

 

SteveDenham

GuyTreepwood
Obsidian | Level 7

Hmmm...so what I did was train k models for the control and k models for every 'treatment' condition using k-fold cross-validation. I computed the PR AUC score on every test fold, then the mean PR AUC across all folds for each 'condition' (control and each treatment separately). I wanted to compare the mean PR AUC of each 'treatment' model against the control (to see whether there are real differences in the results) and to compare the 'treatment' model scores against each other. Does this setup not work for either the nonparametric or parametric comparison methods?
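A sketch of the paired comparison this setup allows, assuming the same k folds are used for the control and each treatment (the fold scores below are invented; `scipy` is used only to illustrate the mechanics):

```python
# Per-fold PR AUC scores from k = 10 shared CV folds (hypothetical numbers).
import numpy as np
from scipy.stats import wilcoxon

control_auc   = np.array([0.71, 0.68, 0.74, 0.70, 0.69, 0.72, 0.73, 0.67, 0.70, 0.71])
treatment_auc = np.array([0.74, 0.71, 0.75, 0.73, 0.70, 0.76, 0.74, 0.70, 0.72, 0.74])

# Because the folds are shared, the scores are paired: the Wilcoxon
# signed-rank test applies to the per-fold differences. (Mann-Whitney U
# is for independent samples, which these are not.)
stat, p_value = wilcoxon(treatment_auc, control_auc)
```

One caveat worth knowing: fold scores from the same cross-validation run are not fully independent (the training sets overlap), so p-values from tests on fold scores should be treated as approximate.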

StatDave
SAS Super FREQ
In order for the treatment PR curves to be independent, the treatment models must have been fit to independent groups of subjects. In that case, you could fit a single model to each and compare the ROC AUCs as described in SAS Note 45339 (http://support.sas.com/kb/45339). Note that PROC LOGISTIC doesn't provide a standard error for the PR AUC as it does for the ROC AUC.

If the same subjects were used for the treatments, then the PR or ROC curves would be dependent and, as described for dependent ROC AUCs in the LOGISTIC documentation, you would presumably need to estimate a covariance matrix among the AUC values.

Or take a different approach: if the goal is to compare the treatments to control, you could simply fit a single model that includes your multi-level treatment variable (which includes control) as a CLASS variable, possibly interacting it with any other predictors in the model. You could then use tests on the parameters, or contrasts of the parameters, to make inferences about differences among the treatments or about whether the effects of other predictors differ among the treatments.



Discussion stats
  • 4 replies
  • 1596 views
  • 4 likes
  • 3 in conversation