blue34
Calcite | Level 5

I have one classification model and tested it on two samples. I have a confusion matrix for each sample: one is female only, the other is male only, and the samples are not paired. I calculated precision, recall, and F1 score for both. How can I statistically compare precision, recall, and F1 score between female and male? Which test can I use?

6 REPLIES
Cynthia_sas
SAS Super FREQ

Hi:

  In order to help you with the SAS Academy for Data Science classes, we need to know the following information:

1) What level are you working in:

2) Title of class:

3) Lesson in class:

4) Title of Demo or Exercise:

5) Any error messages:

6) Description of problem:

 

  If you are asking a "general" statistical analysis question using SAS/STAT procedures, then you probably want to post your question in a forum that is not aimed at the SAS Academy for Data Science students. Here's the Statistical Analysis forum link: https://communities.sas.com/t5/Statistical-Procedures/bd-p/statistical_procedures (one of them at least -- you'll find the others up on the drop down menu under Analytics).

 

Cynthia

StatDave
SAS Super FREQ

Anytime you have separate models and want to make a comparison, you need to devise a single, omnibus model that can be used to make the comparisons. The statistics you mention, recall (also known as sensitivity) and precision (also known as positive predictive value or PPV), are computed from a 2x2 table. In your case, it appears you have two such tables, which can be thought of as a single 2x2x2 table. You can estimate and test the statistics and their differences by fitting a single model to this single table. Analysis of tabular data like this is typically done with a Poisson model fit to the counts of the table. These statistics are ratios involving the cell counts, so they are nonlinear functions of the cell counts. The easiest way to fit the model and to estimate and test the statistics is to fit a Poisson model to the cell counts of the table using PROC NLMIXED. You can use ESTIMATE statements to do the estimation and testing.

 

As an example, see the neuralgia data in the example titled "Logistic Modeling with Categorical Predictors" in the PROC LOGISTIC documentation. We can use the Sex variable and a dichotomized Duration variable along with the binary Pain variable to form a 2x2x2 table. The following statements create the dichotomized Duration variable, Dur, and use PROC FREQ to create a data set of cell counts, Cells. The PROC NLMIXED step fits a Poisson model to the data. Note that a single parameter in the model is associated with each of the eight cells in the table. Since the Poisson model models the log of the mean, exponentiating any parameter produces an estimate of the count in the cell that the parameter is associated with. Using this fact, you can specify the formula for each statistic in an ESTIMATE statement with exp(parameter) in place of each cell count in the formula. That is done for each of the statistics (precision, recall, and F1) for each sex. For each statistic, an additional ESTIMATE statement estimates the difference between the sexes.

 

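/* Dichotomize Duration: Dur=1 if duration is 20 or more, otherwise 0 */
data n;
set neuralgia;
Dur=(duration>=20);
run;

/* Save the 2x2x2 cell counts in data set Cells (variable COUNT) */
proc freq data=n;
table sex*dur*pain/out=cells;
run;

/* Poisson model with one log-mean parameter per cell. */
/* The large DF= value makes the tests effectively large-sample Z tests. */
proc nlmixed data=cells df=1e8;
mu=exp( f0n*(sex="F" and dur=0 and pain="No") + f0y*(sex="F" and dur=0 and pain="Yes") + 
f1n*(sex="F" and dur=1 and pain="No") + f1y*(sex="F" and dur=1 and pain="Yes") + 
m0n*(sex="M" and dur=0 and pain="No") + m0y*(sex="M" and dur=0 and pain="Yes") + 
m1n*(sex="M" and dur=1 and pain="No") + m1y*(sex="M" and dur=1 and pain="Yes") );
model count ~ poisson(mu);
/* Below, Dur=1 acts as the predicted positive and Pain="Yes" as the actual positive */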
estimate 'F precision' exp(f1y)/(exp(f1y)+exp(f1n));
estimate 'M precision' exp(m1y)/(exp(m1y)+exp(m1n));
estimate 'Precision: M-F' exp(m1y)/(exp(m1y)+exp(m1n)) - exp(f1y)/(exp(f1y)+exp(f1n));
estimate 'F recall' exp(f1y)/(exp(f1y)+exp(f0y));
estimate 'M recall' exp(m1y)/(exp(m1y)+exp(m0y));
estimate 'Recall: M-F' exp(m1y)/(exp(m1y)+exp(m0y)) - exp(f1y)/(exp(f1y)+exp(f0y));
estimate 'F1 F' 2*( ( exp(f1y)/(exp(f1y)+exp(f1n)) * exp(f1y)/(exp(f1y)+exp(f0y)) ) / 
                    ( exp(f1y)/(exp(f1y)+exp(f1n)) + exp(f1y)/(exp(f1y)+exp(f0y)) )
                  );
estimate 'F1 M' 2*( ( exp(m1y)/(exp(m1y)+exp(m1n)) * exp(m1y)/(exp(m1y)+exp(m0y)) ) / 
                    ( exp(m1y)/(exp(m1y)+exp(m1n)) + exp(m1y)/(exp(m1y)+exp(m0y)) )
                  );
estimate 'F1 M-F' 
                2*( ( exp(m1y)/(exp(m1y)+exp(m1n)) * exp(m1y)/(exp(m1y)+exp(m0y)) ) / 
                    ( exp(m1y)/(exp(m1y)+exp(m1n)) + exp(m1y)/(exp(m1y)+exp(m0y)) )
                  ) - 
                2*( ( exp(f1y)/(exp(f1y)+exp(f1n)) * exp(f1y)/(exp(f1y)+exp(f0y)) ) / 
                    ( exp(f1y)/(exp(f1y)+exp(f1n)) + exp(f1y)/(exp(f1y)+exp(f0y)) )
                  );
run;
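
If only the two confusion matrices are available rather than the raw data, the Cells data set could probably be entered directly instead of being created by PROC FREQ. The following is a minimal sketch under that assumption. The counts are hypothetical placeholders, not results from any real data. The variable names match the example above so that the PROC NLMIXED step can be reused unchanged, with Dur standing in for the predicted class (1 = predicted positive) and Pain for the actual class ("Yes" = actual positive).

```sas
/* Hypothetical cell counts: replace with the counts from your own */
/* female and male confusion matrices.                             */
data cells;
   input sex $ dur pain $ count;
   datalines;
F 0 No  50
F 0 Yes 20
F 1 No  10
F 1 Yes 40
M 0 No  60
M 0 Yes 10
M 1 No  20
M 1 Yes 30
;
run;
```

With these names, the row with dur=1 and pain="Yes" holds the true positives, dur=1 and pain="No" the false positives, and dur=0 and pain="Yes" the false negatives for each sex.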
blue34
Calcite | Level 5
I thought this was a very simple Z test: apply a Z test to the precision of females and males and check the p-value, then apply a Z test to the recall of females and males and check the p-value, to see whether they are statistically the same or not.

I am not sure we need this complication. What do you think?
blue34
Calcite | Level 5
Also, I don't have separate models. I have one model and tested it on two different samples.

As I stated: "I have one classification model and tested on two samples"
StatDave
SAS Super FREQ

If you fit your model separately to males and to females so that you have two sets of parameter estimates, then you have two models: the same model form, yes, but still two estimated models. Fitting the model to ALL the data allows you to use the model parameters to estimate each statistic and compare them using an appropriate measure of variance. That is what that code does.

 

I suppose an alternative to that, assuming you have the estimates and their variances for the two sexes, would be to compute the standard error of the difference as the square root of the sum of the two variances. The difference divided by the standard error of the difference would then be a Z statistic. That could be done for precision and again for recall, assuming that those difference statistics are normally distributed and that the measures in the two sexes are independent.
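
The Z statistic described above can be sketched in a DATA step. This is only an illustration under stated assumptions: the counts are hypothetical, the variable names are invented for the sketch, and each variance is taken to be the usual binomial variance p(1-p)/n, with n = TP+FP for precision and n = TP+FN for recall. F1 is not covered here because its variance is not a simple binomial variance.

```sas
data ztest;
   /* Hypothetical counts: replace with your own confusion matrices */
   tp_f=40; fp_f=10; fn_f=20;   /* females */
   tp_m=30; fp_m=20; fn_m=10;   /* males   */

   /* Precision = TP/(TP+FP), assumed binomial variance p(1-p)/n */
   prec_f = tp_f/(tp_f+fp_f);   vprec_f = prec_f*(1-prec_f)/(tp_f+fp_f);
   prec_m = tp_m/(tp_m+fp_m);   vprec_m = prec_m*(1-prec_m)/(tp_m+fp_m);

   /* Recall = TP/(TP+FN), assumed binomial variance p(1-p)/n */
   rec_f = tp_f/(tp_f+fn_f);    vrec_f = rec_f*(1-rec_f)/(tp_f+fn_f);
   rec_m = tp_m/(tp_m+fn_m);    vrec_m = rec_m*(1-rec_m)/(tp_m+fn_m);

   /* Z = difference / sqrt(sum of the two variances) */
   z_prec = (prec_m - prec_f) / sqrt(vprec_m + vprec_f);
   z_rec  = (rec_m  - rec_f ) / sqrt(vrec_m  + vrec_f );

   /* Two-sided p-values from the standard normal distribution */
   p_prec = 2*(1 - probnorm(abs(z_prec)));
   p_rec  = 2*(1 - probnorm(abs(z_rec)));
run;

proc print data=ztest;
   var prec_f prec_m z_prec p_prec rec_f rec_m z_rec p_rec;
run;
```

Because the male and female samples are unpaired, treating the two estimates as independent seems reasonable here, but the normal approximation may be poor when the counts are small.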