Re: External Validation of a Logistic Regression Model

CraigSmith · Posted 04-04-2023 05:55 AM

Dear all,

I am trying to externally validate a logistic regression model.

From my reading - I have identified I need to

1. calculate the Linear Predictor,

2. Transport the original regression coefficients to the external dataset and calculate the linear predictor.

3. Determine the amount of miscalibration in intercept and calibration slope.

4. Determine Discrimination and Calibration in the external dataset

I have the coefficients and values required for step 1 but do not know what code I'd need to achieve these steps.

Does anyone have code that would work for this please?

Many thanks,

Craig

StatDave · Posted 04-04-2023 12:25 PM

See this note (http://support.sas.com/kb/22630) that discusses evaluating calibration and discrimination in logistic models.

Using a separate data set for evaluation, the misclassification error rate and several other statistics can be obtained for training and test (external) data sets using the PARTITION statement in PROC HPLOGISTIC. The input data set should combine the two data sets with a variable that identifies the two. The specified model will be fit to the training data and evaluated using multiple statistics for both parts of the data. See the example titled "Partitioning Data" in the HPLOGISTIC documentation.

CraigSmith · Posted 04-08-2023 09:16 AM

Het Stat Dave,

Thank you for your explanation - that resource for discrimination / calibration is helpful - I have used several of these during model development 🙂

So to clarify - do I "merge" both datasets? Or what is the process for this? I have the coefficients from the development model but am unsure how to proceed from there.

Have a nice Easter,

Craig

StatDave · Posted 04-08-2023 11:15 AM

You don't need the coefficients... they will be re-estimated by PROC HPLOGISTIC. You "concatenate" or "stack" (rather than "merge") your two data sets. Merging, in SAS parlance, combines them side to side. Concatenating adds one on to the end of the other. This is easily done in a DATA step (be sure the data sets use the same variable names) and you can add a TEST variable to identify the two portions:
data both; set training external(in=t); test=t; run;
You then specify DATA=BOTH in PROC HPLOGISTIC. With the PARTITION statement as shown in the documentation example I referred to, that procedure will first re-estimate your model using the training data and then evaluate the model on both the training and external (test) data set. So, it does it all - fitting and validating - in one PROC step.

CraigSmith · Posted 04-08-2023 11:46 AM

Hey StatsDave,

Thanks for the answer. Ah I see - thank you for the explanation. I was informed that coefficients i.e the Linear Predictor needed to be the same for validation - I'm assuming this method does this and fits TRIPOD guidelines?

I'll draw up some code and can try with a colleague next week and report back.

Many thanks,

Craig

CraigSmith · Posted 04-11-2023 10:07 AM

Hi Stats Dave,

Thanks for your advice - I've drawn up some code using the links your provided. Can I just check this code with yourself - variable names should match and I'm looking to have ROC values with CIs / plot and calibration plot ?

Best wishes,

Craig

*XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
				!! TEST CODE FOR VALIDATION !! N.B ORDER OF CASE AND CONTROL??
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;

*PROC HPLOGISTIC DATA = BOTH DESCENDING plots=calibration(CLM ShowObs);
FORMAT SEX SEX. EDUCATC EDUCATC. SMOKE SMOKE. DRINK DRINK. ALCOIC ALCOIC. ;
CLASS 	
		SEX ( REF = '0: Female' )
		EDUCATC	( REF = '16+ Years' )
		SMOKE ( REF = '0: Never')
		DRINK (REF = '0: Never')
		ALCOIC (REF = '6: Never') ;
        
MODEL HNC = AGE SEX EDUCATC
             SMOKE SMOKP
             DRINK ALCOIC
           / ROCCI ROCCONTRAST RL GOF;
RUN;

StatDave · Posted 04-11-2023 12:12 PM

PROC HPLOGISTIC does not have the same statements and options as PROC LOGISTIC. See its documentation. But if what you want is the calibration plot and the area under the ROC curve (AUC) for your test data, then you could just use PROC LOGISTIC as shown in the example titled "Goodness-of-Fit Tests and Calibration" in the LOGISTIC documentation. That example fits the model to a set of training data and the fitted model is then evaluated using a set of test data. The two data sets are stacked as we discussed. At the end of the example, PROC LOGISTIC is again run to evaluate the model on the test portion of the data. That PROC LOGISTIC could be modified as below to provide the AUC and the Brier score that is mentioned in the note I referred to earlier.

proc logistic data=Scored(where=(InLaborForce2=.))
              plots(only)=calibration(clm showobs);
   model InLaborForce = XBeta / gof;
   score data=Scored(where=(InLaborForce2=.)) out=_null_ fitstat;
   roc;
   test intercept = 0, XBeta = 1;
run;

CraigSmith · Posted 04-12-2023 06:46 AM

Forgive me, I am struggling to follow this.

So I need to validate the model using the stacked data, assess ROC with confidence intervals and assess calibration. The validation needs to use the same coefficients from the original dataset. The variables and reference values shouldn't change but how would I rewrite it so these things exist for HPLogistic please? Could you even modify the statement below to show me what needs changed please?

Many thanks,

Craig

*PROC LOGISTIC DATA = BOTH plots=calibration(CLM ShowObs);
FORMAT SEX SEX. EDUCATC EDUCATC. SMOKE SMOKE. DRINK DRINK. ALCOIC ALCOIC. ;
CLASS 	
		SEX ( REF = '0: Female' )
		EDUCATC	( REF = '16+ Years' )
		SMOKE ( REF = '0: Never')
		DRINK (REF = '0: Never')
		ALCOIC (REF = '6: Never') ;
        
MODEL HNC = AGE SEX EDUCATC
             SMOKE SMOKP
             DRINK ALCOIC
           / ROCCI ROCCONTRAST RL GOF;
RUN;

StatDave · Posted 04-12-2023 12:04 PM

Since you seem to want a calibration plot, which PROC HPLOGISTIC cannot provide, I am suggesting that you use PROC LOGISTIC instead. But if you just want the Brier score (labeled "Average Square Error"), which is a calibration measure, and the ROC area, then HPLOGISTIC could be used with the PARTITION statement as shown in the "Partitioning Data" example in the HPLOGISTIC documentation. Those statistics are provided for both the training and test data.

If you use PROC LOGISTIC, the example titled "Goodness-of-Fit Tests and Calibration" in the LOGISTIC documentation goes through the process you need to follow for your data and model which involves one PROC LOGISTIC step to fit the model to the training data and save the linear predictor (XBETA) values for both training and test data. A second PROC LOGISTIC step is then used that fits a model to the test data alone using only the saved XBETA values in order to evaluate the model from the training data on the test data (This works because the XBETA values for the observations in the test data are predicted values from the model on the training data). I added the GOF option and the ROC and SCORE statements in this second step to provide the additional statistics you are interested in.

CraigSmith · Posted 04-12-2023 05:13 PM

Thanks Dave

I'll try this tomorrow and report back 🙂

CraigSmith · Posted 04-18-2023 06:17 AM

Sorry for the delay in getting back - swamped w things.

So I have attempted to make sure all variables are the same in nature and format. Only exception is outcome variable but if this is binary guessing it doesn't matter?

I have exported the model coefficients from the development file into a .rtf file. - is this correct or what would be the next steps?

Many thanks,

Craig

FILENAME COEF 'J:\MED\DentalSchool\DPHU\CraigSmith\ARCAGE\model_coef.txt.';

/* Step 2: Load the coefficients from the file */
data COEF;
   infile 'model_coef.txt';
   input VARIABLE : $20. ESTIMATE;
run;

/* Step 3: Merge the coefficients with the validation dataset */
proc sql;
   create table VALID_COEF as
   select A.*, B.ESTIMATE
   from VALID A left join COEF B
   on A.VARIABLE = B.VARIABLE;
quit;

/* Step 4: Calculate the predicted probabilities */
data VALID_PRED;
   set VALID_COEF;
   PROB = 1 / (1 + exp(-(& 0.4692. + ESTIMATE)));
run;

/* Step 5: Calculate the model performance */
proc logistic data=VALID_PRED;
   model HNC (event='1') = PROB / nofit;
   roc 'Model Performance' / ci=none;
run;