Dear all,
I am trying to externally validate a logistic regression model.
From my reading, I have identified that I need to:
1. Calculate the linear predictor in the development dataset.
2. Transport the original regression coefficients to the external dataset and calculate the linear predictor there.
3. Determine the amount of miscalibration in the intercept and the calibration slope.
4. Determine discrimination and calibration in the external dataset.
I have the coefficients and values required for step 1 but do not know what code I'd need to achieve these steps.
Does anyone have code that would work for this please?
Many thanks,
Craig
See this note (http://support.sas.com/kb/22630) that discusses evaluating calibration and discrimination in logistic models.
Using a separate data set for evaluation, the misclassification error rate and several other statistics can be obtained for training and test (external) data sets using the PARTITION statement in PROC HPLOGISTIC. The input data set should combine the two data sets with a variable that identifies the two. The specified model will be fit to the training data and evaluated using multiple statistics for both parts of the data. See the example titled "Partitioning Data" in the HPLOGISTIC documentation.
Hey Stat Dave,
Thank you for your explanation - that resource for discrimination / calibration is helpful - I have used several of these during model development 🙂
So to clarify - do I "merge" both datasets? Or what is the process for this? I have the coefficients from the development model but am unsure how to proceed from there.
Have a nice Easter,
Craig
data both;
   set training external(in=t);
   test = t;
run;
You then specify DATA=BOTH in PROC HPLOGISTIC. With the PARTITION statement as shown in the documentation example I referred to, that procedure will first re-estimate your model using the training data and then evaluate the model on both the training and external (test) data set. So, it does it all - fitting and validating - in one PROC step.
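For concreteness, the fitting step might then look roughly like the sketch below. This is only a sketch: the response and predictor names (y, x1, x2) are placeholders you would replace with your own variables. The PARTITION statement assigns roles from the test flag created in the DATA step, so the fit statistics are reported separately for the training (test=0) and external (test=1) portions.

proc hplogistic data=both;
   partition rolevar=test(train='0' test='1');
   model y(event='1') = x1 x2;
run;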
Hey StatsDave,
Thanks for the answer. Ah, I see - thank you for the explanation. I was informed that the coefficients, i.e. the linear predictor, needed to be the same for validation - I'm assuming this method does that and fits the TRIPOD guidelines?
I'll draw up some code and can try with a colleague next week and report back.
Many thanks,
Craig
Hi Stats Dave,
Thanks for your advice - I've drawn up some code using the links you provided. Could you check this code for me? The variable names should match, and I'm looking to get ROC values with CIs / a plot, plus a calibration plot.
Best wishes,
Craig
*XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  !! TEST CODE FOR VALIDATION !!  N.B. ORDER OF CASE AND CONTROL??
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX;
*PROC HPLOGISTIC DATA=BOTH DESCENDING plots=calibration(CLM ShowObs);
 FORMAT SEX SEX. EDUCATC EDUCATC. SMOKE SMOKE. DRINK DRINK. ALCOIC ALCOIC.;
 CLASS SEX (REF='0: Female') EDUCATC (REF='16+ Years') SMOKE (REF='0: Never')
       DRINK (REF='0: Never') ALCOIC (REF='6: Never');
 MODEL HNC = AGE SEX EDUCATC SMOKE SMOKP DRINK ALCOIC / ROCCI ROCCONTRAST RL GOF;
RUN;
PROC HPLOGISTIC does not have the same statements and options as PROC LOGISTIC. See its documentation. But if what you want is the calibration plot and the area under the ROC curve (AUC) for your test data, then you could just use PROC LOGISTIC as shown in the example titled "Goodness-of-Fit Tests and Calibration" in the LOGISTIC documentation. That example fits the model to a set of training data and the fitted model is then evaluated using a set of test data. The two data sets are stacked as we discussed. At the end of the example, PROC LOGISTIC is again run to evaluate the model on the test portion of the data. That PROC LOGISTIC could be modified as below to provide the AUC and the Brier score that is mentioned in the note I referred to earlier.
proc logistic data=Scored(where=(InLaborForce2=.))
plots(only)=calibration(clm showobs);
model InLaborForce = XBeta / gof;
score data=Scored(where=(InLaborForce2=.)) out=_null_ fitstat;
roc;
test intercept = 0, XBeta = 1;
run;
Forgive me, I am struggling to follow this.
So I need to validate the model using the stacked data, assess the ROC with confidence intervals, and assess calibration. The validation needs to use the same coefficients as the original dataset, and the variables and reference values shouldn't change - but how would I rewrite the code so all of this works in HPLogistic? Could you modify the statement below to show me what needs to change, please?
Many thanks,
Craig
*PROC LOGISTIC DATA=BOTH plots=calibration(CLM ShowObs);
 FORMAT SEX SEX. EDUCATC EDUCATC. SMOKE SMOKE. DRINK DRINK. ALCOIC ALCOIC.;
 CLASS SEX (REF='0: Female') EDUCATC (REF='16+ Years') SMOKE (REF='0: Never')
       DRINK (REF='0: Never') ALCOIC (REF='6: Never');
 MODEL HNC = AGE SEX EDUCATC SMOKE SMOKP DRINK ALCOIC / ROCCI ROCCONTRAST RL GOF;
RUN;
If you use PROC LOGISTIC, the example titled "Goodness-of-Fit Tests and Calibration" in the LOGISTIC documentation goes through the process you need to follow for your data and model. It involves one PROC LOGISTIC step that fits the model to the training data and saves the linear predictor (XBETA) values for both the training and test data. A second PROC LOGISTIC step then fits a model to the test data alone, using only the saved XBETA values, in order to evaluate the training-data model on the test data. (This works because the XBETA values for the observations in the test data are predicted values from the model fit on the training data.) I added the GOF option and the ROC and SCORE statements in this second step to provide the additional statistics you are interested in.
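In outline, the two steps look something like the sketch below. The dataset and variable names (both, scored, y, y2, x1, x2) are placeholders following the pattern of the documentation example, where y2 is a copy of the response with the test rows set to missing so those rows do not influence the fit but still receive XBETA values:

/* Step 1: fit to the training rows only (y2 is missing for test rows)
   and save the linear predictor XBeta for every observation */
proc logistic data=both;
   model y2(event='1') = x1 x2;
   output out=scored xbeta=XBeta;
run;

/* Step 2: evaluate on the test rows alone, using only the saved XBeta,
   with GOF statistics and a test of calibration intercept and slope */
proc logistic data=scored(where=(y2=.)) plots(only)=calibration(clm);
   model y(event='1') = XBeta / gof;
   test intercept = 0, XBeta = 1;
run;

The TEST statement jointly checks a calibration intercept of 0 and a calibration slope of 1, which corresponds to step 3 in your list.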
Thanks Dave
I'll try this tomorrow and report back 🙂
Sorry for the delay in getting back - swamped with things.
So I have attempted to make sure all the variables are the same in nature and format. The only exception is the outcome variable, but since it is binary I'm guessing that doesn't matter?
I have exported the model coefficients from the development file into a .rtf file - is this correct, or what would be the next steps?
Many thanks,
Craig
FILENAME COEF 'J:\MED\DentalSchool\DPHU\CraigSmith\ARCAGE\model_coef.txt';

/* Step 2: Load the coefficients from the file */
data COEF;
   infile COEF;   /* read via the fileref defined above */
   input VARIABLE : $20. ESTIMATE;
run;

/* Step 3: Merge the coefficients with the validation dataset */
proc sql;
   create table VALID_COEF as
   select A.*, B.ESTIMATE
   from VALID A left join COEF B
   on A.VARIABLE = B.VARIABLE;
quit;

/* Step 4: Calculate the predicted probabilities
   (0.4692 = intercept from the development model) */
data VALID_PRED;
   set VALID_COEF;
   PROB = 1 / (1 + exp(-(0.4692 + ESTIMATE)));
run;

/* Step 5: Calculate the model performance */
proc logistic data=VALID_PRED;
   model HNC (event='1') = PROB / nofit;
   roc 'Model Performance' / ci=none;
run;