Hi,
To validate a multivariate logistic regression model, I'd like to perform a bootstrap analysis that resamples the residuals.
I am a student and new to SAS, and I am having trouble with the practical application. The explanations I have read focus only on resampling observations, not residuals. I did find the following explanation of how to resample residuals in linear regression models, but I am having trouble adapting it to my logistic model, since I model the probability of an event and not an actual value. Could someone help me adapt the following code, or suggest another approach? I am using SAS 9.3. Any help is highly appreciated.
from http://www2.sas.com/proceedings/forum2007/183-2007.pdf
%let regressors = x; %let indata = temp1;
/* 1: perform the regression and get the predicted and residual values */
proc reg data= &INDATA;
model y=&REGRESSORS;
output out=out1 p=yhat r=res;
run;
/* 2: split the data: only the residuals will require URS */
data fit(keep=yhat &REGRESSORS order) resid(keep=res);
set out1;
order+1;
run;
/* 3: this doesn’t do any sampling – it copies the FIT data set repeatedly */
proc surveyselect data=fit out=outfit method=srs samprate=1 rep=1000; run;
/* 4: this does the WR sampling of residuals for each replicate */
data outres2;
do replicate = 1 to 1000;
do order = 1 to numrecs;
p = ceil(numrecs * ranuni(394747373));
set resid nobs=numrecs point=p;
output;
end;
end;
stop;
run;
/* 5: then the randomized residuals are merged with the unrandomized records */
data prepped;
merge outfit outres2;
by replicate order;
new_y=yhat+res;
run;
/* 6: the bootstrap process runs on each replicate */
proc reg data=prepped outest=est1(drop=_:);
model new_y=&REGRESSORS;
by replicate;
run;
/* 7: and the sampling distribution is aggregated */
proc univariate data=est1;
var x;
output out=final pctlpts=2.5, 97.5 pctlpre=ci;
run;
proc print; run;
I think the only actual change is to the PROC REG. You would change that to PROC LOGISTIC.
The other main thing, which you've already pointed out, is that PROC LOGISTIC generates a probability not a 1/0 output.
I'm also not sure how valid resampling residuals is for a binary outcome, since once the prediction is turned into a 0/1 value your residuals can only be -1, 0 or 1.
In essence, though, you pick a cutoff: for example, if PROB>0.7 then Event=1, else Event=0. You could probably add this to your step 2 code. That gives you the estimate to use in the remaining steps as outlined in your original post.
So this is technically easy to do, but I'm not sure whether it is statistically valid.
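A minimal sketch of what steps 1 and 2 might look like after that change, assuming y is coded 0/1 and reusing the data set and macro variables from the original post. The variable name phat and the 0.7 cutoff are just illustrations, and this has not been checked for statistical validity:

```sas
%let regressors = x; %let indata = temp1;

/* 1: fit the logistic model and save the predicted event probability */
proc logistic data=&INDATA;
  model y(event='1') = &REGRESSORS;
  output out=out1 p=phat;       /* phat is an assumed variable name */
run;

/* 2: apply the cutoff, compute residuals, and split the data as before */
data fit(keep=yhat &REGRESSORS order) resid(keep=res);
  set out1;
  order+1;
  yhat = (phat > 0.7);          /* 1 if phat exceeds the cutoff, else 0 */
  res  = y - yhat;              /* can only be -1, 0 or 1 */
run;
```

Note that in step 5, new_y = yhat + res can then fall outside 0/1 (for example 1 + 1 = 2, or 0 + (-1) = -1), so the refit in step 6 would need extra handling. That is one concrete reason for the doubt about validity.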
Thank you both! I decided to apply a different validation method, since I am not sure about the validity of my findings.
This is the only 'implementation paper' I've found on this topic. The authors combine bootstrapping with a weighting step for logistic regressions. For those who are interested: http://jsrad.org/wp-content/2016/Issue%205,%202016/9j.pdf
While not exactly what you are asking for, note that you can get statistics on a chosen validation fraction of your data by using the PARTITION statement in PROC HPLOGISTIC. You can also use the SELECTION statement with CHOOSE=VALIDATE if you want to do model selection using statistics computed on the validation data to select effects in the model. Also note that predicted probabilities from the fitted model using a "leave one out" cross-validation approximation are available in PROC LOGISTIC using the PREDPROBS=CROSSVALIDATE option in the OUTPUT statement. Cross-validated predicted probabilities are also used in producing the classification results from the CTABLE option.
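As a rough sketch of the options mentioned above, assuming the data set temp1 with a 0/1 response y and a regressor x from the original post. The output data set name cvpred and the 30% validation fraction are arbitrary choices for illustration:

```sas
/* Leave-one-out approximated predicted probabilities and a classification table */
proc logistic data=temp1;
  model y(event='1') = x / ctable;
  output out=cvpred predprobs=crossvalidate;  /* cvpred is an assumed name */
run;

/* Hold out a random validation fraction and report fit statistics on it */
proc hplogistic data=temp1;
  model y(event='1') = x;
  partition fraction(validate=0.3);           /* 0.3 is an arbitrary choice */
run;
```

With several candidate regressors, a SELECTION statement using CHOOSE=VALIDATE could be added to the PROC HPLOGISTIC step. Check that your SAS release includes PROC HPLOGISTIC before relying on it.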