Solved: Re: Estimates from a logistic regression model with bootstraps

LD4224 · Posted 06-21-2018 03:46 PM

Hello - I'm trying to derive and save an unbiased logistic regression model using the estimates output from 500 bootstraps. I derived a logistic regression model in my development set, then used the retained variables in a model statement and ran it in 500 bootstrapped replicates of my development set. I then retrieved the estimates from the ODS ParameterEstimates tables and calculated the medians of the intercepts and beta estimates. (Let me know if anyone disagrees with this approach.) I now want to save this model and use it to score an external dataset. I could hard code it, but I want the output I would get by using the SCORE statement in PROC LOGISTIC. Any help would be appreciated! Thanks
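For readers following along, the workflow described above might look roughly like the sketch below. The data set DEV, the response Y, the predictors X1 and X2, and the seed are all hypothetical names chosen for illustration; the original poster's actual code is not shown in the thread. The sketch also assumes continuous predictors only. With CLASS predictors, ParameterEstimates has one row per level (identified by ClassVal0), so the summary step would need to group on that column as well.

```sas
/* Sketch only: DEV, Y, X1, X2 and the seed are hypothetical */

/* 1. Draw 500 bootstrap replicates of the development set */
/*    (unrestricted random sampling = with replacement)    */
proc surveyselect data=dev out=boot seed=12345
                  method=urs samprate=1 outhits reps=500;
run;

/* 2. Refit the same model in each replicate and save the estimates */
ods exclude all;                 /* suppress 500 sets of printed output */
proc logistic data=boot;
   by replicate;
   model y(event='1') = x1 x2;
   ods output ParameterEstimates=pe;
run;
ods exclude none;

/* 3. Median of each coefficient across the replicates */
proc means data=pe noprint nway;
   class variable;
   var estimate;
   output out=bagged median=estimate;
run;
```

The resulting BAGGED data set has one row per model term (Intercept, X1, X2) holding the median estimate.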


PaigeMiller · Posted 06-21-2018 03:52 PM

@LD4224 wrote:

I now want to save this model and use it to score an external dataset.

What model? The model of the bootstrap medians?

I don't think you can get this any other way than by hard-coding it, or perhaps by clever use of a macro.

I have not heard of using the bootstrap medians as the new model.

I derived a logistic regression model in my development set, then used the retained variables in a model statement and ran it in 500 bootstrapped replicates of my development set. I then retrieved the estimates from the ODS ParameterEstimates tables and calculated the medians of the intercepts and beta estimates. (Let me know if anyone disagrees with this approach.)

I assume this means you did some form of stepwise selection (or forward or backward selection) when you fit the original model. If so, I think the bootstrap ought to repeat the stepwise selection in each replicate to see whether different variables are selected; that would be important to know.

Also, I strenuously object to the title of this post, and almost reported it as spam — MODERATORS can you change the title?

--
Paige Miller

pau13rown · Posted 06-21-2018 06:04 PM

i don't understand. Is the issue that you just want to avoid hard coding? If so, why not take the median (across bootstraps) from e.g. proc univariate, make it a macro variable, and then use this as a coefficient in the proc logistic scoring code?

LD4224 · Posted 06-21-2018 07:42 PM

Thanks for replying.

To clarify:

Variable selection was done in the development set using the AIC (stepwise selection, SLENTRY=1 SLSTAY=1). I realize that Harrell and others recommend using bootstrapping for variable selection, but I'm sticking with the stepwise AIC approach.

Using the mean or median of the coefficients obtained in the bootstrapped samples is referred to as "bootstrap aggregating" or "bagging" of coefficients.

It's not that I want to avoid hard scoring. The way I've used Score in the past is as such, which allows me to get the ROC graphs and c statistic for the scored dataset.

PROC LOGISTIC DATA=WORK.BRAIST_SMS;
   CLASS THORACIC (REF='0' PARAM=REF) SMSC2 (REF='1.2' PARAM=REF);
   MODEL VTERM (EVENT='1') = SMSC2 THORACIC COBBMAX;
   /* Score the validation set and save its ROC coordinates */
   SCORE DATA=WORK.VALID_SMS OUT=VALIDP OUTROC=VROC;
   ROC;
   ROCCONTRAST;
RUN;

Is there a way to take the coefficients from the bootstraps and create something that would function like the "outmodel" does below?
proc logistic data = hsb2 outmodel=pout;
   model honcomp = read math;
run;

proc logistic inmodel=pout;
   score clm data = toscore out=pred;
run;

Ideas?

Thanks

pau13rown · Posted 06-21-2018 08:38 PM

still not clear to me what you're asking. Consider my original answer, i.e. use a macro variable. You didn't indicate why that wouldn't work - if you did, that would help me understand your question. Don't worry about explaining the bootstrap etc, i get that, i just don't know what you want (if it's not what i already assumed)

edit: re "It's not that I want to avoid hard scoring" [i assume you meant hard coding], clearly you don't want to hard code, otherwise you would do it in 2 seconds, and you said yourself: "I could hard code it, but ...". So that seems to me to be the issue and that's easily solved with a macro variable

LD4224 · Posted 06-21-2018 11:39 PM

Thanks. Can you explain how the macro would function and how I would write it? Or even how I would hard code it? I have looked all over and can't find any examples.

StatDave · Posted 06-22-2018 10:04 AM

I assume that whatever you did has left you with a final set of model coefficients that you want to use to score a new data set for the purpose of obtaining the ROC analysis. You can score the data from the coefficients as outlined in section 4 of this note. Using the logistic() function as mentioned there to get the predicted probabilities for the new data, you can then use the PRED= option in the ROC statement as shown in this note to get the ROC analysis.
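A minimal sketch of those two steps, under stated assumptions: the data set VALID, the response Y, the predictors X1 and X2, and the three coefficient values are hypothetical placeholders, not values from this thread. The first step computes the linear predictor from hard-coded coefficients and converts it to a probability with logistic(). The second step passes that probability to PROC LOGISTIC through PRED=, with NOFIT so that no new model is fitted.

```sas
/* Sketch only: VALID, Y, X1, X2 and the coefficients are hypothetical */

/* 1. Score the external data from the saved coefficients */
data valid_scored;
   set valid;
   xbeta = -1.25 + 0.40*x1 + 0.03*x2;   /* placeholder median estimates */
   p_hat = logistic(xbeta);             /* same as 1/(1 + exp(-xbeta))  */
run;

/* 2. ROC analysis of the supplied probabilities; NOFIT skips estimation */
proc logistic data=valid_scored;
   model y(event='1') = / nofit;
   roc 'Bagged model' pred=p_hat;
run;
```

The ROC statement label ('Bagged model') is optional and only names the curve in the output.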

pau13rown · Posted 06-22-2018 06:39 PM

incidentally, that is what i meant by 'hard coding', i.e. in that example they simply write out the coefficients. It would be better to define macro variables to minimise the possibility of misspecifying the model i guess. For example, the following type of thing is not uncommon:

proc univariate data=....;
   var x;
   output out=m1 mean=mean;
run;

data m2;
   set m1;
   /* symputx strips blanks and avoids the numeric-to-character conversion note */
   call symputx('mean1', mean);
run;

proc nlmixed data=....;
:

:

estimate 'Treatment A' exp(mu + &mean1.*b1 + 0.5*b2 + b3 + b4);
estimate 'Treatment B' exp(mu + &mean1.*b1 + 0.5*b2 + b4);
run;
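
The same macro-variable pattern could be applied to the bootstrap medians of the logistic coefficients. The sketch below is an illustration only: it assumes a hypothetical data set BAGGED with one row per model term (columns VARIABLE and ESTIMATE), a hypothetical external data set VALID, and continuous predictors named X1 and X2. With CLASS predictors there are several rows per variable, so the macro variable names would need to include the level as well.

```sas
/* Sketch only: BAGGED, VALID, X1 and X2 are hypothetical names */

/* Create one macro variable per term: &b_Intercept, &b_x1, &b_x2 */
data _null_;
   set bagged;
   call symputx(cats('b_', variable), estimate);
run;

/* Use the macro variables instead of typing the coefficients by hand */
data valid_scored;
   set valid;
   p_hat = logistic(&b_Intercept + &b_x1*x1 + &b_x2*x2);
run;
```

This keeps the scoring code in step with whatever the bootstrap produced, so rerunning the bootstrap does not require editing the coefficients manually.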
