Understanding bootstrapping macro for logistic regression validation

bkq32 · Posted 11-10-2020 10:26 PM

I'm trying to figure out what happens during each step of this macro that runs a logistic regression on 10 bootstrapped samples. How are the tables "bval1" and "bval2" different? The tables have the same number of records, but I'm not sure what part of the code makes them different. I'm hoping once I know, I can figure out what the difference is between AUC1, AUC2, and AUC3.

*Make sample dataset;
data bweight ( drop = weight visit momedlevel );
 format subjectid heavy;
 set sashelp.bweight;
 subjectid = _N_;
 if weight > 4500 then heavy = 1;
  else heavy = 0;
run;



******************************************************************;
/*  BVAL macro                                                   */
/*  Author: Mithat Gonen                                         */
/*                                                               */
/*                                                               */
/*  Performs bootstrap validation                                */
/*                                                               */
/*  INPUTS                                                       */
/*                                                               */
/*  dsn:    data set name                                        */
/*  outcome:dependent (response) variable                        */
/*  covars: list of independent variables separated by blanks    */
/*  B:      Number of bootstrap samples                          */
/*  sel:    Selection method (not implemented in this version)   */
/*                                                               */
******************************************************************;
%macro bval(dsn=,outcome=,covars=,B=10);
proc sql noprint;
  select n(&outcome) into:_n from &dsn;
quit;
proc surveyselect data=&dsn method=urs outhits rep=&B n=&_n out=bsamples noprint;
run;
%do i=1 %to &B;
  proc logistic data=bsamples(where=(replicate=&i)) outmodel=_mod&i noprint;
    model &outcome=&covars;
  run;
  proc printto file='junk.txt'; run;
  proc logistic inmodel=_mod&i;
    score data=&dsn out=out1&i;
  run;
  proc logistic inmodel=_mod&i;
    score data=bsamples(where=(replicate=&i)) out=out2&i;
  run;
  proc printto;run;  
%end;
  data bval1;
    set %do j=1 %to &B;out1&j(in=in&j) %end;;
    %do j=1 %to &B; if in&j then bsamp=&j; %end;
  run;
  data bval2;
    set %do j=1 %to &B;out2&j(in=in&j) %end;;
    %do j=1 %to &B; if in&j then bsamp=&j; %end;
  run;


proc printto file='junk.txt' new; run;
proc logistic data=bval1;
  by bsamp;
  model &outcome=p_1;
  ods output association=assoc1;
run;

proc logistic data=bval2;
  by bsamp;
  model &outcome=p_1;
  ods output association=assoc2;
run;
proc logistic data=&dsn;
  model &outcome=&covars;
  ods output association=assoc3;
run;
proc printto; run;

data assoc3;
  set assoc3;
  bsamp=1;
run;

data optim;
  merge assoc1(where=(label2='c') keep=bsamp label2 nvalue2 rename=(nvalue2=auc1))
        assoc2(where=(label2='c') keep=bsamp label2 nvalue2 rename=(nvalue2=auc2))
        assoc3(where=(label2='c') keep=bsamp label2 nvalue2 rename=(nvalue2=auc3));
  by bsamp;
run;

proc sql;
  select mean(auc3) as OptimisticAUC, mean(auc2-auc1) as OptimismCorrection,
         mean(auc3)-mean(auc2-auc1) as CorrectedAUC from optim;
quit;
%mend;
        
%bval(dsn=bweight,outcome=heavy,covars=black married boy momage cigsperday,B=10);

PaigeMiller · Posted 11-11-2020 06:16 AM

@bkq32 wrote:

I'm trying to figure out what happens during each step of this macro that runs a logistic regression on 10 bootstrapped samples. How are the tables "bval1" and "bval2" different? The tables have the same number of records, but I'm not sure what part of the code makes them different. I'm hoping once I know, I can figure out what the difference is between AUC1, AUC2, and AUC3.

May I suggest you contact the author for questions about the macro?

BVAL1 seems to be the predictions from the i-th bootstrap regression applied to all observations in the original data set, with the results for all B bootstrap samples stacked together. BVAL2 seems to be the predictions from the i-th bootstrap regression applied only to the observations in the i-th bootstrap sample, again with all B samples stacked together.
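
One way to see this difference, even though the two tables have the same number of records, is to count distinct subjects within each bootstrap sample. The query below is only a sketch: it assumes the macro has already run in the current session, that BVAL1 and BVAL2 still carry SUBJECTID from the sample data set, and the column names SOURCE, N_ROWS, and N_SUBJECTS are made up for illustration.

```sas
/* Sketch only: assumes %bval has already run in this session, so       */
/* BVAL1 and BVAL2 exist and contain SUBJECTID and BSAMP.               */
/* SOURCE, N_ROWS, and N_SUBJECTS are illustrative column names.        */
proc sql;
  select 'bval1' as source, bsamp,
         count(*) as n_rows,
         count(distinct subjectid) as n_subjects
    from bval1
    group by bsamp
  union all
  select 'bval2' as source, bsamp,
         count(*) as n_rows,
         count(distinct subjectid) as n_subjects
    from bval2
    group by bsamp;
quit;
```

If the reading above is right, N_SUBJECTS should equal N_ROWS in every BVAL1 group, because the original data are scored once per replicate. In BVAL2 it should be noticeably smaller, because METHOD=URS samples with replacement and OUTHITS writes one row per hit; for a large data set, roughly 63% of the original subjects are expected to appear in any one bootstrap sample.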

--
Paige Miller

Ksharp · Posted 11-11-2020 07:07 AM

Calling @Rick_SAS

Rick_SAS · Posted 11-11-2020 08:28 AM

To restate what Paige said:

BVAL1 is the result of scoring the original data by using the parameter estimates from each bootstrap sample.

BVAL2 is the result of scoring each bootstrap sample by using the parameter estimates from that bootstrap sample. This is the "best possible" AUC because you are scoring the same data you used to fit the model.
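
To look at how these pieces feed the macro's final numbers, the per-sample values can be printed from the OPTIM table that the macro leaves behind. This is a minimal sketch, assuming the macro has already run; OPTIM_DETAIL and OPTIMISM are hypothetical names that are not part of the macro.

```sas
/* Sketch only: assumes %bval has already run, so OPTIM exists with one */
/* row per bootstrap sample. OPTIM_DETAIL and OPTIMISM are hypothetical */
/* names introduced here for illustration.                              */
data optim_detail;
  set optim;
  /* AUC2: bootstrap model scored on its own bootstrap sample.          */
  /* AUC1: the same bootstrap model scored on the original data.        */
  optimism = auc2 - auc1;
run;

proc print data=optim_detail noobs;
  var bsamp auc1 auc2 auc3 optimism;
run;
```

AUC3 should be nonmissing only on the row where bsamp=1, because the macro assigns bsamp=1 to ASSOC3 before the merge. The macro's final PROC SQL step averages auc2-auc1 over the bootstrap samples and subtracts that average from AUC3, the AUC of the model fit to the full original data.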

bkq32 · Posted 11-11-2020 04:15 PM

Thank you, everyone - that makes sense. Do you know why the OptimismCorrection is needed? Why not just report mean(AUC1)?

Also, if I'm modeling the probability of the event, should I modify the macro such that every PROC LOGISTIC DATA= statement has the descending option?

I can also contact the author like Paige suggested if that's easier.
