Solved: Re: Output logistic regression bootstrap

antor82 · Posted 06-30-2019 03:12 AM

Dear All

I thank You in advance for Your kind support.

I'm running a bootstrap after a logistic regression and I would like to print the results (SAS/STAT 15.1).

This is the code:

/* 2. Generate many bootstrap samples */
proc surveyselect data=dbsname NOPRINT seed=123456
out=Bootout
method=urs 
samprate=1
reps=1000;
run;

%macro ODSOff(); /* Call prior to BY-group processing */
ods graphics off;
ods exclude all;
ods noresults;
%mend;
 
%macro ODSOn(); /* Call after BY-group processing */
ods graphics on;
ods exclude none;
ods results;
%mend;

%ODSOff
PROC LOGISTIC data=Bootout;
    BY Replicate; 
	CLASS Female (param=ref ref='No') ChronicLungDisease (param=ref ref='No');
	MODEL Out2InHospitalOr30DayDeath(event='1')=Female ChronicLungDisease / 
		SELECTION=Backward clodds=pl gof;
	Title 'Logistic Model InHosp or 30d - ONLY preop';
	format Female ChronicLungDisease yn.;
	ods output CLoddsPL=CL_boot_Mort_mod_1;
run;
%ODSon

proc univariate data=cl_boot_mort_mod_1 noprint;
   class Effect;
   var  OddsRatioEst;
   output out=WidePctls1 pctlpre=P_ pctlpts=2.5 97.5 mean=Mean Std=Std; 
run; 


proc print data=WidePctls1 noobs label;
   format Mean Std P_2_5 P_97_5 6.4;
   label Mean="BootMean" Std="BootStdErr" P_2_5="95% Lower CL" P_97_5="95% Upper CL";
run;

I wonder why I get these results

[Screenshot from DBS CL_Boot_Mort_Mod_1]
[Screenshot from Output Data WidePctls1]
[Screenshot from Results]

It seems as if "Female Yes vs No" has been split into two different categories (Female yes vs No and Female Yes vs No).

This also happens in other models with more independent variables.

This does not happen in the baseline PROC LOGISTIC without bootstrapping.

I sincerely thank You again for Your kind and precious support

Sincerely

Antonio

FreelanceReinh · Posted 06-30-2019 01:38 PM

Hello @antor82 and welcome to the SAS Support Communities!

It looks like variable Female has more distinct values than expected. So, my first check would be:

proc freq data=dbsname;
format female hex16.;
tables female;
run;

Please post the output of the above step.

Since you're using formatted values of this variable, we should take a look at the definition of format YN. Can you show the SAS code which created that format or, if the code is not readily available, the output of the step below?

proc format lib=work fmtlib; /* Please replace "work" by the appropriate libref */
select yn;            /* (or libref.catalogname) if YN. is not in WORK.FORMATS. */
run;

Also, are you sure you don't need the OUTHITS option in your PROC SURVEYSELECT step? This is unrelated to the problem you've reported, but has the potential to invalidate your results.

[Edit: Included link to the documentation of OUTHITS.]

antor82 · Posted 07-01-2019 07:45 AM

Hi FreelanceReinhard

Thank You for Your comment.

This is the output from two different proc surveyselect

It looks like variable Female has more distinct values than expected. So, my first check would be:
proc freq data=dbsname;
format female hex16.;
tables female;
run;
Please post the output of the above step.

Since you're using formatted values of this variable, we should take a look at the definition of format YN. Can you show the SAS code which created that format or, if the code is not readily available, the output of the step below?
proc format lib=work fmtlib; /* Please replace "work" by the appropriate libref */
select yn;            /* (or libref.catalogname) if YN. is not in WORK.FORMATS. */
run;

Female is defined 1=yes and 0=no (this format is used also for other binary variables).

This is the output of the requested PROC FORMAT step.

Thanks again

A

FreelanceReinh · Posted 07-01-2019 09:27 AM

Thanks for providing the requested outputs. There's nothing wrong with them. So, we've ruled out data and format issues.

Hence, it seems that variable Effect in ODS table CLoddsPL shows inconsistencies, but the reason is unclear. I wasn't able to replicate this behavior with SAS/STAT 14.3 (using a different input dataset, of course). I tend to believe that this is a bug (not the first bug I've seen in ODS output datasets), but luckily there's an easy workaround: Most likely the additional blanks between "Female" and "Yes vs No" in some of the Effect values are ordinary space characters, which can be removed with the COMPBL function:

data CL_boot_Mort_mod_1a;
set CL_boot_Mort_mod_1;
effect=compbl(effect);
run;

proc freq data=CL_boot_Mort_mod_1a;
tables effect;
run;

The PROC FREQ step with the revised dataset should show one category "Female Yes vs No" rather than two (and the unchanged category involving ChronicLungDisease). Otherwise you'd need to display the Effect values in $HEXw. format to find out what kind of blanks have been inserted (and use the COMPRESS function with appropriate arguments instead of COMPBL to remove them).
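In case COMPBL alone does not merge the two categories, the hex check could look roughly like the sketch below. The width in $HEX64. (enough for the first 32 characters) and the byte A0 are assumptions for illustration only; the actual offending byte has to be read from the hex output, and 'A0'x is a non-breaking space only in single-byte Latin encodings.

```sas
/* Show the raw bytes of each distinct Effect value.                  */
/* $HEX64. displays the first 32 characters (width is an assumption). */
/* An ordinary blank shows up as 20.                                   */
proc freq data=CL_boot_Mort_mod_1;
   format effect $hex64.;
   tables effect;
run;

/* Hypothetical case: the hex values contain A0 in addition to 20.    */
/* COMPRESS removes the A0 bytes, COMPBL then collapses the blanks.    */
data CL_boot_Mort_mod_1a;
   set CL_boot_Mort_mod_1;
   effect = compbl(compress(effect, 'A0'x));
run;
```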

Then PROC UNIVARIATE, based on CL_boot_Mort_mod_1a, will use the consolidated CLASS level as well and the problem is solved.

Again, I think the OUTHITS option in PROC SURVEYSELECT is mandatory in your case to obtain valid bootstrap samples (i.e. with replacement) because you don't use variable NumberHits (of dataset Bootout) in the subsequent steps.
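A sketch of the two usual ways to handle this, using the dataset and variable names from the original post. Either request one output row per selection with OUTHITS, or keep the compact output and pass NumberHits to PROC LOGISTIC in a FREQ statement. Only one of the two should be used.

```sas
/* Option A: OUTHITS writes one row per selection, so a unit drawn */
/* k times appears k times in Bootout.                             */
proc surveyselect data=dbsname noprint seed=123456
   out=Bootout
   method=urs
   samprate=1
   outhits
   reps=1000;
run;

/* Option B: keep the SURVEYSELECT step without OUTHITS and let    */
/* PROC LOGISTIC count each row NumberHits times.                  */
proc logistic data=Bootout;
   by Replicate;
   freq NumberHits;
   class Female (param=ref ref='No') ChronicLungDisease (param=ref ref='No');
   model Out2InHospitalOr30DayDeath(event='1') = Female ChronicLungDisease /
         selection=backward clodds=pl gof;
   format Female ChronicLungDisease yn.;
   ods output CLoddsPL=CL_boot_Mort_mod_1;
run;
```

Without either change, each selected unit enters a replicate only once, so the replicates are not true with-replacement samples.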

antor82 · Posted 07-01-2019 11:55 AM

Thank You FreelanceReinhard for Your support!

Sincerely

Antonio

FreelanceReinh · Posted 07-01-2019 12:32 PM

You're welcome. I had one more idea while I wasn't able to access the SAS website for a while:

You may want to make sure that the unexpected discrepancies between Effect values did not occur within a replicate. (This is unlikely, but it would possibly indicate a more serious issue.) This would lead to duplicate Replicate-Effect combinations in the revised dataset CL_boot_Mort_mod_1a. So, if the PROC SQL step below created a non-empty dataset MYST, we should be alarmed.

proc sql;
create table myst as
select * from CL_boot_Mort_mod_1a
group by replicate, effect
having count(*)>1;
quit;

But most likely it will result in:

NOTE: Table WORK.MYST created, with 0 rows and 6 columns.

antor82 · Posted 07-01-2019 02:46 PM

So it did

antor82 · Posted 07-01-2019 03:07 PM

In my analysis, I've run three different logistic regression models (1: baseline variables only; 2: baseline + procedure-related variables; 3: baseline + procedure-related variables + postoperative complications) and then did a bootstrap resampling.

How is it possible to have such results? (I'm posting only some examples)...

Model 2

Female Yes vs No OR 5.1 95%CI 2.6-12.2 (OR similar to Model 1)

Procedure OR 49085.6 95%CI 6.1-75.4 (such a big OR? An OR greater than the upper 95% CL?)

Model 3

Female Yes vs No OR 10.1 95%CI 3.1-40.5 (OR so far from Model 2?)

FreelanceReinh · Posted 07-02-2019 04:48 AM

My first step would always be univariate logistic regressions (or, in the case of categorical predictors, contingency table analyses) to select candidate variables for a multivariable model.

Adjusting for other variables can change the odds ratio for a predictor considerably.

The extremely large OR requires further investigation (see also suspicious log messages, e.g., "quasi-complete separation of data points"). I think with clodds=wald the point estimate would always be within the confidence limits. If Procedure is a continuous variable, the OR depends on the measurement unit (cf. UNITS statement). I'd take a look at the joint distribution of this and the dependent variable.
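A hedged sketch of two such checks. It assumes that Procedure is the actual variable name (it is only known here from the results listing) and that the Model 2 bootstrap estimates are stored in a dataset called CL_boot_Mort_mod_2a, which is a hypothetical name. The first step tabulates the predictor against the outcome; empty or nearly empty cells would point to separation. The second step adds the median and maximum to the bootstrap summary: if the reported OR is the mean of OddsRatioEst over the replicates, as in the PROC UNIVARIATE step earlier in the thread, a few replicates with huge estimates can push that mean above the 97.5th percentile.

```sas
/* Joint distribution of predictor and outcome (categorical case). */
/* "Procedure" is an assumed variable name.                        */
proc freq data=dbsname;
   tables Procedure*Out2InHospitalOr30DayDeath / missing;
run;

/* Compare mean, median and maximum of the bootstrap odds ratios.  */
/* CL_boot_Mort_mod_2a and BootCheck are hypothetical names.       */
proc univariate data=CL_boot_Mort_mod_2a noprint;
   class Effect;
   var OddsRatioEst;
   output out=BootCheck n=N mean=Mean median=Median max=Max
          pctlpre=P_ pctlpts=2.5 97.5;
run;

proc print data=BootCheck noobs;
run;
```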

antor82 · Posted 07-02-2019 07:41 AM

"My first step would always be univariate logistic regressions (or, in the case of categorical predictors, contingency table analyses) to select candidate variables for a multivariable model."

Already done. Only significantly associated variables have been included in the models.

"quasi-complete separation of data points"

Yes, it happens. I'm trying to solve this with penalised regression models (FIRTH option in the MODEL statement).

However, some variables have OR <0.0001 or >999.999 (less frequently with the FIRTH option, but still present). This greatly influences my models.

I guess I had better redefine the variables included in the models to avoid separation.
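For reference, a minimal sketch of the penalised fit mentioned above. It assumes the baseline model from the start of the thread and leaves out SELECTION=Backward and GOF for brevity; it is not necessarily the exact call used in the analysis.

```sas
/* Firth's penalised likelihood; with CLODDS=PL the confidence    */
/* limits are based on the penalised profile likelihood.          */
proc logistic data=dbsname;
   class Female (param=ref ref='No') ChronicLungDisease (param=ref ref='No');
   model Out2InHospitalOr30DayDeath(event='1') = Female ChronicLungDisease /
         firth clodds=pl;
   format Female ChronicLungDisease yn.;
run;
```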
