mconover
Quartz | Level 8

Hello, and thank you for reviewing my question. I am trying to use SAS to conduct a g-computation analysis estimating the effect of statin initiation (exposure variable: StatinInitiator) on all-cause mortality (outcome variable: status_DEATH). I am not permitted to share any data, but to describe it: this is patient-level data in which each row is a distinct patient-level observation containing information on exposure, outcome, and baseline covariates. I am trying to figure out the correct models and SAS procedures for modeling my outcome and would appreciate any help I can get. First, some background: as part of the g-computation method, we model the outcome as a function of each subject's exposure and covariates, which are observed for all subjects. We then use the fitted model to estimate each subject's outcome probability under both exposure levels.
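For readers less familiar with the method, the standardization step can be written compactly. This is a generic sketch; the notation is introduced here for illustration and does not come from the analysis itself. It assumes $\hat{m}(a, L_i)$ is the fitted outcome model evaluated at exposure level $a$ and subject $i$'s covariates $L_i$, and that $w_i$ is an optional observation weight (set $w_i = 1$ if none is used).

```latex
\hat{R}_a = \frac{\sum_{i=1}^{n} w_i \, \hat{m}(a, L_i)}{\sum_{i=1}^{n} w_i},
\qquad a \in \{0, 1\}

\widehat{RD} = \hat{R}_1 - \hat{R}_0,
\qquad
\widehat{RR} = \frac{\hat{R}_1}{\hat{R}_0}
```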

 

I have been able to make this work by fitting GEE models with PROC GENMOD (using the REPEATED SUBJECT= statement, because subjects in my study can appear in both exposure groups). However, I ran into quasi-complete separation and non-convergence, which led me to change DIST=BINOMIAL to DIST=POISSON. Because the Poisson model allows predicted probabilities >1, a mentor recently advised me to switch back to DIST=BINOMIAL and to use Firth's bias correction to address the persistent non-convergence. My understanding is that the only SAS procedure that can implement Firth's bias correction is PROC LOGISTIC (the FIRTH option in the MODEL statement), so I am now unclear how to account for the correlated observations, since PROC LOGISTIC has no REPEATED SUBJECT= statement. Can anyone provide guidance on which SAS procedure I can use to fit a logistic model with Firth's bias correction that properly accounts for the correlated observations?

 

Below is my PROC GENMOD code. Please let me know if I can clarify anything above or answer any questions that would make my question clearer. Note that I am also trying to use this method to calculate risk ratios and risk differences, so the code includes some macro language that toggles the settings needed for each estimate.

 

Thank you for any guidance you can provide.  In case it is relevant, I am using SAS version 9.4.

 

CODE:

ods listing exclude all;
ods output
GEEEmpPEst = paramDS;
proc genmod data= input_dataset descending;
weight weight_var;
class bene_id
         AGECAT (PARAM=REF REF="2")
         GENDER (PARAM=REF REF="0")
         YEAR (PARAM=REF REF="2011")

         RACE (PARAM=REF REF="1")
         OUTPTVISIT_1yr_cat (PARAM=REF REF="5")
         SNF_1yr_cat (PARAM=REF REF="0")
         HS_1yr_cat (PARAM=REF REF="0")
         UniqueDrugs_1yr_cat (PARAM=REF REF="5")
         ldl_1yr (PARAM=REF REF="<100")
         sbp_1yr (PARAM=REF REF="<130")
         dbp_1yr (PARAM=REF REF="<80");

MODEL status_DEATH = StatinInitiator

             AGECAT
             YEAR RACE
             OUTPTVISIT_1yr_cat
             SNF_1yr_cat
             HS_1yr_cat
             UniqueDrugs_1yr_cat
             /*Continuous variables*/
             AGEyrs Age_sq
             SNF_1yr
             UniqueDrugs_1yr
             HS_1yr
             OUTPTVISIT_1yr
             /*Binary variables*/
             AFIB_1yr
             AMBLIFESUPPORT_1yr
             ANEMIA_1yr
             ANGIOGRAPHY_1yr
             ARB_1yr
             ASTHMA_1yr
             CANCERSCREEN_1yr
             CKD_1yr
             COLONOSCOPY_1yr
             COPD_1yr
             DEMENTIA_1yr
             DIURETICS_1yr
             ECHOCARDIOGRAPH_1yr
             FECALOCCULT_1yr
             GENDER
             HOMEOXYGEN_1yr
             HSCRP_1yr
             HYPERLIPIDEMIA_1yr
             INCL_ENDARTERECTOMY
             INCL_STROKE
             INFLAMBOWEL_1yr
             INSULIN_1yr
             LIPIDPANEL_1yr
             OBESITY_1yr
             OSTEOARTHRITIS_1yr
             PARALYSIS_1yr
             PCD_1yr
             PSYCHIATRIC_1yr
             PVD_1yr
             SEPSIS_1yr
             SMOKING_1yr
             STRESSTEST_1yr
             SUBABUSE_1yr
             SULFONYLUREA_1yr
             THIAZIDE_1yr
             VERTIGO_1yr
             VTE_1yr
             WEAKNESS_1yr
             WHEELCHAIR_1yr
             / link= %IF &measure=RR %THEN logit; %ELSE %IF &measure=RD %THEN identity;
             dist= poisson maxiter=250;
repeated subject=bene_id / type=ind;
output out=out_data(keep=StatinInitiator bene_id gender age_bin probability status_DEATH weight)
       prob=probability;
run;
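The "predict under both exposures" step described in the background could be sketched roughly as follows. This is only an illustration under several assumptions: StatinInitiator is numeric 0/1, the covariate list is shortened for readability, the STORE statement and PROC PLM scoring work for this GEE fit as they do for an ordinary GENMOD fit, and the names fit_store, cf_data, cf_pred, observed_exposure, and gcomp are hypothetical.

```sas
/* Sketch only: binomial/logit fit with a shortened, illustrative covariate list */
proc genmod data=input_dataset descending;
   weight weight_var;
   class bene_id AGECAT (param=ref ref="2") GENDER (param=ref ref="0");
   model status_DEATH = StatinInitiator AGECAT GENDER AGEyrs Age_sq
         / dist=binomial link=logit;
   repeated subject=bene_id / type=ind;
   store fit_store;                 /* hypothetical item-store name */
run;

/* Two copies of every row: one forced to exposed, one forced to unexposed */
data cf_data;
   set input_dataset;
   observed_exposure = StatinInitiator;   /* keep the observed value for reference */
   do StatinInitiator = 1, 0;
      output;
   end;
run;

/* Predicted probability for each counterfactual row (ILINK returns the probability scale) */
proc plm restore=fit_store;
   score data=cf_data out=cf_pred predicted=probability / ilink;
run;

/* Weighted mean risk under each exposure, then risk difference and risk ratio */
proc sql;
   create table gcomp as
   select r1, r0, r1 - r0 as risk_difference, r1 / r0 as risk_ratio
   from (select sum(probability*weight_var*(StatinInitiator=1))
                   / sum(weight_var*(StatinInitiator=1)) as r1,
                sum(probability*weight_var*(StatinInitiator=0))
                   / sum(weight_var*(StatinInitiator=0)) as r0
         from cf_pred);
quit;
```

The point estimates in gcomp carry no standard errors; with correlated rows, confidence intervals would typically come from something like a bootstrap that resamples subjects (bene_id) rather than rows.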

StatDave
SAS Super FREQ

Since your response is binary, the binomial distribution is the one most appropriate. The separation problems are undoubtedly occurring because the data is being made too sparse by the very large number of predictors (and therefore parameters) in your model. Before getting to estimating the risk difference or relative risk, you should try to find a much simpler model that fits adequately well. You probably have an idea of the variables most likely to be important in predicting the response. Start with a model with just the few most important variables and add more to your GEE model as can be supported until you have a model that fits well. If you want, you could use a model selection method in LOGISTIC or HPGENSELECT after selecting a set of observations that are independent (only one from each subject). Once you have a GEE model that fits and doesn't cause separation, then you can use the NLMeans macro for estimating the relative risk or the risk difference.

mconover
Quartz | Level 8

Thanks for the response StatDave_sas and for considering my question. I just wanted to clarify a few points.

 

I realize that I have a lot of model predictors but I should note that I have quite a lot of data with a substantial number of outcomes.  I do understand that a model with fewer predictors will be more likely to converge and that is certainly a solution I plan to explore further. However, the person who was advising me seemed to think Firth's bias correction may resolve the problem before it was necessary to start eliminating predictors from the model. Perhaps they were mistaken in that understanding? Furthermore, my study design requires me to select more than one observation from each subject (i.e. one observation per subject per exposure level) for reasons I won't get into here. Thus, I'm not sure I feel comfortable selecting only one observation per subject or proceeding with a model that doesn't somehow account for this correlation when estimating the variance.  

 

However, so far I haven't found any solution that will allow me to implement Firth's bias correction (which I believe can only be implemented as an option in the PROC LOGISTIC MODEL statement) while also accounting for the correlated observations (since PROC LOGISTIC doesn't have a REPEATED SUBJECT= option). -- I may be misunderstanding this, so if anyone else has any ideas or recommendations, let me know. --  In the absence of such a solution, I will try your recommended approaches for reducing the number of model predictors.  Thanks again for your helpful advice!

StatDave
SAS Super FREQ

Firth's method involves applying a penalty to the likelihood. Since GEE is not a likelihood-based method, Firth's method is not possible.
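In symbols (notation introduced here for illustration), Firth's method maximizes a penalized log-likelihood, where $\ell(\beta)$ is the ordinary log-likelihood and $I(\beta)$ is the Fisher information matrix. GEE instead solves a set of estimating equations, with $\mu_i(\beta)$ the mean vector for subject $i$, $D_i = \partial \mu_i / \partial \beta$, and $V_i$ the working covariance matrix. There is no $\ell(\beta)$ in the second expression to which the penalty could be attached.

```latex
\text{Firth:}\quad
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2} \log \det I(\beta)

\text{GEE:}\quad
\sum_{i=1}^{K} D_i^{\top} V_i^{-1} \bigl( y_i - \mu_i(\beta) \bigr) = 0
```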

 

Even with a lot of data, sparseness can easily occur when no responses of one type appear in one particular cross-classification of all of the predictors. 
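A quick, rough screen for this kind of sparseness is to cross-tabulate some of the rarer binary predictors against the outcome and look for zero cells. The sketch below uses variable names from the posted model; which predictors to check is an assumption made here for illustration. A clean table does not rule out empty cells in the joint cross-classification of all predictors.

```sas
/* Sketch: two-way screen for empty cells, one table per listed predictor */
proc freq data=input_dataset;
   tables (SEPSIS_1yr PARALYSIS_1yr WHEELCHAIR_1yr HOMEOXYGEN_1yr)*status_DEATH
          / norow nocol nopercent;
run;
```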

 

The idea of using one observation per subject was just a way to use a model selection process in PROC LOGISTIC or PROC HPGENSELECT to discover which predictors might be the most important ones. With that info you could fit the GEE model using the relatively few important predictors.
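A minimal sketch of that screening step might look like the following. The dataset names sorted_input and one_per_subject, the seed, the entry and stay thresholds, and the shortened covariate list are all illustrative assumptions, not a recommended specification.

```sas
/* PROC SURVEYSELECT needs the data sorted by the STRATA variable */
proc sort data=input_dataset out=sorted_input;
   by bene_id;
run;

/* Keep one randomly chosen row per subject so the rows are independent */
proc surveyselect data=sorted_input out=one_per_subject
                  method=srs sampsize=1 seed=12345 noprint;
   strata bene_id;
run;

/* Stepwise screen; INCLUDE=1 keeps the first listed effect (the exposure) in every model */
proc logistic data=one_per_subject descending;
   class AGECAT (param=ref ref="2") GENDER (param=ref ref="0");
   model status_DEATH = StatinInitiator AGECAT GENDER AGEyrs Age_sq
                        CKD_1yr COPD_1yr DEMENTIA_1yr SMOKING_1yr
         / selection=stepwise include=1 slentry=0.05 slstay=0.05;
run;
```

The predictors that survive the screen would then go into the GEE model in PROC GENMOD, fitted on the full data with the REPEATED statement.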
