Risk ratios & risk differences in correlated data w/ logistic model and Firth's correction?

mconover · Posted 07-08-2019 12:09 PM

Hello, and thank you for reviewing my question. I am trying to use SAS to conduct a g-computation analysis estimating the effect of statin initiation (exposure variable: StatinInitiator) on all-cause mortality (outcome variable: status_DEATH). I am not permitted to share any data, but to describe it: this is patient-level data where each row refers to a distinct patient-level observation containing information on exposure, outcome, and baseline covariates. I am trying to figure out the correct models and SAS procedures to use for modeling my outcome and would appreciate any help I can get.

First, some background. As part of the g-computation method, we model the outcome as a function of each subject's exposure and covariates, which we have observed for all subjects. We then use these models to estimate each subject's outcome probability under both exposure levels.
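To make the two g-computation steps concrete, here is a minimal sketch of one common way to set them up in SAS. It is an illustration under assumptions, not the code used in this analysis: the dataset names `stacked`, `pred`, and `effect` and the variable `copy` are hypothetical, StatinInitiator is assumed to be numeric 0/1, and the averages are unweighted (weight_var would need to be carried through if the weights should apply).

```sas
/* Sketch only: 'stacked', 'pred', 'effect', and 'copy' are hypothetical names. */

/* 1. Stack the observed rows with two counterfactual copies whose outcome is
      set to missing, so the copies do not contribute to the model fit. */
data stacked;
  length copy $9;
  set input_dataset (in=is_obs)
      input_dataset (in=is_trt)
      input_dataset;
  if is_obs then copy = 'observed';
  else if is_trt then do;
    copy = 'treated';
    StatinInitiator = 1;   /* everyone set to initiator */
    status_DEATH = .;
  end;
  else do;
    copy = 'untreated';
    StatinInitiator = 0;   /* everyone set to non-initiator */
    status_DEATH = .;
  end;
run;

/* 2. Fit the outcome model on 'stacked' and write the predicted probability
      to a dataset 'pred' that keeps 'copy' and 'probability'. */

/* 3. Average the predictions within each counterfactual copy and contrast them. */
proc sql;
  create table effect as
  select mean(case when copy = 'treated'   then probability else . end) as risk_treated,
         mean(case when copy = 'untreated' then probability else . end) as risk_untreated,
         calculated risk_treated - calculated risk_untreated as RD,
         calculated risk_treated / calculated risk_untreated as RR
  from pred;
quit;
```

Step 2 relies on the procedure producing predicted values for rows with a missing response and complete covariates, which is worth confirming in the output before trusting the averages. Confidence intervals for RD and RR from this kind of standardization are usually obtained by resampling subjects (for example, a bootstrap over bene_id) rather than from the model output directly.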

I have been able to make this work by fitting GEE models using PROC GENMOD with the REPEATED SUBJECT= statement, which I need because subjects in my study can appear in both exposure groups. However, I ran into non-convergence caused by quasi-complete separation, which led me to change DIST=BINOMIAL to DIST=POISSON. Since the Poisson model allows predicted probabilities greater than 1, a mentor recently advised me to switch back to DIST=BINOMIAL and, to address the persistent non-convergence, to use Firth's bias correction. My understanding is that the only SAS procedure that implements Firth's bias correction is PROC LOGISTIC (FIRTH option in the MODEL statement), so I am now unclear how to account for the correlated observations, since PROC LOGISTIC has no REPEATED statement. Can anyone provide guidance on which SAS procedure I can use to fit a logistic model with Firth's bias correction that properly accounts for the correlated observations?

Below I have included my PROC GENMOD code. Please let me know if I can clarify anything above or answer any questions that would make my question clearer. Please note that I am also trying to use this method to calculate risk ratios and risk differences, so I have some macro language in my PROC GENMOD code that toggles the settings necessary for each estimate.

Thank you for any guidance you can provide. In case it is relevant, I am using SAS version 9.4.

CODE:

ods listing exclude all;
ods output
GEEEmpPEst = paramDS;
proc genmod data= input_dataset descending;
weight weight_var;
class bene_id
AGECAT (PARAM=REF REF="2")
GENDER (PARAM=REF REF="0")
YEAR (PARAM=REF REF="2011")

RACE (PARAM=REF REF="1")
OUTPTVISIT_1yr_cat (PARAM=REF REF="5")
SNF_1yr_cat (PARAM=REF REF="0")
HS_1yr_cat (PARAM=REF REF="0")
UniqueDrugs_1yr_cat (PARAM=REF REF="5")
ldl_1yr (PARAM=REF REF="<100")
sbp_1yr (PARAM=REF REF="<130")
dbp_1yr (PARAM=REF REF="<80");

MODEL status_DEATH = StatinInitiator

AGECAT
YEAR RACE
OUTPTVISIT_1yr_cat
SNF_1yr_cat
HS_1yr_cat
UniqueDrugs_1yr_cat
/*Continuous variables*/
AGEyrs Age_sq
SNF_1yr
UniqueDrugs_1yr
HS_1yr
OUTPTVISIT_1yr
/*Binary variables*/
AFIB_1yr
AMBLIFESUPPORT_1yr
ANEMIA_1yr
ANGIOGRAPHY_1yr
ARB_1yr
ASTHMA_1yr
CANCERSCREEN_1yr
CKD_1yr
COLONOSCOPY_1yr
COPD_1yr
DEMENTIA_1yr
DIURETICS_1yr
ECHOCARDIOGRAPH_1yr
FECALOCCULT_1yr
GENDER
HOMEOXYGEN_1yr
HSCRP_1yr
HYPERLIPIDEMIA_1yr
INCL_ENDARTERECTOMY
INCL_STROKE
INFLAMBOWEL_1yr
INSULIN_1yr
LIPIDPANEL_1yr
OBESITY_1yr
OSTEOARTHRITIS_1yr
PARALYSIS_1yr
PCD_1yr
PSYCHIATRIC_1yr
PVD_1yr
SEPSIS_1yr
SMOKING_1yr
STRESSTEST_1yr
SUBABUSE_1yr
SULFONYLUREA_1yr
THIAZIDE_1yr
VERTIGO_1yr
VTE_1yr
WEAKNESS_1yr
WHEELCHAIR_1yr
/ link= %IF &measure=RR %THEN logit; %ELSE %IF &measure=RD %THEN identity;
dist= poisson maxiter=250;
repeated subject=bene_id / type=ind;
/* Save each observation's predicted probability for the g-computation step */
output out=out_data(keep=StatinInitiator bene_id gender age_bin probability status_DEATH weight)
prob=probability;
run;

StatDave · Posted 07-08-2019 06:13 PM

Since your response is binary, the binomial distribution is the one most appropriate. The separation problems are undoubtedly occurring because the data is being made too sparse by the very large number of predictors (and therefore parameters) in your model. Before getting to estimating the risk difference or relative risk, you should try to find a much simpler model that fits adequately well. You probably have an idea of the variables most likely to be important in predicting the response. Start with a model with just the few most important variables and add more to your GEE model as can be supported until you have a model that fits well. If you want, you could use a model selection method in LOGISTIC or HPGENSELECT after selecting a set of observations that are independent (only one from each subject). Once you have a GEE model that fits and doesn't cause separation, then you can use the NLMeans macro for estimating the relative risk or the risk difference.

mconover · Posted 07-09-2019 12:34 PM

Thanks for the response StatDave_sas and for considering my question. I just wanted to clarify a few points.

I realize that I have a lot of model predictors but I should note that I have quite a lot of data with a substantial number of outcomes. I do understand that a model with fewer predictors will be more likely to converge and that is certainly a solution I plan to explore further. However, the person who was advising me seemed to think Firth's bias correction may resolve the problem before it was necessary to start eliminating predictors from the model. Perhaps they were mistaken in that understanding? Furthermore, my study design requires me to select more than one observation from each subject (i.e. one observation per subject per exposure level) for reasons I won't get into here. Thus, I'm not sure I feel comfortable selecting only one observation per subject or proceeding with a model that doesn't somehow account for this correlation when estimating the variance.

So far, I haven't found any solution that lets me implement Firth's bias correction (which I believe is only available as an option in the PROC LOGISTIC MODEL statement) while also accounting for the correlated observations (since PROC LOGISTIC doesn't have a REPEATED SUBJECT= option). I may be misunderstanding this, so if anyone else has ideas or recommendations, please let me know. In the absence of such a solution, I will try your recommended approaches for reducing the number of model predictors. Thanks again for your helpful advice!

StatDave · Posted 07-09-2019 02:22 PM

Firth's method works by applying a penalty to the likelihood. GEE is based on estimating equations rather than a likelihood, so there is no likelihood to penalize and Firth's method cannot be applied to a GEE model.
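For reference, the contrast can be written out. The notation here is introduced for illustration and does not come from the thread: \ell(\beta) is the ordinary log-likelihood, I(\beta) the Fisher information, and D_i, V_i the derivative matrix and working covariance for subject i in the GEE.

```latex
% Firth's penalized log-likelihood (logistic regression case)
\ell^{*}(\beta) \;=\; \ell(\beta) \;+\; \tfrac{1}{2}\,\log\bigl|\,I(\beta)\,\bigr|,
\qquad
I(\beta) = X^{\top} W X,
\quad
W = \operatorname{diag}\bigl\{\pi_i\,(1-\pi_i)\bigr\}

% GEE: parameters solve estimating equations; no \ell(\beta) is specified
\sum_{i} D_i^{\top}\, V_i^{-1}\,\bigl(y_i - \mu_i(\beta)\bigr) \;=\; 0
```

The first expression needs a full likelihood to penalize; the second defines the estimator only through its score-like equations, which is why the FIRTH option has no direct counterpart alongside a REPEATED statement.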

Even with a lot of data, sparseness can easily occur when no responses of one type appear in one particular cross-classification of all of the predictors.
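One rough way to look for this kind of sparseness is to cross-tabulate the outcome against individual predictors and scan for cells with zero or very few counts. The sketch below is a first-pass check only, with a subset of predictors picked arbitrarily for illustration. It examines one predictor at a time, so it will not catch empty cells that only appear in the joint cross-classification of several predictors.

```sas
/* Sketch: outcome-by-predictor tables for a hypothetical subset of predictors.
   Cells with a count of zero (or close to it) are candidates for separation. */
proc freq data=input_dataset;
  tables (AGECAT RACE SNF_1yr_cat HS_1yr_cat ldl_1yr sbp_1yr dbp_1yr
          PARALYSIS_1yr SEPSIS_1yr WHEELCHAIR_1yr) * status_DEATH
         / norow nocol nopercent;
run;
```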

The idea of using one observation per subject was just a way to use a model selection process in PROC LOGISTIC or PROC HPGENSELECT to discover which predictors might be the most important ones. With that info you could fit the GEE model using the relatively few important predictors.
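A minimal sketch of that screening step might look like the following. The dataset names `sorted` and `one_per_subject`, the seed, the shortened predictor list, and the entry/stay thresholds are all placeholders chosen for illustration, and picking one row per subject at random is just one possible way to get independent observations.

```sas
/* Sketch only: names, seed, predictor list, and thresholds are hypothetical. */

/* Sort so that bene_id can be used as a stratum. */
proc sort data=input_dataset out=sorted;
  by bene_id;
run;

/* Randomly keep one row per subject so the screening data are independent. */
proc surveyselect data=sorted out=one_per_subject
                  method=srs sampsize=1 seed=12345;
  strata bene_id;
run;

/* Stepwise screening; INCLUDE=1 keeps the first effect (the exposure) in every model. */
proc logistic data=one_per_subject descending;
  class AGECAT (param=ref ref="2") RACE (param=ref ref="1");
  model status_DEATH = StatinInitiator AGECAT RACE AGEyrs Age_sq
                       AFIB_1yr CKD_1yr COPD_1yr DEMENTIA_1yr SMOKING_1yr
        / selection=stepwise include=1 slentry=0.05 slstay=0.05;
run;
```

The predictors that survive screening would then go into the PROC GENMOD GEE model fit on the full, correlated dataset.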
