khairul
Calcite | Level 5
I have tried the following negative binomial mixed model (I have no idea how to run a ZINB version), and it worked well on a 1% sample (around 700,000 obs), but I got this error when I ran the full sample (70 million obs):
"ERROR: Model is too large to be fit by PROC GLIMMIX in a reasonable amount of time on this system. Consider changing your model."
 
proc glimmix data=nis2.nis_2003_15N02 noitprint noclprint method=quad; /* No. of Procedures (NPR) Model */
class HOSP_NIS HIV(ref=first) YEAR(ref=first) FEMALE(ref=first) RACEcat (ref="white")
PAYER1(ref="Private_") PL_UR4(ref="Large Metro") ZIPINC_QRTL(ref=first) AWEEKEND(ref=first) ELECTIVE(ref=first) HOSP_BEDSIZE(ref=first) HOSP_LOCTEACH (ref=first) HOSP_REGION(ref=first);
model NPR = HIV|Age_c10 YEAR FEMALE RACEcat PAYER1 PL_UR4 ZIPINC_QRTL AWEEKEND ELECTIVE HOSP_BEDSIZE HOSP_LOCTEACH HOSP_REGION / s link=log dist=negbin ALPHA=0.01 obsweight=DISCWT;
random intercept / subject=HOSP_NIS;
run;
 
I have tried some optimization techniques, but none of them worked. I have also tried PROC GENMOD, but found that random effects are not supported with ZINB.
Since this is hospital data (from around 10,000 hospitals) with correlated outcomes, I need to include random effects. It is also survey data, so I need to apply the survey weight (one weight per hospital stay). And the distribution of the dependent variable (NPR) is ZINB. The independent variable HIV is binary, and age (per 10 years) is continuous.
 
Thus, it would be great if anybody could suggest an appropriate method for running a ZINB mixed model when I want survey-weighted estimates.
3 REPLIES
SteveDenham
Jade | Level 19

I am afraid that I have never seen a SAS/STAT proc that addresses all 3 of your concerns (ZINB distribution, random effects, survey weights), but there may be a way if you have a license for SAS/ETS.  The COUNTREG procedure can do zero-inflated count distributions and fixed and random effects, but I am not sure it can do both at once, and the random effect, at least in the details section, looks like it only applies to panel data (measures over time).  So, you might consider the following plan.  See if it makes sense.
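For reference, a minimal sketch of a zero-inflated negative binomial in SAS/ETS PROC COUNTREG (no random effects here; the cut-down regressor list and the ZEROMODEL covariate are illustrative assumptions, not a recommendation from the thread):

```sas
/* Sketch: ZINB in PROC COUNTREG. DIST=ZINB requests the zero-inflated     */
/* negative binomial; ZEROMODEL specifies the logit model for the          */
/* structural-zero process. Covariate choices here are placeholders.       */
proc countreg data=nis2.nis_2003_15N02;
   model NPR = HIV Age_c10 / dist=zinb;
   zeromodel NPR ~ HIV;   /* assumed driver of excess zeros */
run;
```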

 

Your final analysis would be done in GENMOD with a zero-inflated negative binomial distribution (see the documentation on how to do this). GLIMMIX doesn't do a good job on this unless you program the link and deviance functions yourself to include the logit that accounts for the zero inflation probability.
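As a hedged sketch of what that GENMOD specification might look like (DIST=ZINB with a ZEROMODEL statement; the zero-inflation covariate and the cut-down regressor list are illustrative assumptions):

```sas
/* Sketch: ZINB in PROC GENMOD. The ZEROMODEL statement fits a logit      */
/* model for the probability of a structural zero.                        */
proc genmod data=nis2.nis_2003_15N02;
   class HIV(ref=first);
   model NPR = HIV Age_c10 / dist=zinb link=log;
   zeromodel HIV / link=logit;   /* assumed zero-inflation covariate */
   weight DISCWT;
run;
```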

 

To get adequate weights will probably require some programming statements along the way.  To get started on the weights, look through the documentation for PROC SURVEYMEANS.  There is code in there that could generate survey weights (provided you have population numbers and the number of observations in each category of the survey).

 

That leaves the random effect. You say your sample of hospitals is about 10,000. To someone used to small sample size analyses, that seems like a lot. How closely does it approximate the population size?  If it is fairly close (and 'fairly close' is poorly defined) to the population size, you won't hurt the analysis much by not considering the random effect, and just fitting HOSP_NIS as a fixed effect.

 

Now you are in a position to use GENMOD.  If there is still a problem with fitting along the "model too large" lines, consider which of the variables are "noise absorbers" and which are not. If they are not, then delete them from the MODEL statement. I will grant that you may not know this before the analysis. A good way to find out might be to look at box plots for the levels of the candidate variables.  If they are relatively uniform with respect to the other variables, then they aren't adding much information to the model.
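One quick way to eyeball that is a box plot of the response across the levels of a candidate variable; if the boxes look nearly identical across levels, the variable is probably not carrying much information. A sketch with one candidate (any of the CLASS variables could stand in for HOSP_BEDSIZE):

```sas
/* Sketch: distribution of NPR across levels of one candidate variable.  */
/* Similar-looking boxes across levels suggest the variable adds little. */
proc sgplot data=nis2.nis_2003_15N02;
   vbox NPR / category=HOSP_BEDSIZE;
run;
```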

 

I realize this is kind of all over the place, but this is the first mash-up of techniques that cover most of your concerns.

 

SteveDenham

khairul
Calcite | Level 5

Thank you very much, @SteveDenham, for the detailed response.

 

I have tried both COUNTREG and GENMOD, but neither of them lets me incorporate random effects for the ZINB distribution (the data are not panel data). The good thing is that I already have a weighting variable (DISCWT) provided with the dataset, so there is no need to generate it.

The number of hospitals is large because this is 13 years of pooled data (fewer than 1,000 hospital records each year), which represents around 20% of U.S. community hospitals. Do you think this is large enough to be considered 'fairly close' to the population size? I am hesitant to remove hospital ID (HOSP_NIS) from the random effects, since discharge records vary widely across hospital types/sizes. Moreover, there are multiple hospitalizations by the same patients (unfortunately, the data do not have patient identifiers). By the way, what did you mean by using HOSP_NIS in the fixed-effects model? Using it as a covariate, or just removing it from the model?

Besides, I have run another model for a continuous (normally distributed) dependent variable (logged costs) using PROC GENMOD. The model worked fine with a 1% sample, but kept running for 40 hours (at which point I canceled it) on the full sample (around 70 million). The code was:

 

proc genmod data=nis2.nis_2003_15N02; /* final costs model */
   class HOSP_NIS YEAR(ref=first) FEMALE(ref=first) RACEcat(ref="white") PAYER1(ref="Private_") PL_UR4(ref="Large Metro") ZIPINC_QRTL(ref=first) AWEEKEND(ref=first) ELECTIVE(ref=first) HOSP_BEDSIZE(ref=first) HOSP_LOCTEACH(ref=first) HOSP_REGION(ref=first);
   model COSTS02_log = HIV|Age_c10 YEAR FEMALE RACEcat PAYER1 PL_UR4 ZIPINC_QRTL AWEEKEND ELECTIVE HOSP_BEDSIZE HOSP_LOCTEACH HOSP_REGION / dist=gamma link=log ALPHA=0.01;
   repeated subject=HOSP_NIS / type=exch;
   weight DISCWT;
run;

 

Is the issue just the data size, or do I need to use some kind of optimization technique when running the model on the full sample?

Can you explain a little more how I can identify the "noise absorber" variables (given that all of the independent variables except age are categorical)? And why do I need to do this?

Going back to the ZINB mixed model, the REPEATED (or RANDOM) statement does not work in GENMOD with dist=zinb. So, would you suggest anything else? I am not an expert with all this, so writing my own programs is not an option! 😞

I need at least an NB mixed-effects model for my other two dependent variables, which do not have zero-inflation (# of diagnoses, length of stay).

Thanks again for your help. 

 

SteveDenham
Jade | Level 19

Judging from your replies to several issues, I think I would recommend:

 

Fit the ZINB using PROC GENMOD with the weighting you mention, but include HOSP_NIS as an effect in the MODEL statement and remove the repeated/random approach.

 

The other thing is the size of the dataset and the complexity of the model.  Have you tried running the model on a random subset of the data and seeing how long that takes?  If fitting 1% of your data helps a lot, then consider model averaging as an approach.  Get 100 or so random samples and fit them separately.  Save the results for the parameters, then resample those to get an averaged value and standard error.
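A sketch of that resampling plan, using PROC SURVEYSELECT with REPS= to draw the subsamples and BY-group processing to fit each one. The 1% rate and 100 replicates are just the figures mentioned above; the model shown is a cut-down placeholder, and the seed is arbitrary:

```sas
/* Draw 100 independent 1% simple random samples; the output dataset      */
/* contains a Replicate variable identifying each sample.                 */
proc surveyselect data=nis2.nis_2003_15N02 out=subsamples
                  method=srs samprate=0.01 reps=100 seed=20230101;
run;

/* Fit a (cut-down, placeholder) model separately within each replicate   */
/* and capture the parameter estimates.                                   */
proc genmod data=subsamples;
   by Replicate;
   class HIV(ref=first);
   model NPR = HIV Age_c10 / dist=zinb link=log;
   weight DISCWT;
   ods output ParameterEstimates=allfits;
run;

/* Average the per-replicate estimates and look at their spread.          */
proc means data=allfits mean std;
   class Parameter;
   var Estimate;
run;
```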

 

SteveDenham

