righcoastmike
Quartz | Level 8

Hi All, 

 

I've been struggling to run PROC GLIMMIX on a very large dataset (70 variables and approximately 130,000 records). It's a mixed effects model with nested random effects and a Poisson distribution: records nested in patients, nested in geographic regions. The big problem was trying to put patients on the CLASS statement; there are so many of them that it keeps causing a memory error. After a lot of research I've found code that gives me results, but I'm not totally confident that they are correct. In short, my code works, but I can't tell if it works properly. If someone could take a look and see if my code makes sense, it would be much appreciated. 

 

Here's what I'm running: 

 

 

proc glimmix data=work.dataset ic=q;
class geography patient_ID; 
model outcome_var = a b c d  / solution dist=poisson link=log;
random intercept / solution subject=geography;
random _residual_ / solution subject=patient_ID(geography);
covtest / wald; 
nloptions tech=nmsimp; 
run;

Any insights as to whether I can trust the results coming out of this code would be much appreciated. 

 

Thanks so much. 

3 REPLIES
sld
Rhodochrosite | Level 12

I think I don't like the second RANDOM statement because PATIENT_ID is not the "bottom" unit in the design (i.e., the "residual"); RECORD is. I don't know what you've tried, but this is where I would start:

 

proc glimmix data=work.dataset ic=q;
class geography patient_ID; 
model outcome_var = a b c d  / solution dist=poisson link=log;
random intercept / solution subject=geography;
random intercept / solution subject=patient_ID(geography);
run;

If there were evidence of overdispersion, you could add a scale parameter

 

proc glimmix data=work.dataset ic=q;
class geography patient_ID; 
model outcome_var = a b c d  / solution dist=poisson link=log;
random intercept / solution subject=geography;
random intercept / solution subject=patient_ID(geography);
random _residual_;
run;

or add an observation-level variance

 

proc glimmix data=work.dataset ic=q;
class geography patient_ID record; 
model outcome_var = a b c d  / solution dist=poisson link=log;
random intercept / solution subject=geography;
random intercept / solution subject=patient_ID(geography);
random intercept / solution subject=record(patient_ID geography); 
run;

or switch from Poisson to negative binomial, or to a generalized Poisson.
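
For example, a negative binomial version of the model above is a one-keyword change (just a sketch, reusing the variable names from your post):

proc glimmix data=work.dataset ic=q;
class geography patient_ID; 
model outcome_var = a b c d / solution dist=negbin link=log; /* negative binomial instead of Poisson */
random intercept / solution subject=geography;
random intercept / solution subject=patient_ID(geography);
run;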

 

Speculating wildly, and noting that GLIMMIX allows random effects to be either classification or continuous (see the documentation for the RANDOM statement in GLIMMIX), you could try incorporating patient_ID (and possibly geography) as a continuous effect, i.e., remove it from the CLASS statement. Extrapolating from the documentation for the GROUP= option on the RANDOM statement, a continuous random effect might execute more quickly and use less memory, but I am just guessing here. For sure, you'd have to sort the dataset correctly. A rough sketch of that idea is below.
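
Purely as a sketch (untested, and assuming patient_ID is numeric and the data are sorted by subject), it might look like:

proc sort data=work.dataset;
by geography patient_ID;
run;

proc glimmix data=work.dataset ic=q;
class geography; /* patient_ID deliberately left off CLASS so it is treated as continuous */
model outcome_var = a b c d / solution dist=poisson link=log;
random intercept / solution subject=geography;
random intercept / solution subject=patient_ID(geography);
run;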

 

Whoa, 70 variables is a lot. Variables a, b, c, and d are incorporated as continuous variables in your example model, so you're doing linear regression. Lots of challenges: linearity (on the link scale), multicollinearity possibilities, influential observations. You don't identify the level at which these predictors are observed (geography, patient or record). Regression in a mixed model is also known as a random coefficients model; the models above are only random intercepts: they assume that slopes have no random variance. You could assess random slopes--theoretically. In practice, if you are already having memory problems, adding more covariance parameters to the estimation list is not going to make your modeling life easier.
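
Just to make the random-coefficients idea concrete, a random-slope version for a single covariate might look something like the sketch below; keep in mind that each random slope adds covariance parameters, so this will only worsen the memory problem:

proc glimmix data=work.dataset ic=q;
class geography patient_ID; 
model outcome_var = a b c d / solution dist=poisson link=log;
random intercept / solution subject=geography;
random intercept a / solution subject=patient_ID(geography) type=un; /* random intercept and slope for a, unstructured covariance */
run;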

 

If you try the continuous random effects thing, I'd be curious to know how that works out.

 

 

righcoastmike
Quartz | Level 8
Hi Sld,

Thanks so much for your thoughtful reply; I'll give some of these things a try and get back to you. Your first example is how I had the code written originally, and it gave me the memory error. I think it's because I have approximately 80,000 different subjects in my study, and that's a lot of levels for GLIMMIX to handle; I've read that anything over 1,000 can be tricky.
That being said, your point about the patient_ID not being the “bottom” unit is well taken.

I’ll play with some things and see what I can do in terms of adding patient_ID as a continuous effect and report back!

Stay tuned,

Rightcoast.


sld
Rhodochrosite | Level 12

My clients do not have big data sets, to put it mildly, so memory issues are not my forte. I'm intrigued by incorporating random factors as continuous rather than classification. And if that doesn't work... there are folks out there who deal with memory issues, and I would not hesitate to touch base with SAS Tech Support.

 

Good luck and have fun!

 

