11-07-2012 09:40 AM
I am trying to analyze a huge dataset with 87,000 subjects. Each subject has 4 repeated measures (1 week, 3 weeks, 6 weeks, 2 months). The model we are developing has both G-side and R-side random effects (i.e., repeated measures plus additional random effects). When I try to run the analysis, GLIMMIX gives a "Did not converge" error. I have tried all of the options GLIMMIX offers for controlling the optimization process, e.g. relaxing the convergence criterion (pconv=1e-4), increasing the maxiter parameter, and playing around with the options for controlling the inner iterations (available through the NLOPTIONS statement), but none of this works. However, I noticed that if I take a small random sample of 6,000 subjects (from the original 87,000) and run the same model on this sub-sample, GLIMMIX converges. So I suspect the issue is that the sample size (87,000 subjects) is simply too large. Also, one of the G-side random effects has 120 levels and the other has 8, so the Z matrix has 128 columns in total. Coupled with the large sample size, this may be the source of the problem I'm experiencing. Is there an upper limit on the sample size and Z matrix size GLIMMIX can handle? I know the MIXED procedure has a high-performance version (for huge datasets) called HPMIXED; does GLIMMIX have an equivalent?
A quick description of the study: The study is a multi-center trial (8 different hospitals) aimed at gauging the effectiveness of doctors strongly recommending that their patients quit smoking. For each patient, the doctor's recommendation was given only once (during one of the patient's visits to the hospital). Each patient was then asked about their smoking status 1 week, 3 weeks, 6 weeks, and 2 months (hence the 4 repeated measures) after receiving the recommendation. These 4 repeated measures form the R-side component of the model. We included the variables 'hospital' and 'doctor' (nested within 'hospital') as random effects, and these form the G-side component. As I mentioned earlier, the variable 'doctor' has 120 levels and the second G-side variable 'hospital' has 8 levels, producing a rather huge Z matrix with 128 columns.
The code is given below:
proc glimmix data=lib.data pconv=1e-4;
   class agegroup gender time user doctor hospital subjectID;
   model QuitSmoke(event='1') = agegroup gender time user time*user / dist=binary link=logit solution covb;
   random time / subject=subjectID residual type=ar(1);
   random doctor(hospital) hospital;
   nloptions technique=congra maxiter=1000 gconv=1e-4;
run;
Any help on this would be greatly appreciated!
11-07-2012 02:34 PM
What is your data set sorted by? If subjectID is a numeric column, you may want to sort by subjectID, remove it from your class statement, and modify your random statement appropriately.
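A minimal sketch of that idea, using the data set and variable names from your post (untested on your data): sort by subjectID so the procedure can process one subject at a time, and drop subjectID from the CLASS statement — a SUBJECT= variable that is not in the CLASS statement just has to appear in sorted order.

proc sort data=lib.data;
   by subjectID time;
run;

proc glimmix data=lib.data pconv=1e-4;
   class agegroup gender time user doctor hospital;   /* subjectID removed */
   model QuitSmoke(event='1') = agegroup gender time user time*user
         / dist=binary link=logit solution covb;
   random time / subject=subjectID residual type=ar(1);
   random doctor(hospital) hospital;
run;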
This is explained with some other great tips at http://support.sas.com/resources/papers/proceedings12/332-2012.pdf
Hope this helps...
11-08-2012 11:00 AM
By far, Michelle's suggestion is what you need to do. Of all the dist= options in GLIMMIX, I find dist=binary to lead to the most problems with convergence. I have felt the pain that is evident in that NLOPTIONS statement and pconv= option.
Is it possible that subjects are nested within doctors as well? If so you might be able to coalesce responses into a binomial distributed variable that calculates the proportion of subjects that have quit for each doctor.
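A sketch of what that aggregation could look like (the data set name doc_counts is made up, and I'm assuming QuitSmoke is coded as numeric 0/1 — adjust the sum() if it is character). The quitters/n events/trials syntax on the MODEL statement implies a binomial distribution, so dist= is not needed:

/* One record per doctor x time point: number who quit out of number asked */
proc sql;
   create table doc_counts as
   select hospital, doctor, time,
          sum(QuitSmoke = 1) as quitters,
          count(*)           as n
   from lib.data
   group by hospital, doctor, time;
quit;

proc glimmix data=doc_counts;
   class time doctor hospital;
   model quitters/n = time / link=logit solution;
   random doctor(hospital) hospital;
run;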
If, however, you still run into convergence problems, you might consider pre-processing your dependent variable by applying the logit transform (or arcsin(sqrt(y)), or whatever other transformation was in good standing before generalized linear models were developed) and then moving to HPMIXED. And just to make sure I understand everything correctly, the variable 'user' carries the information of whether the doctor strongly recommended quitting, correct?
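For instance, a sketch of the empirical-logit route, assuming a hypothetical doctor-level data set doc_counts with columns quitters and n per doctor and time point (the +0.5 continuity correction keeps the logit finite when a proportion is 0 or 1):

data doc_logit;
   set doc_counts;   /* hypothetical: quitters and n per doctor/time */
   elogit = log((quitters + 0.5) / (n - quitters + 0.5));
run;

proc hpmixed data=doc_logit;
   class time doctor hospital;
   model elogit = time / solution;
   random doctor(hospital) hospital;
run;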
11-09-2012 07:21 AM
Hi Steve, thanks for your response! I will look into whether it's feasible to make doctors (using the proportion of patients they convinced to quit as the response) rather than patients the basic units of the analysis. Also, the logit transform is a good idea, that may have to be my last resort if this dataset proves to be beyond the capabilities of GLIMMIX. Thanks!
11-08-2012 09:37 PM
Thanks Michelle. Currently the dataset is sorted by timepoint, and within each timepoint the subjects are sorted by their subjectID (i.e. in PROC SORT I used "by time SubjectID"). Yes, SubjectID is a numeric variable. I am not exactly sure how the order in which my dataset is sorted affects convergence in the GLIMMIX procedure. As you suggested, I can sort by SubjectID and remove it from the class statement, but I'm not sure what you mean when you say I should "modify the random statement appropriately". My study is a repeated measures study: each subject (represented by SubjectID) has measurements at 4 timepoints (so there are 4 records per SubjectID). Therefore I find it necessary to include the "random time /subject=subjectID residual type=ar(1)" statement in order to specify the correlation structure among each subject's 4 repeated measures. I'm not sure I see any other way of stipulating this structure in the model.
Thanks for the article you included; it was really useful. One of the examples (the 3-level HLM model) made me realize that I could recode the "doctor" variable so that it was properly nested within the "hospital" variable, and this led to a new "doctor" variable with far fewer levels (65, down from 120). However, this model didn't converge either.
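In case it helps anyone else, the recoding can be done with BY-group processing, roughly along these lines (a sketch; doc_num is a made-up name for the within-hospital doctor index, and it replaces doctor in the doctor(hospital) random effect):

proc sort data=lib.data;
   by hospital doctor;
run;

data recoded;
   set lib.data;
   by hospital doctor;
   if first.hospital then doc_num = 0;   /* restart numbering in each hospital */
   if first.doctor   then doc_num + 1;   /* sum statement: value is retained across rows */
run;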
11-08-2012 10:28 PM
My reference to modifying the random statement appropriately was because I thought you were running proc mixed. Sorry... Great to hear that the article helped with your new "doctor" variable. As a thought, perhaps sort by SubjectID and time, then remove SubjectID from the class statement and see how that goes. Another suggestion is to specify an alternate method for computing the denominator degrees of freedom for the tests of fixed effects. Since you are using an AR(1) covariance structure, perhaps the Kenward-Roger method with the firstorder suboption will help: it avoids computing second derivatives for the covariance matrix adjustment, which may ease the computation... so your model statement becomes:
model QuitSmoke(event='1') = agegroup gender time user time*user /dist=binary link=logit solution covb ddfm=kr(firstorder);
As documented at https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_glimmix_a00... (and click on DDFM).
11-09-2012 07:39 AM
To add to Michelle's comment regarding Kenward-Roger degrees of freedom: if you are on SAS/STAT 12.1, try ddfm=kr2. The algorithm has been updated and is much more stable. However, I don't think it will help with convergence. My take on the process flow is: converge to a stable variance-covariance matrix, estimate parameters, then apply denominator degrees of freedom for the calculation of tests and intervals. If you have not achieved convergence, the rest doesn't get done. No reason not to apply the better estimate, though, since you will get convergence eventually. I really think coalescing to doctor as the unit of analysis should get this done (and if not, then maybe fitting the model piecewise by timepoint to see what might be messing things up is in order).
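Fitting the model piecewise by timepoint might look roughly like this (untested; with a single observation per subject at each timepoint, the R-side statement drops out, and time and time*user leave the fixed effects):

proc sort data=lib.data;
   by time subjectID;
run;

proc glimmix data=lib.data;
   by time;   /* one fit per timepoint */
   class agegroup gender user doctor hospital;
   model QuitSmoke(event='1') = agegroup gender user
         / dist=binary link=logit solution;
   random doctor(hospital) hospital;
run;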