02-02-2017 05:00 PM
I am new to utilizing SAS 9.4 full time for analyses. I am working on my dissertation data and I know that the best way to analyze my data is a generalized linear mixed model. No one else in my lab utilizes SAS; they prefer SPSS, which does not do as good a job with very complex stats in my opinion.
My data is a secondary analysis of a year-long dataset on women who were undergoing an intervention to regain menstrual function. My response variable (resume) is whether or not they resumed menses during the study.
I am interested in whether body composition (of which I have 12 variables) changed the odds of menstrual recovery and am not interested in the intervention effect (rnd) for this analysis. My main concern is fitting the model so that a subject-specific random intercept is assumed and the uneven repeated-measures structure of the data (and the number of measures per subject) is considered in the analysis.
I am not sure the solution I have come up with following extended reviews of the GLIMMIX literature reflects the analysis I am interested in.
Proc Glimmix data=resumption;
Class period id rnd;
Model resume = wt bmi perbf fm fmi lmi lpct tpct tlpctr apct gpct agpctr/ link=logit dist=binomial or solution;
Random int/subject=id type=ar residual;
My data is set up like:
Options nocenter pageno=1;
Input period $ id rnd wt bmi perbf fm fmi lmi lpct tpct tlpctr apct gpct agpctr resume;
The time points at which I have body composition variables are not evenly spaced (screening, and intervention weeks 5, 9, 21, 33, and 49). Duration of time to resumption is not of interest in my analysis. Participants who resumed menstrual function did so at various time points, and many did not resume. Not all of the participants made it through the study to intervention week 49.
Not all women have the apct, gpct, and agpctr variables due to a change in the machine used to evaluate body composition. The apct and gpct variables have been shown to differ significantly at the time of resumption compared with non-resumers when analyzed with a Hotelling's T2 test; therefore, I want to keep these variables in the analysis.
I understand that most of my variables are correlated; however, only 7 of the variables are correlated above a rho = 0.95 and none are at 0.99.
Thanks for any support!
02-04-2017 01:40 PM
I have a total of 30 participants in the analysis I am completing (an approximately even split of those who resumed and those who did not). I chose those specific participants to show the variability in when participants withdrew or resumed menses in the study. The information provided (though only a representation of the actual data) shows the randomness of the missing data for the various variables within any one participant.
02-04-2017 02:32 PM
I'll offer some thoughts. Hopefully other people will weigh in as well.
1. Consider a "time to event" analysis, where event is resumption of menses. Given the extent of censored observations (i.e., women who withdrew, and women who did not resume menses before the end of the study), I doubt that a GLMM with a binary response will work well. The text by Singer and Willett (Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence) and the code here http://www.ats.ucla.edu/stat/r/examples/alda/ could be a good place to start.
2. 30 subjects, about half of whom exhibited the event, is a small sample. Keep in mind your effective sample size and the risk of over-fitting.
3. With correlations that high among the body condition covariates, you will surely have issues with (multi)collinearity. Consider a more complete assessment of multicollinearity, and then some form of dimension reduction, e.g., dropping predictor variables, forming an index, principal components analysis, factor analysis. Again, keep in mind your effective sample size and the risk of over-fitting.
4. You say you are not interested in the "intervention" effect. Are different women exposed to different interventions in your dataset? If so, I would not think that you could ignore it unless intervention absolutely does not matter.
5. The missing covariate values will pose problems. Consider imputation? Or drop those covariates with missing values if they are highly correlated (and thus redundant) to other covariates.
6. These thoughts are all independent of the software that you use, whether SPSS or SAS (or anything else).
7. I suspect this analysis will be a challenge, quite probably more than can be dealt with adequately in this forum. Consider finding someone at your institution with statistical expertise to guide you.
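As a rough illustration of the principal components option in point 3, something like the sketch below could be a starting point. The variable names come from the original post; the output dataset name (bodycomp_pc) and the choice of n=3 components are placeholders I made up, and apct, gpct, and agpctr are left out here only because PROC PRINCOMP omits any row with a missing value. Note that this pools all rows across periods and ignores the repeated measures within a woman, so treat it as exploratory only.

```sas
/* Exploratory PCA on the body composition covariates.          */
/* Assumptions: dataset "resumption" as in the original post;   */
/* "bodycomp_pc" and n=3 are hypothetical placeholder choices.  */
proc princomp data=resumption out=bodycomp_pc n=3;
   var wt bmi perbf fm fmi lmi lpct tpct tlpctr;
run;
```

The OUT= dataset holds the component scores (Prin1, Prin2, ...), which could then stand in for the original covariates in a later model.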
02-04-2017 03:23 PM
Thanks for the input.
1. We are working with trying to understand whether or not time to event matters. In the current analysis we want to proceed under the assumption that time to event doesn't matter, what matters is the actual body composition no matter how long it took the participant to reach that value. In another analysis we are going to evaluate the time to event aspect of the study.
2. Yes, we are considering overfitting. We are fitting the GLMM based on the results of Correlation analyses with resumption as well as Hotelling T2 analyses. This way I have dropped to 7 predictors.
3. I am limited in my knowledge of ways to evaluate multicollinearity. I know to check correlations; however, the value of rho at which to be concerned varies. I know I can also look at VIF values. Do you have suggestions on the best/most appropriate way to evaluate multicollinearity?
4. The women went through a free-living refeeding intervention; however, women in both arms resumed. My focus is the body composition changes, which are independent of the intervention in some respects. This is a secondary analysis; I can add the randomized group in as a predictor, but then I again risk overfitting.
5. The one predictor of interest is missing in a subset of the sample due to a change in the machine being used. This data cannot be imputed.
6/7. I am working with an individual at my institution on the best analysis to complete; however, he does not use SAS and suggested using the SAS forum for guidance on the structure of the SAS coding. As a student I am trying to explore the statistics independently as well as with his guidance.
02-04-2017 04:41 PM
Thank you for the informative responses. My turn!
1. Because of the censoring and because the body condition covariates are time-varying, I'm not seeing how a good binary GLMM can be constructed. The resume values don't switch back and forth across periods between N and Y depending on body conditions; a resume value is N until it (possibly) becomes Y. Looks like event data to me.
2 and 3. Your approach to variable selection is arguably treacherous. But I concur that you have too many predictor variables for your sample size. Even with this process of down-sizing to 7 variables, you still need to assess multicollinearity among them. A quick Google of "multicollinearity detection" generates a lot of useful links. Multicollinearity is a property of the predictor variables, so you can read about multicollinearity in regression texts that deal with normal-distribution response and use multicollinearity statistics generated by regression software (like REG).
4. If body conditions are largely a consequence of intervention, then it would be appropriate to drop intervention.
5. By default, if an observation contains a missing value for one or more predictor variables, the software will drop that observation in the model fitting; this is true for most (if not all) software packages. So, you need to deal with those missing values in some fashion, or you'll lose those observations.
6 and 7. I commend you for not blindly using stat software! This forum is usually good to help with code when you provide enough detail to identify an appropriate model for a corresponding dataset. In this case, as you can tell, I don't believe a binary GLMM is appropriate so I cannot help with syntax for that model (or should not, but see below). If you switch to a time-to-event model, this website which I linked to earlier http://www.ats.ucla.edu/stat/r/examples/alda/ has SAS code examples, and the forum could help with code if you hit snags.
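For items 2, 3, and 5, a minimal sketch of the kind of checks I have in mind is below. I don't know which 7 predictors you kept, so the MODEL list is a stand-in using names from your INPUT statement. The response on the left side of the MODEL statement does not affect the collinearity statistics, since VIF, tolerance, and the condition indices depend only on the predictors. This assumes resume is stored as a numeric variable, as your INPUT statement suggests.

```sas
/* Collinearity diagnostics among the predictors (PROC REG).   */
/* The predictor list is a hypothetical stand-in for the 7     */
/* variables actually retained.                                */
proc reg data=resumption;
   model resume = wt bmi perbf fm fmi lmi lpct / vif tol collin;
run;
quit;

/* Count non-missing (N) and missing (NMISS) values of each    */
/* covariate by period, to see how many rows a model would     */
/* drop because of missing predictors.                         */
proc means data=resumption n nmiss;
   class period;
   var wt bmi perbf fm fmi lmi lpct tpct tlpctr apct gpct agpctr;
run;
```

Like the PCA idea, the PROC REG step pools rows across periods, so read the diagnostics as a rough guide rather than a formal test.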
In the interest of your ongoing education, I will add that if I liked your approach, which I don't, I would start with
proc glimmix data=resumption method=laplace;
class period id;
model resume = wt bmi perbf fm fmi lmi lpct tpct tlpctr apct gpct agpctr period / link=logit dist=binomial or solution;
random int / subject=id;
random period / subject=id type=ar(1);
run;
This sort of model almost always requires some twiddling, even beyond that needed to identify a good covariance structure type. Including "type=ar(1)" and "residual" on the first random statement is wrong. Including both random statements as here identifies an AR(1)+RE covariance structure. Omitting the first random statement and retaining only the second identifies an AR(1) covariance structure. See Littell et al. http://onlinelibrary.wiley.com/doi/10.1002/1097-0258(20000715)19:13%3C1793::AID-SIM482%3E3.0.CO;2-Q/... or Stroup (2013) Generalized Linear Mixed Models for details about the distinction between AR(1)+RE and AR(1). If you had enough subjects and enough periods and more complete longitudinal data, you could consider a model with random slopes, specified as
random wt bmi perbf fm fmi lmi lpct tpct tlpctr apct gpct agpctr / subject=id type=vc;
but this is way too complex for what you have to work with.
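For comparison, the AR(1)-only structure described above (dropping the random intercept statement and keeping only the AR(1) random statement) might look like the sketch below. It reuses the same dataset and variable names and is meant only to show the syntax difference, not as a model I would recommend for these data.

```sas
/* AR(1)-only covariance structure: no separate random intercept. */
/* Same dataset and variables as in the original post.            */
proc glimmix data=resumption method=laplace;
   class period id;
   model resume = wt bmi perbf fm fmi lmi lpct tpct tlpctr apct gpct agpctr period
         / link=logit dist=binomial or solution;
   random period / subject=id type=ar(1);
run;
```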