I have a complicated modeling problem and have been struggling with astronomical runtimes on top of homing in on the correct model specification.

Quick background without getting too specific: I have data over time for all people who are part of a huge organization. These people belong to large, geographically distinct groups, and within those groups they are members of smaller units (ranging in size from 5 to 500 people at any one time). People move between units and between groups, but not that often. I have 8 years of data for 1.9 million people.

Because of the size of the dataset, I've chosen my time unit to be a quarter (~90 days), so my unit of analysis is a person-unit-quarter. If a person is in two units in a quarter, he has two rows in my dataset, each indicating the number of days in the quarter that he spent in that unit. Right now, I'm stratifying by group and running a separate model for each group to further decrease the sample size and runtime.

Example of data structure:

    Person  Quarter  Group  Unit  Days
    1       1        X      A     90
    1       2        X      A     40
    1       2        X      B     50
    2       1        X      A     2
    2       1        X      C     30
    2       1        X      A     58

For each person at each time point, I am calculating their cumulative exposure to a specific experience as well as their unit's cumulative exposure to the same experience (defined as the sum of the cumulative exposure of everyone else in their unit during that quarter). I'm trying to model the relationship between these two exposures (theirs and their peers') and the risk of having a specific kind of mental health medical visit during the quarter (0/1). In other words: "Does an increase in a person's unit's cumulative exposure increase the person's probability of having a visit, even when controlling for their personal cumulative exposure?"

So my model is a multilevel logistic regression that I'm running in glimmix with random effects at the unit and at the person level.
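To make the setup concrete, here is a rough sketch of the peer exposure and the fixed-effects part of the model in equation form. The notation is my own shorthand and not tied to any variable names in the data: Y is the 0/1 visit indicator, D is days spent in the unit that quarter, E is a person's cumulative exposure, P is the peers' exposure, x collects the other covariates and year dummies, and U(u,t) is the set of people in unit u during quarter t. Only the unit random intercept is shown; the person-level dependence is left out of this sketch.

```latex
% Peers' exposure: everyone else in person i's unit u during quarter t
P_{iut} = \sum_{j \in \mathcal{U}(u,t),\; j \neq i} E_{jt}

% Linear predictor, conditional on the unit random intercept b_u
\operatorname{logit} \Pr(Y_{iut} = 1 \mid b_u)
  = \beta_0 + \beta_1 D_{iut} + \beta_2 E_{it} + \beta_3 P_{iut}
    + \mathbf{x}_{iut}^{\top} \boldsymbol{\gamma} + b_u,
\qquad b_u \sim N(0, \sigma_u^2)
```

In this notation, the research question is whether \beta_3 is positive once \beta_2 E_{it} is already in the model.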
I am specifying the autoregressive covariance structure on the individual-level random effect because I have repeated measures for each person over time. I'm also controlling for the number of days spent in the unit/quarter because the more time spent in a unit/quarter, the more likely a person is to experience a visit.

Here is a simplistic version of my code:

    proc glimmix data=DATA;
      class unit_id;
      model visit (event="1") = days_in_unit_quarter personal_exposure unit_exposure
                                other_covariates year_dummies
            / dist=binary link=logit solution;
      /* random intercept for each unit */
      random intercept / subject=unit_id;
      /* AR(1) residual covariance across each person's repeated rows */
      random _residual_ / subject=person_id type=ar(1);
    run;

One big problem is that specifying the ar(1) covariance structure balloons my runtime to the point where it's pretty much unreasonable. When I use type=vc, the model finishes in a few minutes.

Questions:

1. Is this a reasonable approach to my research question above, or would you approach it differently?
2. Are there other options to avoid or diminish the long runtime I'm experiencing with the ar(1) term? (toep? other tricks?)
3. Would switching to proc nlmixed be advantageous? (This seems cumbersome but possibly worth it.)

Any input or feedback is greatly appreciated!
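For anyone who wants the ar(1) term spelled out: as I understand it, type=ar(1) on the _residual_ line gives the rows within one person a correlation that decays with how far apart they sit in the data. A sketch in my own notation, where j and k are the positions of two rows within person i, and \sigma^2 and \rho are the two covariance parameters being estimated (the binary variance function that GLIMMIX also applies to residual-side structures is omitted here):

```latex
\operatorname{Cov}(e_{ij}, e_{ik}) = \sigma^2 \rho^{|j-k|}
```

type=vc corresponds to the \rho = 0 case, with no within-person correlation, which is presumably part of why it runs so much faster.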