I have a complicated modeling problem and have been struggling with astronomical runtimes on top of narrowing in on correct model specification.
Quick background without getting too specific: I have data over time for all people that are part of a huge organization. These people are part of large, geographically distinct groups, and within those groups, they are members of smaller units (varying in size between 5 people to 500 at any one time). People move between units and between groups, but not that often. I have 8 years of data for 1.9 million people. Because of the size of the dataset, I've chosen my time unit to be a quarter (~90 days), so my unit of analysis is a person-unit-quarter. If a person is in two units in a quarter, he has two rows in my dataset, each indicating the number of days in the quarter that he spent in that unit. Right now, I'm stratifying by group and running a separate model for each group to further decrease the sample size and runtime.
Example of data structure:
Person Quarter Group Unit Days
1 1 X A 90
1 2 X A 40
1 2 X B 50
2 1 X A 2
2 1 X C 30
2 1 X A 58
For each person at each time point, I am calculating their cumulative exposure to a specific experience as well as their unit's cumulative exposure to the same experience (defined as the sum of the cumulative exposure of everyone else in their unit during that quarter). I'm trying to model the relationship between these two exposures (theirs and their peers') to the risk of having a specific kind of mental health medical visit during the quarter (0/1). "Does an increase in a person's unit's cumulative exposure increase the person's probability of having a visit even when controlling for their personal cumulative exposure?"
So my model is a multilevel logistic regression that I'm running in glimmix with random effects at the unit and at the person level. I am specifying the autoregressive covariance structure on the individual-level random effect because I have repeated measures for each person over time. I'm also controlling for the number of days spent in the unit/quarter because the more time spent in a unit/quarter, the more likely a person is to experience a visit. Here is a simplistic version of my code:
proc glimmix data=DATA;
class unit_id;
model visit (event="1") = days_in_unit_quarter personal_exposure unit_exposure other_covariates year_dummies
/ dist=binary link=logit solution;
random intercept / subject=unit_id;
random _residual_ / subject=person_id type=ar(1);
run;
One big problem is that the specyfing the ar(1) covariance structure balloons my runtime to the point where it's pretty much unreasonable. When I use type=vc, the model finishes in a few minutes.
Questions:
1. Is this a reasonable approach to my research question above or would you approach it differently?
2. Are there other options to avoid or diminish the long runtime I'm experiencing with the ar(1) term? (toep? other tricks?)
3. Would switching to proc nlmixed be advantageous (this seems cumbersome but possibly worth it)?
Any input or feedback is greatly appreciated!
Is there a natural basis for "rolling up" the data to an events/trials syntax? If so, that might help.
Other things to consider: Try a different optimizer with an NLOPTIONS statement. The default quasi-Newton method and binary distribution don't always work so well together. Another thing would be to model the repeated nature as G side, rather than as an overdispersion.
And finally, consider subsampling the data and then bootstrapping the results. I cannot imagine the size of the design matrix here, and consequently the variance-covariance matrix, which is causing I/O problems that lead to the long run times.
Steve Denham
"Is there a natural basis for "rolling up" the data to an events/trials syntax? If so, that might help."
Can you explain a little more what you mean by this?
I had considered sampling. Thank you!
Well, by rolling up, I mean finding a categorical variable that the individuals could be aggregated on. This is essentially a sort of stratification.
Other things of note--I think it would help to model this over time as a repeated measure, rather than over individuals as an overdispersion. Do the variables "year_dummies" index the repeated observations on individuals? If so, consider removing them, adding year as a class variable, and shifting the last RANDOM statement to something like:
random year/residual subject=person_id type=ar(1);
Also be certain that your dataset is sorted by person_id, since it does not appear in the CLASS statement (and you probably do not want to include it, based on the size of the dataset).
Steve Denham
Thank you for your responses. I still don't quite understand what you mean by aggregating on a categorical variable, but I have many individual-level covariates that I need to preserve, so I don't think that is a good option for me.
When I tried your suggestions about modeling over time as a repeated measure, I get an error in the log: "Model is too large to be fit by PROC GLIMMIX in a reasonable amount of time on this system. Consider changing your model." So that doesn't seem like it will work.
Yes I've been sure to sort my data correctly.
I'm going to try the sampling technique and see how I fair with that. Thanks again for your input.
Update:
I decided to try a bootstrapping approach to be able to run this model given the huge amount of data, which is my main obstacle here.
Since my data is clustered at the person and the unit level, I want to double check whether my sampling approach is valid. I decided to sample at the unit level with replacement. For each of the 1000 runs of the model, I am sampling a random 2.5% of all units in the data and pulling in all observations that are associated with those units. This means that because people move between units, I am not pulling all observations for each person.
The model runs on these samples in about 15 minutes, so if I run the code in parallel, it still takes a few days to run 1000 samples, but at least it's possible to get some estimates.
This is the full code:
proc surveyselect data = DATA_FULL out=DATA_SAMPLE method=srs samprate=.025;
samplingunit unit_id;
run;
proc glimmix data=DATA_SAMPLE;
class unit_id;
model visit (event="1") = days_in_unit_quarter personal_exposure unit_exposure other_covariates year_dummies
/ dist=binary link=logit solution;
random intercept / subject=unit_id;
random _residual_ / subject=person_id type=ar(1);
run;
Is this a valid approach?
Thank you,
Alison
This looks good, Alison. The key here is to make sure the data are sorted by person_id and by observation order within person_id, and that the observation ordering is equally spaced. Given all of that, you should be in good shape.
Now how about the model averaging--simple bootstrap is easiest to code from the 1000 results. I would worry about regression to the mean, though, with the R side approach (marginal estimates). If it runs in a reasonable time frame, consider the following:
proc glimmix data=DATA_SAMPLE method=laplace;
class unit_id;
model visit (event="1") = days_in_unit_quarter personal_exposure unit_exposure other_covariates year_dummies
/ dist=binary link=logit solution;
random intercept / subject=unit_id;
random intercept / subject=person_id(unit_id) type=ar(1);/* really wish there was an obvious time indexing variable here */
run;
I have absolutely no feel for whether this will give what you need, but I've been reading Stroup again, and I worry about marginal values, especially with an overdispersion.
Steve Denham
The observation ordering is not equally spaced. For the most part, each observation is associated with one quarter (~90 days), but because people move between units so much, some observations are associated with shorter periods of time. I have a variable in the model "days_in_unit_quarter" that is the exposure time for that observation. Is it still a problem that the observations are not equally spaced?
To reiterate, I have all information for the entire population over a 10 year period. The data structure is individuals and their individual-level characteristics that contribute to the outcome, nested within units that have their unit-level characteristics that contribute to the outcome. I'm assuming there is something I am not measuring about the units that can be described as the "unitness" that I am trying to capture in the random effect in the model. Individuals move between units, and I'm using the quarter as my unit of time because if I use day, the dataset would be unmanagebly large (but I could use day if that is necessary). So I have repeated measures of individuals that are nested within units but the individuals move between units over time sometimes in the middle of the quarter. The data are very complicated, but also very complete, so it is just a matter of setting it up the right way and asking it the right questions, I think!
I am trying you're suggestion below. Can you expand on what you mean by your comment "really wish there was an obvious time indexing variable". I have all of the information about time, so I can create any kind of time variable I might need.
Alison
Steve,
I tried the code you suggested, but I keep getting an error message for insufficient memory no matter how small of a sample I take. I don't think it will work.
I don't think I understand your design well enough to say whether the bootstrap method is valid, but certainly that can't be the full code: You haven't specified the bootstrap looping variable, turned off the output, or created output data sets for your results. At the very least you need to add a few lines:
proc surveyselect data = DATA_FULL out=DATA_SAMPLE method=srs samprate=.025 REPS=1000;
samplingunit unit_id;
run;
ods graphics off;
ods exclude all;
proc glimmix data=DATA_SAMPLE;
class Replicate unit_id;
model visit (event="1") = days_in_unit_quarter personal_exposure unit_exposure other_covariates year_dummies
/ dist=binary link=logit solution;
random intercept / subject=unit_id;
random _residual_ / subject=person_id type=ar(1);
ODS OUTPUT TABLE1=DS1 TABLE2=DS2 ETC;
run;
ods exclude none;
If you need a refresher course on efficient bootstrap techniques, see
Since you claim that your model is slow to run, I'd also suggest that you specify a REPS= value that will run overnight, like REPS=200.
That way you can look at partial results in the morning. Then do the same thing the next night with new file names and use PROC APPEND each morning to maintain a data set that contains the total bootstrap results to date.
If you can get a friend or two to allow you to use their computers to launch the driver program overnight, you can crank out this bootstrap in 1-2 evenings.
You're right Rick_SAS. The "full code" I excerpted is within a macro that I run 1000 times. I'm not familiar with the way you've coded it, but do you think it would run faster than below?
%macro loopy(num=);
proc surveyselect data = DATA_FULL out=DATA_SAMPLE method=srs samprate=.025;
samplingunit unit_id;
run;
proc sort data = DATA_SAMPLE; by person_id order_var;
proc glimmix data=DATA_SAMPLE noclprint ic=pq;
class unit_id;
model visit (event="1") = days_in_unit_quarter personal_exposure unit_exposure other_covariates year_dummies
/ dist=binary link=logit solution stdcoef;
random intercept / subject=unit_id;
random _residual_ / subject=person_id type=ar(1);
ods output parameterestimates = test.param_output_&num
StandardizedCoefficients = test.stdcoef_output_#
run;
%mend;
%loopy(num=1);
%loopy(num=2);
etc.
As David Cassell famously said, "Don't be loopy!" Read the blog posts I linked to. Since you claim that each iteration requires 15 minutes, the time saving won't be as noticeable as the example in my blog. However, IMHO it is always helpful to start building good habits by bootstrapping efficiently.
I understand that based on the surveyselect proceedure, the output file DATA_SAMPLE has the 1000 replicate samples of size 0.025 of given data DATA_FULL which is still 2.5 times bigger then the original data file. I don't understand how SAS will know which replicate sample to chose without parsing the data, or without using the Replicate variable with BY statement, or without a loop.
I am running similar models and the problem persists after the fix; I will appreciate if you explain this a bit more. Secondly will you kindly also explain what you mean by ODS OUTPUT TABLE1=DS1 TABLE2=DS2 ETC and how we can actually impliment if we don't know the exact number of tables we will get.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.