08-07-2017 10:04 AM
I am trying to run a repeated measures model in proc mixed with a large data set (~2.5 million observations) using the following code.
proc mixed data=IBMmods.ac method=REML;
class year source month id;
model DO = source / ddfm=kr solution;
repeated month / subject=id type=cs;
random year;
run;
The model runs and the output indicates that the convergence criteria were met, but the Solutions for Fixed Effects table looks wrong. The model produces parameter estimates, albeit strange ones, but the SE for each estimate is 0, as are the degrees of freedom, and t-values and p-values are not produced. I've tried running the model with different covariance structures with the same result. If I omit the random statement, the model runs fine and I get estimates that make sense, along with their SEs and DFs. I've also tried running the model with a subset of the data (~270,000 observations) with the same results. Any help or insight would be greatly appreciated. I've attached a dummy data set so that you can see the structure of the data I'm working with. Thanks.
08-08-2017 06:52 PM - edited 08-09-2017 01:20 PM
You have typos in your example dataset, but I'll presume that's not the case in the actual dataset. (id 8 and 23 are assigned to both source 1 and 2, and I am guessing that each id should be associated with only one source.)
In your example dataset, each id is associated with only one source and only one year, and there are four repeated measures on each id (one for each of four months). Consequently, id is nested within year. Your current code specifies that id, year and month are random effects factors, and that source is a fixed effects factor. Because neither year nor month is in the MODEL statement, you are assuming that the mean of DO does not vary by year or by month: year and month affect only the variance of DO. Your current code specifies that year and id are crossed random effects factors, but most of the year x id combinations have no data:
proc tabulate data=test;
class id year month source;
table source*id, year*month;
run;
I suspect that these missing combinations may be the source of your estimation problem, but I am not sure.
Assuming that my interpretation of your study design is correct, this is the model I would first consider:
proc mixed data=test;
class source id year month;
model DO = source;
random intercept source / subject=year;
repeated month / subject=id(year source) type=cs;
run;
I definitely would ponder whether year and/or month should be fixed effects factors rather than random effects factors, but your actual data set may have many more years and/or months than is evident in your example data set. In another thread, I made comments on the year random or fixed topic here: https://communities.sas.com/t5/SAS-Statistical-Procedures/How-to-analyze-a-split-plot-study-with-yea...
Edited: I changed the RANDOM syntax to one that likely works better with big datasets.
08-09-2017 01:40 PM
I played around with the code some more, and the missing id x year combinations do not seem to be an issue. So that's not the source of your estimation problem. My apologies for heading off track.
08-09-2017 03:18 PM - edited 08-09-2017 03:20 PM
There might be clues in the actual output or log, if you would like to post those.
Is the large number of observations due to many, many id levels? How many years, and how many months?
08-11-2017 10:01 AM
This is all that's displayed in the log when I run the model. There didn't appear to be anything indicating what the issue might be.
3 proc mixed data=IBMmods.actest method=REML;
NOTE: Writing HTML Body file: sashtml.htm
4 class year source month id;
5 model DO = source / ddfm=kr solution;
6 repeated month / subject=id type=cs;
7 random year;
WARNING: Class levels for ID are not printed because of excessive size.
WARNING: ODS graphics with more than 5000 points have been suppressed. Use the PLOTS(MAXPOINTS= ) option in the PROC MIXED
statement to change or override the cutoff.
NOTE: Convergence criteria met.
NOTE: PROCEDURE MIXED used (Total process time):
real time 36.96 seconds
cpu time 36.45 seconds
And the model output is attached.
There are many ids in the model spanning 4 months for each of 27 years. Individuals are different for each level of year and source. Could the large number of individuals cause problems when trying to look at the random effect of year?
I also just noticed in this output that while it says there are 731939 IDs in the class level information, there is only one subject in the dimensions table. Additionally, the output indicates that all the observations are attributed to that one subject. Any thoughts on why this is happening?
08-11-2017 02:18 PM
It seems odd that the parameter estimates for intercept and source have 0 SE and 0 df, and yet the overall test of source in the Type III Tests table does not look unusual--except for denom df = 973 which strikes me as much too small.
Is each ID coded uniquely, as in your example dataset?
Should you have four months of data for each ID? (731939 IDs times 4 months is 2927756, which does not equal the 2443672 observations, but no missing values are reported.)
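One way to check both questions is to summarize the data by ID. This is only a sketch: it assumes the dataset and variable names from the posted log (IBMmods.actest, id, year, source, month), and work.id_check with its n_ columns is a hypothetical name I made up for the summary table.

```sas
/* Count observations and distinct months, years and sources for each id. */
/* work.id_check and the n_ column names are hypothetical.                */
proc sql;
  create table work.id_check as
  select id,
         count(*)               as n_obs,
         count(distinct month)  as n_months,
         count(distinct year)   as n_years,
         count(distinct source) as n_sources
  from IBMmods.actest
  group by id;
quit;

/* Tabulate the counts across ids. */
proc freq data=work.id_check;
  tables n_obs n_months n_years n_sources;
run;
```

If each ID is coded uniquely, n_years and n_sources should be 1 for every ID. IDs with n_obs greater than n_months have more than one record for the same month, and IDs with n_months below 4 are missing months.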
I'm beginning to suspect a structural problem with the dataset, perhaps only because I don't have any other ideas.
If you haven't already, I'd compute descriptive statistics to follow up on Paige's comment about one of the variables being always missing or constant.
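A minimal sketch of those descriptive statistics, again assuming the dataset and variable names from the posted log (IBMmods.actest, DO, year, source, month, id):

```sas
/* Basic summary of the response: a constant or all-missing DO would show up here. */
proc means data=IBMmods.actest n nmiss min max mean std;
  var DO;
run;

/* Level counts for the CLASS variables. NLEVELS reports the number of levels;  */
/* NOPRINT on id suppresses its very long frequency table.                      */
proc freq data=IBMmods.actest nlevels;
  tables year source month / missing;
  tables id / noprint;
run;
```

The NLEVELS table also gives the number of years, months and IDs actually present in the data.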
For your model with REPEATED / TYPE=CS, the code below is a different parameterization of the same model (as long as the CS parameter is not negative). I'd try it, and see if I got the same results.
proc mixed data=test;
class source id year month;
model DO = source / ddfm=kr solution;
random intercept / subject=year;
random intercept / subject=id(year source);
run;
And there's always SAS Tech Support!
08-11-2017 02:04 PM
Please ignore my request for the log file; I noticed it was already posted here. Can you try running the model with the option kr=residual? If you still have the SE=0 issue, it would be best to email me your complete dataset so that I can replicate the issue and fix the bug.