Hi, I am working on a project with over 7,000 employer groups covering 2007-2013, and I need to run a regression model with expenditure as the dependent variable. The independent variables include employer group, calendar year, and the interaction between employer group and year, among others. I need to treat employer group as a fixed effect. Since each employer group has more than one year of data, these are repeated measures, so I also need to account for the within-group correlation by clustering on employer group.
I started with PROC MIXED, but it seems SAS is not able to run PROC MIXED with this many dummy variables.
Then I tested SAS's capacity with PROC GLM, and SAS is able to handle this many dummies there. However, PROC GLM does not correct for the correlation among repeated measures. (The data are in univariate format and cannot be transformed to multivariate format, because doing so would lose other independent variables, such as year.)
So I am back to basic PROC SURVEYREG, which has a CLUSTER statement to handle the correlation of repeated measures. However, since PROC SURVEYREG does not include a CLASS statement, I am facing creating over 7,000 dummy variables (already done: employer_group 1 - employer_group 7000) and including them in PROC SURVEYREG. This sounds crazy. I don't know how to get 7,000 dummy variables into PROC SURVEYREG without actually typing out 7,000 variable names. Any ideas? Thanks!
PROC GLMMOD will create the dummy variables for a main effect of this categorical variable, and/or create dummy variables for interactions with this categorical variable, if you wish.
However, I am very skeptical of the idea of having a regression with 7000 dummy variables in it, because it seems to me, without seeing the data, that it is doomed to failure. That many dummy variables are going to be fitting random noise as much as they are fitting a real signal. This is called "overfitting" the model.
I think I figured it out: just writing "employer_group1-employer_group7000" as a numbered range list should work :)
Any other comments are welcome. Thanks!
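For readers following along, the shorthand mentioned above is SAS's numbered range list, which expands to every variable sharing the prefix and a numeric suffix in the range. Here is a minimal sketch of how it might be used; the data set name `design_data` and the cluster variable `employer_id` are hypothetical, and the dummy names assume no space before the number.

```sas
/* Hypothetical names: DESIGN_DATA holds the response, the 7000 dummies, */
/* and EMPLOYER_ID, a copy of the employer group used for clustering.    */
proc surveyreg data=design_data;
   cluster employer_id;
   /* Numbered range list: expands to employer_group1, employer_group2, ... */
   model expenditure = employer_group1-employer_group7000 year;
run;
```

Note that a full set of dummies is linearly dependent with the intercept, so you may want to leave one dummy out as the reference level.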
Thank you so much! I just looked up PROC GLMMOD, and it sounds very helpful. Can you provide some sample code for my case, especially for how to "create dummy variables for interactions with this categorical variable"?
There is an example in the PROC GLMMOD documentation that demonstrates how it works for interactions. I don't think repeated measures fit in this framework unless you re-parameterize the model (I'm not sure if that's possible and I can't explain how to do it; maybe someone else can).
So my comment about "doomed to failure" is going to be ignored here? Of course, that's your choice, but it was meant as a "red flag".
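As a rough illustration of the PROC GLMMOD approach discussed above (not the documentation example itself), the sketch below builds design-matrix columns for both main effects and their interaction. The data set `claims` and the variables `employer_group`, `year`, `expenditure`, and `cluster_id` are assumed names, with one row per employer group per year.

```sas
/* Hypothetical input CLAIMS: one row per employer group per year.      */
/* Copy the group into a separate variable so it can be carried along   */
/* as an ID and used later for clustering.                              */
data claims2;
   set claims;
   cluster_id = employer_group;
run;

/* OUTDESIGN= receives the design matrix (columns Col1, Col2, ...),     */
/* the response, and the ID variable; OUTPARM= describes each column.   */
proc glmmod data=claims2 outdesign=design outparm=parms noprint;
   class employer_group year;
   model expenditure = employer_group year employer_group*year;
   id cluster_id;
run;

/* Look up which effect and levels each design column represents */
proc print data=parms(obs=20);
run;
```

If the data really have one row per group per year, the full interaction produces as many columns as there are rows, which is the overfitting concern raised earlier in the thread in its most extreme form.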
Hi, thanks a lot!
No, I certainly read your comment about overfitting. I should have shared my thoughts on it. Basically, I am leaving that to the PI who leads the model design.
Also, can PROC GLMMOD take care of repeated measures? Or how can repeated measures be handled later, in the modeling step, using PROC SURVEYREG or other procedures?
PROC SURVEYREG is not well designed for repeated measures, as it assumes that the regression residuals are NID (normally and independently distributed), so any autocorrelation is a pretty substantial violation of its assumptions. To accommodate survey weighting, see Example 44.18, "Weighted Multilevel Model for Survey Data," in the PROC GLIMMIX documentation (SAS/STAT 13.2). The example can be expanded to include a G-side repeated measures structure. All that you need to make it "work" is a lot of RAM that is addressable by SAS.
Steve Denham
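One possible reading of the GLIMMIX suggestion above is a model with a G-side random intercept for each employer group. The sketch below is not the documentation example; the data set `claims` and the variable names are assumptions, and it treats employer group as a random effect rather than the fixed effect requested in the original post.

```sas
/* Hypothetical input CLAIMS: one row per employer group per year */
proc glimmix data=claims;
   class employer_group year;
   /* Year enters as a fixed effect; SOLUTION prints the estimates */
   model expenditure = year / solution dist=normal;
   /* G-side random intercept: rows within an employer group are correlated */
   random intercept / subject=employer_group;
run;
```

Specifying SUBJECT= lets the procedure process the data group by group instead of building thousands of dummy columns, though memory needs may still be substantial with this many groups.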
Hi Steve,
Thanks for the comments and thoughts.
By the way, what does RAM stand for?
Random-access memory (RAM) is the working memory that a computer uses for computations. This is different from storage on disk. Many modern computers have 8 GB or 16 GB of RAM.
I think I got it. Thank you!
Thanks!