jackie_h_2012
Calcite | Level 5

Hi, I am working on a project with over 7000 employer groups over the 2007-2013 time frame, and I need to run a regression model with expenditure as the dependent variable and with employer group, calendar year, and the employer group by year interaction as independent variables, among other independent variables. I need to treat employer group as a fixed effect. Since each employer group has more than one year of data, these are repeated measures, so I need to cluster on employer group to account for the within-group correlation.

I started with PROC MIXED, but it seems SAS is not able to run PROC MIXED with this many dummy variables?

Then I tested SAS's capacity with PROC GLM, and SAS is able to handle this many dummies in PROC GLM! However, PROC GLM does not correct for the correlation among repeated measures (especially since the data are in univariate format and cannot be transformed to multivariate format, because doing so would lose other independent variables, such as year).

Thus, I am back to basic PROC SURVEYREG, which has a CLUSTER statement to control for the correlation among repeated measures. However, since PROC SURVEYREG does not include a CLASS statement, I am facing creating over 7000 dummy variables (already done: employer_group 1 - employer_group 7000) and including them in PROC SURVEYREG. This sounds crazy. I don't know how to easily put 7000 dummy variables into PROC SURVEYREG without actually writing out 7000 variable names. Any ideas? Thanks!

Accepted Solution
PaigeMiller
Diamond | Level 26

PROC GLMMOD will create the dummy variables for a main effect of this categorical variable, and/or create dummy variables for interactions with this categorical variable, if you wish.

However, I am very skeptical of the idea of having a regression with 7000 dummy variables in it, because it seems to me, without seeing the data, that it is doomed to failure. That many dummy variables are going to be fitting random noise as much as they are fitting a real signal. This is called "overfitting" the model.

--
Paige Miller
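
Editor's note: a minimal sketch of the PROC GLMMOD approach described above, under stated assumptions. The data set name `claims` and the variable names `expenditure`, `employer_group`, and `year` are hypothetical stand-ins for the poster's data, and this is not code from the original thread. Note that with roughly 7000 groups and 7 years, the interaction alone could generate up to 49,000 design columns, so the output may be very large.

```sas
/* Sketch only: data set and variable names are assumptions. */
proc glmmod data=claims outdesign=design outparm=parms noprint;
   class employer_group year;
   /* Main effects plus the group-by-year interaction */
   model expenditure = employer_group year employer_group*year;
run;

/* PARMS maps each generated column (Col1, Col2, ...) back to its effect */
proc print data=parms(obs=20);
run;
```

The `design` data set holds the response plus the generated columns, which can then be passed to a procedure that lacks a CLASS statement.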


jackie_h_2012
Calcite | Level 5

I think I figured it out: just writing "employer_group 1 - employer_group 7000" should work :)

Any other comments are welcome. Thanks!
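
Editor's note: the shorthand mentioned above is presumably a SAS numbered range list. A minimal sketch, assuming the dummies are named `employer_group1` through `employer_group7000`, that they live in a hypothetical data set `claims_dummies`, and that a hypothetical variable `employer_id` identifies the employer group for clustering:

```sas
/* Sketch only: data set and variable names are assumptions. */
proc surveyreg data=claims_dummies;
   cluster employer_id;   /* repeated measures within employer group */
   /* A numbered range list expands to all 7000 dummy variables */
   model expenditure = employer_group1-employer_group7000 year;
run;
```

With an intercept in the model, a full set of 7000 dummies is collinear, so one of them is redundant.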

jackie_h_2012
Calcite | Level 5

Thank you so much! I just looked up PROC GLMMOD, and it sounds very helpful. Can you provide some sample code for my case, especially for how to "create dummy variables for interactions with this categorical variable"?

PaigeMiller
Diamond | Level 26

There is an example in the PROC GLMMOD documentation that demonstrates how it works for interactions. I don't think repeated measures fit in this framework unless you re-parameterize the model (I'm not sure whether that's possible, and I can't explain how to do it; maybe someone else can).

So my comment about "doomed to failure" is going to be ignored here? Of course, that's your choice, but it was meant as a "red flag".

--
Paige Miller
jackie_h_2012
Calcite | Level 5

Hi, thanks a lot!

No, I certainly read your comment about the overfitting problem. I should have shared my thoughts on that. Basically, I am leaving it to the PI who leads the model design.

jackie_h_2012
Calcite | Level 5

Also, can PROC GLMMOD itself take care of repeated measures? If not, how can repeated measures be handled later in the modeling steps, using PROC SURVEYREG or other procedures?

SteveDenham
Jade | Level 19

PROC SURVEYREG is not well designed for repeated measures, as it assumes that the regression residuals are NID (normally and independently distributed), so any autocorrelation is a pretty substantial violation of its assumptions. To accommodate survey weighting, see Example 44.18, "Weighted Multilevel Model for Survey Data," in the PROC GLIMMIX documentation (SAS/STAT 13.2). The example can be expanded to include a G-side repeated measures structure. All that you need to make it "work" is a lot of RAM that is addressable through SAS.

Steve Denham
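
Editor's note: a rough sketch of what a G-side structure in PROC GLIMMIX can look like, not the documentation example referenced above. The data set `claims` and all variable names are hypothetical. This version models employer group as a random intercept, which differs from the fixed-effect specification in the original question, and it leaves out survey weights entirely.

```sas
/* Sketch only: data set and variable names are assumptions. */
proc glimmix data=claims;
   class employer_group year;
   model expenditure = year / solution;
   /* G-side random intercept: observations within an employer group share it */
   random intercept / subject=employer_group;
run;
```

Whether a random intercept is an acceptable substitute for 7000 fixed effects is a modeling decision for the analyst.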

jackie_h_2012
Calcite | Level 5

Hi Steve,

Thanks for the comments and thoughts.

By the way, what does RAM stand for?

Rick_SAS
SAS Super FREQ

Random-access memory (RAM) is the memory that a computer uses for computations. This is different from storage on disk. Many modern computers have 8GB or 16GB of RAM.
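
Editor's note: how much of that memory a SAS session may use is governed by the MEMSIZE system option. A small sketch for checking the current setting:

```sas
/* Report the memory limit for the current SAS session */
proc options option=memsize value;
run;
```

MEMSIZE is set when SAS starts (on the command line or in the configuration file), so raising it means restarting the session with a new value.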

jackie_h_2012
Calcite | Level 5

I think I got it. Thank you! :)
