Can I create over 7000 dummies in PROC REG or PROC...

04-16-2015 11:21 AM

Hi, I am working on a project with over 7000 employer groups over the period 2007-2013, and I need to run a regression model with expenditure as the dependent variable and, among other independent variables, employer group, calendar year, and the interaction between employer group and year. I need to treat employer group as a fixed effect. Since each employer group has more than one year of data, this is a repeated-measures design, so I need to account for the within-group correlation by clustering on employer group.

I started with PROC MIXED, but it seems SAS is not able to run PROC MIXED with this many dummy variables?

Then I tested SAS's capacity using PROC GLM, and SAS is able to handle this many dummies in PROC GLM! However, PROC GLM does not correct or control for the correlation of repeated measures (especially since the data is in univariate format and cannot be transformed to multivariate format, because doing so would lose other independent variables, such as year).

So I am back to basic PROC SURVEYREG, which has a CLUSTER statement to control for the correlation of repeated measures. However, since PROC SURVEYREG does not include a CLASS statement, I am faced with creating over 7000 dummy variables (already done: employer_group 1 - employer_group 7000) and including them in PROC SURVEYREG. This sounds crazy. I don't know how to easily list 7000 dummy variables in PROC SURVEYREG without actually typing out 7000 variable names. Any ideas? Thanks!
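On the CLASS statement point: as far as I can tell from the SAS/STAT documentation, PROC SURVEYREG does accept a CLASS statement, so hand-built dummies may not be needed at all. The sketch below assumes that is true for your release (worth checking) and uses made-up names (`claims`, `employer_group`, `year`, `expenditure`). It only shows the syntax; whether clustering on the same variable that enters as a fixed effect gives sensible standard errors is a separate modeling question, and the toy data is far too small to say anything about it.

```sas
/* Toy data; every dataset and variable name here is hypothetical. */
data claims;
   input employer_group year expenditure;
   datalines;
1 2007 100
1 2008 110
2 2007 200
2 2008 190
3 2007 150
3 2008 170
4 2007 120
4 2008 125
;
run;

proc surveyreg data=claims;
   class employer_group;      /* SAS builds the indicator columns internally */
   cluster employer_group;    /* rows from one employer group form one cluster */
   model expenditure = employer_group year / solution;
run;
```

An interaction term such as `employer_group*year` could be added to the MODEL statement in the same way, at the cost of many more columns in the design matrix.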

Accepted Solutions

Solution

04-16-2015 11:40 AM

PROC GLMMOD will create the dummy variables for a main effect of this categorical variable, and/or create dummy variables for interactions with this categorical variable, if you wish.

However, I am very skeptical of the idea of having a regression with 7000 dummy variables in it, because it seems to me, without seeing the data, that it is doomed to failure. That many dummy variables are going to be fitting random noise as much as they are fitting a real signal. This is called "overfitting" the model.
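A minimal sketch of the PROC GLMMOD step, using a toy dataset with invented names (`claims`, `employer_group`, `year`, `expenditure`) rather than the poster's actual data. It writes the design matrix for the two main effects and their interaction to one dataset and a column-to-effect lookup table to another.

```sas
/* Toy data: 3 employer groups observed in 2 years (all names hypothetical). */
data claims;
   input employer_group year expenditure;
   datalines;
1 2007 100
1 2008 110
2 2007 200
2 2008 190
3 2007 150
3 2008 170
;
run;

/* OUTDESIGN= holds the design-matrix columns (Col1, Col2, ...) plus the response.
   OUTPARM= maps each column number to its effect and class levels. */
proc glmmod data=claims outdesign=design outparm=parms noprint;
   class employer_group year;
   model expenditure = employer_group year employer_group*year;
run;

/* Look up which ColN corresponds to which dummy variable. */
proc print data=parms;
run;
```

The columns can then be passed to another procedure with a numbered range list such as `col2-col12`. Note the scale: with 7000 groups and 7 years, the interaction alone would contribute 7000 x 7 = 49,000 columns in this parameterization. Other variables needed downstream (for example a cluster identifier) may have to be merged back onto the design dataset.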

All Replies

04-16-2015 11:27 AM

I think I figured it out: just writing "employer_group 1 - employer_group 7000" should work.

Any other comments are welcome. Thanks!
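For reference, a sketch of the numbered range list being described, assuming the dummies are really named `employer_group1`, `employer_group2`, and so on (no space between the prefix and the number). The dataset, the `employer_id` cluster variable, and the three-group size are invented for illustration; the first dummy is left out here as the reference level.

```sas
/* Toy data with 3 groups; the array builds one 0/1 dummy per group. */
data claims_dummies;
   input employer_id year expenditure;
   array d{3} employer_group1-employer_group3;
   do i = 1 to 3;
      d{i} = (employer_id = i);   /* 1 if the row belongs to group i, else 0 */
   end;
   drop i;
   datalines;
1 2007 100
1 2008 110
2 2007 200
2 2008 190
3 2007 150
3 2008 170
;
run;

proc surveyreg data=claims_dummies;
   cluster employer_id;
   /* The range list expands to employer_group2 and employer_group3. */
   model expenditure = year employer_group2-employer_group3;
run;
```

With the real data the same MODEL statement would read `employer_group2-employer_group7000`.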

04-16-2015 12:08 PM

Thank you so much! I just looked up PROC GLMMOD, and it sounds very helpful. Can you provide some sample code for my case, especially for how to "create dummy variables for interactions with this categorical variable"?

04-16-2015 12:49 PM

There is an example in the PROC GLMMOD documentation that demonstrates how it works for interactions. I don't think repeated measures fits in this framework unless you re-parameterize the model (and I'm not sure if that's possible; I can't explain how to do that, maybe someone else can).

So my comment about "doomed to failure" is going to be ignored here? Of course, that's your choice, but it was meant as a "red flag."

04-16-2015 02:55 PM

Hi, thanks a lot!

No, I certainly read your comment about the overfitting problem. I should have shared my thoughts on that. Basically, I am leaving it to the PI who leads the model design.

04-16-2015 12:25 PM

Also, can PROC GLMMOD take care of repeated measures? If not, how can repeated measures be handled later in the modeling steps, using PROC SURVEYREG or other procedures?

04-16-2015 12:52 PM

PROC SURVEYREG is not well designed for repeated measures, as it assumes that the residuals for the regression are normally and independently distributed (NID), so any autocorrelation is viewed as a pretty substantial violation of assumptions. To accommodate survey weighting, see Example 44.18, Weighted Multilevel Model for Survey Data, in the PROC GLIMMIX documentation (SAS/STAT 13.2). The example can be expanded to include a G-side repeated measures structure. All that you need to make it "work" is a lot of RAM that is addressable through SAS.

Steve Denham
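A rough sketch of what a weighted multilevel call in PROC GLIMMIX might look like. This is not the documentation's code: the dataset `claims` and the weight variables `w1` (observation level) and `w2` (group level) are hypothetical, and the `OBSWEIGHT=` and `WEIGHT=` options are written from memory of the SAS/STAT 13.x documentation, so check them against the example cited above. Note that employer group enters here as a random intercept (G-side), which is a different model from the 7000 fixed-effect dummies discussed earlier in the thread.

```sas
/* Assumes a dataset CLAIMS with variables expenditure, year, employer_group,
   w1 and w2 already exists; all of these names are hypothetical. */
proc glimmix data=claims method=quadrature empirical=classical;
   class employer_group;
   /* OBSWEIGHT= supplies the observation-level weight (assumed option name). */
   model expenditure = year / obsweight=w1 solution;
   /* Random intercept per employer group; WEIGHT= supplies the group-level weight. */
   random intercept / subject=employer_group weight=w2;
run;
```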

04-16-2015 02:57 PM

Hi Steve,

Thanks for the comments and thoughts.

Btw, what does RAM stand for?

04-16-2015 03:03 PM

Random-access memory (RAM) is the working memory that a computer uses for computations. This is different from storage on disk. Many modern computers have 8GB or 16GB of RAM.

04-16-2015 12:40 PM

I think I got it. Thank you!

04-16-2015 04:09 PM

Thanks!