Model not full rank, dummy variables

3 weeks ago

So I created dummy variables to run a regression:

Here's how I did that:

data models (drop=weekday);
    set have;
    if weekday=1 then mon=1; else mon=0;
    if weekday in (2,3,4) then twt=1; else twt=0;
    if weekday=5 then fri=1; else fri=0;
    if weekday=6 then sat=1; else sat=0;
    if weekday=0 then sun=1; else sun=0;
    if Temperature <= 75 then low=1; else low=0;
    if Temperature > 75 and Temperature < 85 then mid=1; else mid=0;
    if Temperature >= 85 then high=1; else high=0;
    if Month=6 then june=1; else june=0;
    if Month=7 then july=1; else july=0;
    if Month=8 then august=1; else august=0;
run;

And here's my reg code:

proc sort data=models;
    by Hour;
run;

proc reg data=models;
    where Month in (6,7,8);
    model Load = june july august Temperature low mid high DewPoint WindSpeed
                 CloudCover SolarRadiation mon twt fri sat sun;
    by Hour;
run;

Now, everything works but I get this NOTE:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

And this thing that I don't know how to even interpret:


Posted in reply to matt23

3 weeks ago - last edited 3 weeks ago

This is what happens with dummy variables. It is expected. You can't estimate the effects of ALL of the dummy variables. Why? Because once you know the values of mon, twt, fri and sat, the value of sun is uniquely determined, so it carries no additional information.
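Written out, and assuming every observation falls into exactly one of the five day groups, the dependency behind the note looks like this (a sketch of the reasoning, not SAS output):

```latex
\[
\text{mon} + \text{twt} + \text{fri} + \text{sat} + \text{sun} = 1
\quad\Longrightarrow\quad
\text{sun} = 1 - \text{mon} - \text{twt} - \text{fri} - \text{sat}
\]
```

The constant 1 is exactly the intercept column, so sun is a linear combination of the intercept and the other four dummies. The same reasoning applies to june + july + august = 1 (given the WHERE Month in (6,7,8) filter) and to low + mid + high = 1.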

Now, you might be better off doing this in PROC GLM, and there are two benefits here:

- You don't have to create the dummy variables yourself. PROC GLM can create them for you behind the scenes and use them properly, without you having to create them first. Temperature would be better handled as a continuous variable. Months can be handled via dummy variables (unless the months span different years, which would be a different problem). Your weekday grouping, with Tuesday, Wednesday and Thursday combined into a single dummy variable, would still have to be created by hand; alternatively, let GLM create a dummy variable for each day and then see whether the effects of Tuesday, Wednesday and Thursday are statistically different.
- GLM will produce LSMEANS, which are the numbers you really, really want to look at instead of the regression coefficients you are looking at now. And when you look at the LSMEANS instead of the regression coefficients, the issue of the model not being full rank and giving 0 coefficients also goes away.

Less work, more interpretable results: sounds like GLM is a win-win!

--

Paige Miller



Posted in reply to PaigeMiller

3 weeks ago

Thank you, this is really helpful. Do you have an example of the GLM code? I need to find regression models based on these variables, and I have never actually used GLM for regression; from what you are saying, it would be the best option.


Posted in reply to matt23

3 weeks ago

Plenty of examples in the PROC GLM documentation

--

Paige Miller
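For readers looking for a starting point, here is a minimal sketch of what such a PROC GLM call might look like. It assumes the original data set have with numeric Month, weekday and Hour variables as in the question; the data set name have_sorted is made up for illustration, and this is only one possible setup.

```sas
/* BY-group processing needs the data sorted by Hour */
proc sort data=have out=have_sorted;   /* have_sorted: hypothetical name */
    by Hour;
run;

proc glm data=have_sorted;
    where Month in (6,7,8);
    /* CLASS tells GLM to build the dummy variables internally */
    class Month weekday;
    model Load = Month weekday Temperature DewPoint WindSpeed
                 CloudCover SolarRadiation / solution;
    /* adjusted means for each level, with pairwise comparisons */
    lsmeans Month weekday / pdiff;
    by Hour;
run;
quit;
```

Here each weekday gets its own level; whether Tuesday, Wednesday and Thursday can be pooled can then be judged from the pairwise comparisons.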



Posted in reply to matt23

3 weeks ago

That message arises when one (or more) of your variables can be determined by the values of one or more other variables.

For example, every time june=1 or july=1, august=0; and every time june=0 and july=0, august=1. So august can be calculated from the values of june and july, which means august isn't actually needed.

You have a similar case with the day of week and likely the temperature dummies.
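If you want to stay with PROC REG, one way to act on this is to leave one dummy out of each group so that it becomes the reference level. The sketch below reuses the models data set and variable names from the question; which dummy to omit (here august, high and sun) is an arbitrary choice made for illustration.

```sas
proc reg data=models;
    where Month in (6,7,8);
    /* august, high and sun are omitted: they act as the reference levels */
    model Load = june july Temperature low mid DewPoint WindSpeed
                 CloudCover SolarRadiation mon twt fri sat;
    by Hour;
run;
quit;
```

Each remaining coefficient is then read as a difference from the omitted level. The note may still appear for BY groups in which some other variable is constant, for example SolarRadiation during night hours.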

I suspect that you might be better off going to Proc GLM and providing CLASS variables with appropriate formats to create the groups based on month, day of week and temperature ranges.

And if your "hour" variable represents time of day, SolarRadiation might have trouble adding anything to the model during hours of darkness, which shift somewhat over the summer.

The SAS regression procedures that allow CLASS variables remove the need for you to create the classes yourself, and they do not over-specify class variables.

Are you really sure that Monday and Friday have different effects than Tue, Wed and Thu?

Anyway formats such as

proc format;
    /* Sunday is coded as 7 here; use 0 instead if your data codes Sunday as 0 */
    value weekday5_
        1     = 'Mon'
        2,3,4 = 'TWT'
        5     = 'Fri'
        6     = 'Sat'
        7     = 'Sun';
    value weekday2_
        1-5 = 'Weekday'
        6-7 = 'Weekend';
run;

could be used to provide different groups when using a class variable and only changing the format during the proc run:

proc format;
    value weekday5_
        1     = 'Mon'
        2,3,4 = 'TWT'
        5     = 'Fri'
        6     = 'Sat'
        7     = 'Sun';
    value weekday2_
        1-5 = 'Weekday'
        6-7 = 'Weekend';
run;

proc glm data=have;
    class weekday;
    format weekday weekday5_.;
    <other code>
run;

proc glm data=have;
    class weekday;
    format weekday weekday2_.;
    <other code>
run;

would run the same code, but with the class variable weekday having either 5 or 2 categories.

The rule for custom format names is that they can't end in a number, hence the _ in the example.
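The same idea can be sketched for the temperature ranges. The format name tempgrp_, the data set have2 and the variable TempGroup are made up for illustration; the cut points are the ones from the low/mid/high dummies in the question. A copy of Temperature is used so that the original variable can stay continuous in the model.

```sas
proc format;
    /* <=75 is Low, strictly between 75 and 85 is Mid, >=85 is High */
    value tempgrp_
        low -  75   = 'Low'
        75 <-< 85   = 'Mid'
        85  -  high = 'High';
run;

data have2;                      /* hypothetical data set name */
    set have;
    TempGroup = Temperature;     /* copy used only for grouping */
run;

proc glm data=have2;
    class TempGroup;
    format TempGroup tempgrp_.;  /* CLASS levels follow the formatted values */
    model Load = TempGroup Temperature / solution;
run;
quit;
```

Changing the cut points then only requires editing the format, not the data step.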


Posted in reply to ballardw

3 weeks ago

Okay, so I ran GLM instead and it is more useful. I used these variables:

data want;
    set models;
    if weekday=1 then week_day=1;
    if weekday in (2,3,4) then week_day=2;
    if weekday=5 then week_day=5;
    if weekday=6 then week_day=6;
    if weekday=0 then week_day=0;
    if Temperature <= 75 then temp=1;
    if Temperature > 75 and Temperature < 85 then temp=2;
    if Temperature >= 85 then temp=3;
run;

used this code:

proc sort data=want;
    by Hour;
run;

proc glm data=want;
    where Month in (6,7,8);
    class Month week_day temp;
    model Load = Month week_day Temperature temp DewPoint WindSpeed
                 CloudCover SolarRadiation / solution;
    by Hour;
run;

And got this for hour=0. Solar radiation is of course 0 at this time. But I also get 0 for Sunday and medium temperature. How can I interpret this if I want a regression equation?

Parameter         Estimate           Standard Error   t Value   Pr > |t|
Intercept         -8746.485035 B     1890.251873       -4.63    <.0001
Month 6            -584.895553 B      198.704451       -2.94    0.0033
Month 7            -165.533354 B      186.205289       -0.89    0.3743
Month 8               0.000000 B               .           .    .
week_day 0        -2786.546010 B      276.760059      -10.07    <.0001
week_day 1        -2041.270203 B      276.661396       -7.38    <.0001
week_day 2         1677.265795 B      225.963521        7.42    <.0001
week_day 5         1781.359362 B      276.279722        6.45    <.0001
week_day 6            0.000000 B               .           .    .
Temperature        1185.308015         43.838666       27.04    <.0001
temp 1            -2049.847453 B      244.709459       -8.38    <.0001
temp 2                0.000000 B               .           .    .
DewPoint             46.832461         36.470151        1.28    0.1995
WindSpeed            32.324264         44.988656        0.72    0.4727
CloudCover          -34.170070          5.695353       -6.00    <.0001
SolarRadiation        0.000000 B               .           .    .


Posted in reply to matt23

3 weeks ago


If you want "interpretation", use the LSMEANS statement in PROC GLM. If you want a regression equation, use the values under Estimate. (But you don't really want your own regression equation, do you? SAS can do the predictions for you, so you don't have to do them manually.)

--

Paige Miller



Posted in reply to PaigeMiller

3 weeks ago

What do you mean by that? Is there a better way of obtaining a regression equation than from 'Parameter Estimate'?


Posted in reply to matt23

3 weeks ago

The only time I can think of where you need to write down the regression equation is to put it in a report.

If you are going to do calculations with the regression equation, do them in SAS. Do not try to take each value under Estimate and plug it in yourself.

--

Paige Miller
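As a rough sketch of letting SAS do the calculations, the OUTPUT statement of PROC GLM can write the predicted values to a data set. The model below is the one from the thread and assumes want is already sorted by Hour; the names pred, Load_hat and Load_resid are made up for illustration.

```sas
proc glm data=want;
    where Month in (6,7,8);
    class Month week_day temp;
    model Load = Month week_day Temperature temp DewPoint WindSpeed
                 CloudCover SolarRadiation;
    /* writes the input rows plus predicted values and residuals */
    output out=pred p=Load_hat r=Load_resid;
    by Hour;
run;
quit;
```

One common approach to scoring new conditions is to append rows with Load left missing before running the procedure: those rows are not used in fitting, but should still receive a value in Load_hat.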



Posted in reply to ballardw

3 weeks ago

Oh I think I get it. So if it is a Sunday it puts 0* for all days of the week right? So it's just intercept + 0*all days + the rest of the equation?
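For reference, here is one way to read the hour=0 table as an equation, with the estimates rounded to three decimals. The bracket notation [condition] is introduced here for illustration and stands for 1 if the condition holds and 0 otherwise.

```latex
\[
\begin{aligned}
\widehat{\text{Load}} ={}& -8746.485 \\
&- 584.896\,[\text{Month}=6] - 165.533\,[\text{Month}=7] + 0\,[\text{Month}=8] \\
&- 2786.546\,[\text{week\_day}=0] - 2041.270\,[\text{week\_day}=1] + 1677.266\,[\text{week\_day}=2] \\
&+ 1781.359\,[\text{week\_day}=5] + 0\,[\text{week\_day}=6] \\
&+ 1185.308\,\text{Temperature} - 2049.847\,[\text{temp}=1] + 0\,[\text{temp}=2] \\
&+ 46.832\,\text{DewPoint} + 32.324\,\text{WindSpeed} - 34.170\,\text{CloudCover} + 0\,\text{SolarRadiation}
\end{aligned}
\]
```

Only the term for the level an observation actually belongs to is switched on; the level with estimate 0 in each CLASS variable is the reference level, whose effect is absorbed into the intercept. Under the coding in the data step above (weekday=0 is Sunday, weekday=6 is Saturday), the reference level is week_day 6, so it is Saturday that reduces to the intercept plus the rest of the equation, while Sunday adds -2786.546. No temp 3 row appears in the table, presumably because there were no hour=0 observations at 85 or above, although the output shown does not confirm that.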