
Model not full rank, dummy variables

Frequent Contributor
Posts: 89
So I created dummy variables to run a regression:

Here's how I did that:

data models (drop=weekday);
set have;
if weekday=1 then mon=1;
else mon=0;
if weekday in (2,3,4) then twt=1;
else twt=0;
if weekday=5 then fri=1;
else fri=0;
if weekday=6 then sat=1;
else sat=0;
if weekday=0 then sun=1;
else sun=0;
if Temperature =< 75 then low=1;
else low=0;
if Temperature > 75 and Temperature < 85 then mid=1;
else mid=0;
if Temperature => 85 then high=1;
else high=0;
if Month=6 then june=1;
else june=0;
if Month=7 then july=1;
else july=0;
if Month=8 then august=1;
else august=0;
run;

 

 

And here's my reg code:

proc sort data=models;
by Hour;
run;

proc reg data=models;
where Month in (6,7,8);
model Load = june july august Temperature low mid high DewPoint 
WindSpeed CloudCover SolarRadiation mon twt fri sat sun;
by Hour;
run;

 

Now, everything works but I get this NOTE:

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

 

And this thing that I don't know how to even interpret:

[Attached image: fff.png]

 


Accepted Solutions
Solution
3 weeks ago
Respected Advisor
Posts: 3,271


This is what happens with dummy variables. It is expected. You can't estimate effects of ALL of the dummy variables. Why? Because if you know the values of mon, twt, fri and sat, then the value of sun is uniquely determined, so it carries no additional information.
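One rough way to see this in the data is to add up each family of dummies row by row. The sketch below assumes the models data set and the dummy names from the question; check, day_sum, temp_sum and month_sum are names made up for the illustration. If every row sums to 1, that family of dummies adds up to the intercept column, which is the linear dependency the NOTE is complaining about.

```sas
/* Sketch only: assumes the MODELS data set built in the question. */
data check;                       /* hypothetical output data set */
   set models;
   where Month in (6,7,8);        /* same subset as the PROC REG step */
   day_sum   = mon + twt + fri + sat + sun;
   temp_sum  = low + mid + high;
   month_sum = june + july + august;
run;

/* One row per distinct value of each sum; a single row with value 1 */
/* means the dummies in that family always add up to the intercept.  */
proc freq data=check;
   tables day_sum temp_sum month_sum / missing;
run;
```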

 

Now, you might be better off doing this in PROC GLM, and there are two benefits here:

  1. You don't have to create the dummy variables yourself. PROC GLM can create the dummy variables for you, behind the scenes, and use them properly without you having to create them first. Temperature would be better off handled as a continuous variable. Months can be handled via dummy variables (unless the months span different years, which would be a different problem). Your weekday coding, with Tuesday, Wednesday and Thursday combined into a single dummy variable, would have to be handled by creating dummy variables; or let GLM create the dummy variables and then check whether the effects of Tuesday, Wednesday and Thursday are statistically different.
  2. GLM will produce LSMEANS, which are the numbers you really, really want to look at instead of the regression coefficients you are looking at now. And by looking at the LSMEANS instead of the regression coefficients, the issue of the not-full-rank model giving 0 coefficients also goes away.

Less work, more interpretable results, sounds like GLM is a win-win!
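A minimal sketch of what that could look like, assuming the original have data set still holds the numeric weekday, Month and Hour variables from the question (have_sorted is a made-up name). It treats Temperature as continuous and leaves every weekday as its own level, so it is an illustration rather than a tuned model.

```sas
/* Sketch only: data set and variable names are taken from the question. */
proc sort data=have out=have_sorted;   /* BY processing needs sorted data */
   by Hour;
run;

proc glm data=have_sorted;
   where Month in (6,7,8);
   by Hour;
   class Month weekday;                /* GLM builds the dummies itself */
   model Load = Month weekday Temperature DewPoint WindSpeed
                CloudCover SolarRadiation / solution;
   lsmeans Month weekday / pdiff;      /* adjusted means and pairwise p-values */
run;
quit;
```

The PDIFF table for weekday is one place to look when deciding whether Tuesday, Wednesday and Thursday can reasonably be pooled.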

--
Paige Miller



All Replies
Frequent Contributor
Posts: 89

Posted in reply to PaigeMiller
Thank you, this is really helpful. Do you have an example of the GLM code? I need to find regression models based on these variables. I have never actually used GLM for regression, but from what you are saying it would be the best option.
Respected Advisor
Posts: 3,271

Plenty of examples in the PROC GLM documentation

 

http://documentation.sas.com/?cdcId=pgmmvacdc&cdcVersion=9.4&docsetId=statug&docsetTarget=statug_glm...

--
Paige Miller
Super User
Posts: 13,941

That message arises when one (or more) of your variables can be determined from the values of one or more other variables.

Example: every time June=1 or July=1, August=0, and every time June=0 and July=0, August=1. August can be calculated from the values of June and July, so August isn't actually needed.

You have a similar case with the day of week and likely the temperature dummies.

I suspect that you might be better off going to Proc GLM and providing CLASS variables with appropriate formats to create the groups based on month, day of week and temperature ranges.

And if your "hour" variable represents time of day, SolarRadiation may cause problems in the hours of darkness, where it is constantly 0 and adds nothing to the model; which hours are dark also changes somewhat over the season.

The SAS regression procedures that allow CLASS variables remove the need for you to create the classes yourself and do not over-specify the class variables.

Are you really sure that Monday and Friday have different effects than Tue, Wed and Thu?

 

Anyway, formats such as

proc format;
value weekday5_
0='Sun'   /* the question's data codes Sunday as 0 */
1='Mon'
2,3,4='TWT'
5='Fri'
6='Sat'
;
value weekday2_
1-5='Weekday'
0,6='Weekend'
;
run;

could be used to provide different groups for a class variable, changing only the format in each proc run:

proc glm data=have;
   class weekday;
   format weekday weekday5_.;
   <other code>
run;

proc glm data=have;
   class weekday;
   format weekday weekday2_.;
   <other code>
run;

These would run the same code, but with the class variable weekday grouped into 5 or 2 categories.

Custom format names can't end in a number, hence the _ in the example.
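The same idea could be extended to the temperature ranges. This is only a sketch: tempgrp_, have2 and TempGroup are made-up names, and the cut points are the 75 and 85 thresholds from the question. Because a variable is either a CLASS variable or a continuous one within a given model, the sketch copies Temperature into a second variable so the original stays available as a continuous predictor.

```sas
/* Sketch only: format name, data set name and TempGroup are hypothetical. */
proc format;
   value tempgrp_
      low  -  75   = 'Low'     /* Temperature <= 75      */
      75  <-< 85   = 'Mid'     /* 75 < Temperature < 85  */
      85  -  high  = 'High'    /* Temperature >= 85      */
   ;
run;

data have2;
   set have;
   TempGroup = Temperature;    /* copy to be grouped by the format */
run;

proc glm data=have2;
   class TempGroup;
   format TempGroup tempgrp_.; /* class levels come from the formatted values */
   model Load = TempGroup / solution;
run;
quit;
```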

Frequent Contributor
Posts: 89

Okay, so I ran GLM instead and it is more useful. I used these variables:

 

data want;
set models;
if weekday=1 then week_day=1;
if weekday in (2,3,4) then week_day=2;
if weekday=5 then week_day=5;
if weekday=6 then week_day=6;
if weekday=0 then week_day=0;
if Temperature =< 75 then temp=1;
if Temperature > 75 and Temperature < 85 then temp=2;
if Temperature => 85 then temp=3;
run;

used this code:

 

proc sort data=want;
by Hour;
run;

proc glm data=want;
where Month in (6,7,8);
class Month week_day temp;
model Load = Month week_day Temperature temp DewPoint WindSpeed CloudCover SolarRadiation / solution;
by Hour;
run;

And got this for hour=0. Solar Radiation of course is 0 at this time. But I get 0 for sunday and medium temperature. How can I interpret this if I want a regression equation?

 

 

Parameter        Estimate          Standard Error   t Value   Pr > |t|
Intercept        -8746.485035 B    1890.251873       -4.63    <.0001
Month 6           -584.895553 B     198.704451       -2.94    0.0033
Month 7           -165.533354 B     186.205289       -0.89    0.3743
Month 8              0.000000 B        .               .      .
week_day 0       -2786.546010 B     276.760059      -10.07    <.0001
week_day 1       -2041.270203 B     276.661396       -7.38    <.0001
week_day 2        1677.265795 B     225.963521        7.42    <.0001
week_day 5        1781.359362 B     276.279722        6.45    <.0001
week_day 6           0.000000 B        .               .      .
Temperature       1185.308015        43.838666       27.04    <.0001
temp 1           -2049.847453 B     244.709459       -8.38    <.0001
temp 2               0.000000 B        .               .      .
DewPoint            46.832461        36.470151        1.28    0.1995
WindSpeed           32.324264        44.988656        0.72    0.4727
CloudCover         -34.170070         5.695353       -6.00    <.0001
SolarRadiation       0.000000 B        .               .      .

Respected Advisor
Posts: 3,271

If you want "interpretation", use the LSMEANS statement in PROC GLM. If you want a regression equation, use the values under Estimate (but you don't really want to work out your own regression equation, do you? SAS can do the predictions for you, so you don't have to do it manually).

--
Paige Miller
Frequent Contributor
Posts: 89

Posted in reply to PaigeMiller
What do you mean by that? Is there a better way of obtaining a regression equation than from 'Parameter Estimate'?
Respected Advisor
Posts: 3,271

The only time I can think of where you need to write down the regression equation is to put it in a report.

 

If you are going to do calculations with the regression equation, do it in SAS. Do not try to take each value under estimate and use them yourself.
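One way to do that, sketched under the assumption that the want data set and model from the post above are used as-is, is the OUTPUT statement of PROC GLM. The names pred, Load_hat and Load_resid are made up for the example.

```sas
/* Sketch only: reuses the WANT data set (already sorted by Hour) and */
/* the model from the earlier post.                                   */
proc glm data=want;
   where Month in (6,7,8);
   by Hour;
   class Month week_day temp;
   model Load = Month week_day Temperature temp DewPoint WindSpeed
                CloudCover SolarRadiation / solution;
   /* Write predicted values and residuals for every row to PRED */
   output out=pred predicted=Load_hat residual=Load_resid;
run;
quit;
```

Rows that have all the predictors but a missing Load are left out of the fit yet should still get a value of Load_hat, which is a common way to score new cases without typing in any coefficients.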

--
Paige Miller
Frequent Contributor
Posts: 89

Oh I think I get it. So if it is a Sunday it puts 0* for all days of the week right? So it's just intercept + 0*all days + the rest of the equation?
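To make the role of the zero rows concrete, here is a small arithmetic sketch using the hour=0 estimates posted above. In that output the rows set to 0 are Month 8, week_day 6 and temp 2 (under the coding in the question, week_day 6 is Saturday and week_day 0 is Sunday), so those levels act as the baseline and every other level is an offset from it. The weather readings are hypothetical values chosen only for illustration, and this is meant to show how the equation is assembled, not to replace scoring in SAS.

```sas
/* Sketch only: coefficients copied from the posted hour=0 output; */
/* the weather readings below are hypothetical.                    */
data _null_;
   DewPoint = 60; WindSpeed = 5; CloudCover = 30;

   /* Part of the equation shared by every case (SolarRadiation is 0) */
   base = -8746.485035 + 46.832461*DewPoint + 32.324264*WindSpeed
          - 34.170070*CloudCover;

   /* Case 1: August, week_day 6, Temperature 80 (temp level 2): */
   /* all three class terms are the baseline levels, so they add 0 */
   load_case1 = base + 1185.308015*80 + 0 + 0 + 0;

   /* Case 2: June, week_day 0, Temperature 70 (temp level 1): */
   /* add the Month 6, week_day 0 and temp 1 estimates          */
   load_case2 = base + 1185.308015*70
                + (-584.895553) + (-2786.546010) + (-2049.847453);

   put load_case1= load_case2=;   /* results are written to the log */
run;
```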