Solved: Re: Dummy Variables in linear regression

Shivi82 · Posted 11-26-2015 05:02 AM

Dear Team,

I am running a linear regression model for one of my clients. I am predicting the price of a commodity (diamonds) based on some independent variables (IVs).

Carat (weight) is one of the significant variables. Along with weight we have some categorical variables, such as

Cut: this has 4 categories

Color: this has 5 categories

Now I need to build a model using the interaction between carat and cut as IVs. The model ran successfully, and some of the variables turned out to be insignificant based on their significance values.

While I was discussing this with a co-worker, he suggested that I cannot use interaction variables (i.e. categories of a variable) in the model, for example: the interaction between carat and the best cut <V GOOD>, the interaction between carat and the medium cut <GOOD>, and carat on its own, all at the same time.

I am sure this is off topic as far as SAS is concerned, but I need some help and light on this from the experts in the forum.

The output looks like this:

Model

Parameter    Estimate    Standard Error    Significance Value (P)
Intercept    7.346       0.12              .000
Carat        1.392       .009              .000
V GOOD       -.211       .016              .000
GOOD         -.134       .013              .000

Please advise.

PaigeMiller · Posted 11-26-2015 06:57 AM

I don't have any particular problem with doing this.

In effect, the interactions represent different slopes. For example, if you include the interaction between carat and best cut, this represents a different slope for the case where you use the best cut (and if you say the interaction is statistically significant, then I would say it belongs in the model).
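To make the "different slopes" reading concrete, here is a small worked equation. It is only a sketch: it assumes a single 0/1 dummy called VGOOD for the best cut, and the symbols \beta_0 to \beta_3 are illustrative, not taken from the posted output.

```latex
\begin{align*}
\text{price} &= \beta_0 + \beta_1\,\text{carat} + \beta_2\,\text{VGOOD}
              + \beta_3\,(\text{carat}\times\text{VGOOD}) + \varepsilon \\
\text{VGOOD}=0{:}\quad \text{price} &= \beta_0 + \beta_1\,\text{carat} + \varepsilon \\
\text{VGOOD}=1{:}\quad \text{price} &= (\beta_0+\beta_2) + (\beta_1+\beta_3)\,\text{carat} + \varepsilon
\end{align*}
```

Under these assumptions the main effect \beta_2 shifts the intercept for the best cut, while the interaction coefficient \beta_3 is the difference in the carat slope between the best cut and the reference cut.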

--
Paige Miller

pearsoninst · Posted 11-26-2015 07:12 AM

This can be done very easily in SAS. Which procedure are you using for the regression analysis?
Try PROC GLM or PROC GENMOD. Check VIF, MC, etc. I am sure that you have checked all this ...

Shivi82 · Posted 11-26-2015 07:17 AM

Hi,

I have checked the VIF and tolerance for all these variables (IVs), and they are all within the threshold.

Shivi82 · Posted 11-26-2015 07:13 AM

Thanks for your expert advice. Another question: when we call a variable significant, if the IV impacts the outcome either positively or negatively, can we still consider it significant and keep it in the model?

The reason I ask is that in a few places I have read that a significant IV is only one that impacts the model positively. But I feel that, along with positive effects, negative ones are even more important for a model developer when presenting the results.

pearsoninst · Posted 11-26-2015 07:27 AM

Negative and positive values are part of any regression analysis. Think of it like this: the more time you spend on Facebook, the lower your marks will be, so time spent on Facebook has an impact, and it shows up as a negative value in the regression model. If you remove it, you will never have a correct regression model. What matters is the p-value. I wrote an article on the p-value, very basic; if you like, please go through it ...

http://tasbasbangalore.blogspot.com/2015/11/what-is-p-value-alien-exist-example.html

pearsoninst · Posted 11-26-2015 07:35 AM

Thanks and all the best, Shivi82. Together we can make this world more meaningful with analytics. Keep doing the good work 🙂

Shivi82 · Posted 11-26-2015 07:35 AM

Super. I think this is one of the best examples or use cases you could have given me....

PaigeMiller · Posted 11-26-2015 08:47 AM

Shivi82 wrote: "while discussing with one of the fellow co-worker he has suggested that in model i cannot use interaction variables"

Shivi82 wrote: "The reason why i ask this is because on few instances i have read that a significance IV would only be the one which impacts the model positively."

I don't know who is giving you this advice, but you might want to stop listening to this person.

--
Paige Miller

mgilbert · Posted 11-30-2015 01:58 PM

Don't mean to bump back to the top but a couple of thoughts:

As part of EDA you might want to look at these groupings. This could also be useful if you are hand-coding a decision tree to impute missing values (using mean in this instance). Note this only looks at the mean value of Carat across different cuts. One way could be:

* Mean carat by cut (NMISS counts missing CARAT values per cut);
proc means data = dataset nmiss mean;
  class CUT;
  var CARAT;
run;

I echo what others say on interaction variables. Spend some time thinking about design. When I read "dummy variables", I take that to mean the levels of the categorical variable Cut (e.g. Cut_1, Cut_2, ... Cut_n) coded as 0's and 1's. But that doesn't quite get you to your interaction. If you multiply each Cut_n by Carat and store the results in a single new variable, then there is no weighting for better or worse cuts, as they're all being multiplied by 1.

Also, if multicollinearity is an issue, it will show up in your VIFs. Remember to leave at least one degree of freedom in the categorical variables. For instance, if there are five levels of cut, your model should use no more than four dummies (SAS will give a warning in the log and in the output from PROC REG). You could also explore binning of cuts.
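A minimal sketch of one way to set this up, assuming the same data set name (dataset) and variables CUT and CARAT as in the PROC MEANS step, plus a numeric response that I am calling PRICE (an assumed name). Instead of hand-coding Cut_1 ... Cut_n, it lets a CLASS statement build the dummies and the carat-by-cut interaction, which gives each cut its own carat slope:

```sas
/* Sketch only: PRICE is an assumed name for the response variable. */
/* CLASS creates the dummy variables for CUT internally.            */
/* CARAT*CUT adds a separate CARAT slope for each level of CUT.     */
/* SOLUTION prints the parameter estimates for every level.         */
proc glm data = dataset;
  class CUT;
  model PRICE = CARAT CUT CARAT*CUT / solution;
run;
quit;
```

With the SOLUTION option, PROC GLM reports the last level of CUT with an estimate of zero, so that level acts as the reference and the remaining estimates are differences from it. That is the same idea as using four dummies for five levels.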

Lastly, how you weigh p-values and measures of goodness of fit (GOF) really, really, really depends on the purpose of your model.

A model built for statistical inference focuses on hypothesis testing of the sample. Here you want the "best" model on the sample that fits within these parameters (remember that accuracy is typically defined by your customer). You might give significant weight to p-values over other metrics.

A model built for predictive accuracy should predict well both in-sample (data you have) and out-of-sample (data you do not have). Personally, when building models for predictive accuracy, I care less about p-values and more about building and deploying a highly accurate model (this is also why I care less about whether the intercept "makes sense"). Overfitting is a concern, since the model will be deployed out-of-sample.

Cross-validation is extremely useful. For instance, take the in-sample data and do a random uniform split of it into 70% and 30% (use a seed value for repeatability). Then train (build) your model on the 70% and test (deploy) it on the 30%. Now judge the accuracy: if the accuracy metrics are similar on both parts, then your model should not be overfit (I say "should" because sampling error might be present, but if your model has sufficient power you should be good). If you use MSE to evaluate the model, note that the MSE value in the ANOVA table from PROC REG is different from the MSE you'd use in this instance. And on that note, SAS uses the Sawa criterion for BIC in PROC REG, not the Schwarz criterion (different formulas).
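The 70/30 split could be sketched roughly as follows. This is an illustration under assumptions, not a recipe: PRICE is an assumed name for the response, the seed value is arbitrary, and PRICE_TRAIN, PRED and SQ_ERR are helper names made up for the example. The holdout rows get a missing response so that they are left out of the fit but still receive predicted values.

```sas
/* Random 70/30 split; SEED makes the split repeatable.           */
/* OUTALL keeps every row and adds a 0/1 flag named Selected.     */
proc surveyselect data = dataset out = split
                  method = srs samprate = 0.7 seed = 12345 outall;
run;

/* Copy the response only for the 70% training rows.              */
/* PRICE_TRAIN stays missing for the 30% holdout rows.            */
data split2;
  set split;
  if Selected = 1 then PRICE_TRAIN = PRICE;
run;

/* Fit on the training rows; rows with a missing response are     */
/* not used in the fit but still get a predicted value in OUT=.   */
proc glm data = split2 noprint;
  class CUT;
  model PRICE_TRAIN = CARAT CUT CARAT*CUT;
  output out = scored predicted = PRED;
run;
quit;

/* Squared prediction error against the actual response.          */
data scored;
  set scored;
  SQ_ERR = (PRICE - PRED)**2;
run;

/* Mean squared error by group: Selected=1 is train, 0 is test.   */
proc means data = scored n mean;
  class Selected;
  var SQ_ERR;
run;
```

The two means are plain averages of the squared errors (divided by the number of rows), which is why they will not match the degrees-of-freedom-adjusted MSE in the ANOVA table. If the train and test values are close, that is the "similar metrics" check described above.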

Michael