Dear Team,
I am running a linear regression model for one of my clients. I am predicting the price of a commodity (diamonds) from several independent variables (IVs).
Carat (weight) is one of the significant variables. Along with weight, we have some categorical variables, such as:
Cut: 4 categories
Color: 5 categories
Now I need to build a model that uses the interaction between carat and cut as IVs. The model ran successfully, and some of the variables turned out to be insignificant based on their significance values.
While discussing this with a co-worker, he suggested that I cannot use interaction variables (i.e. carat crossed with the categories of a variable) together with carat itself in the same model, for example: the interaction between carat and the best cut <V GOOD>, the interaction between carat and the medium cut <GOOD>, and carat on its own, all at the same time.
I am sure this is off topic as far as SAS is concerned, but I would appreciate some help and light on this from the experts in the forum.
The output looks like this:
Model
Parameter   Estimate   Standard Error   Significance Value (P)
Intercept   7.346      0.12             .000
Carat       1.392      .009             .000
V GOOD      -.211      .016             .000
GOOD        -.134      .013             .000
Please advise.
I don't have any particular problem with doing this.
In effect, the interactions represent different slopes. For example, if you include the interaction between carat and best cut, this represents a different slope for the case where you use the best cut (and if you say the interaction is statistically significant, then I would say it belongs in the model).
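To make the "different slopes" point concrete, here is a sketch of the model written out. The symbols are assumptions introduced only for illustration: D_VG and D_G are 0/1 indicators for the V GOOD and GOOD cuts, the remaining cut levels would add terms of the same form, and one level is left out as the reference.

```latex
\begin{aligned}
\text{price} &= \beta_0 + \beta_1\,\text{carat}
   + \beta_2 D_{VG} + \beta_3 D_{G}
   + \beta_4\,(\text{carat} \times D_{VG})
   + \beta_5\,(\text{carat} \times D_{G}) + \varepsilon \\[4pt]
\text{slope of carat, reference cut} &= \beta_1 \\
\text{slope of carat, V GOOD} &= \beta_1 + \beta_4 \\
\text{slope of carat, GOOD} &= \beta_1 + \beta_5
\end{aligned}
```

Read this way, carat on its own and the carat-by-cut interactions are not competing for the same job: the carat term gives the slope for the reference cut, and each interaction coefficient is the change in that slope for its cut.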
Hi,
I have checked the VIF and tolerance for all these variables (IVs), and they are all within the threshold.
Thanks for your expert advice. Another question: when we call a variable significant, does the direction of its effect matter? If an IV affects the outcome either positively or negatively, can we still treat it as significant and keep it in the model?
The reason I ask is that in a few places I have read that a significant IV is only one that impacts the model positively. But I feel that, alongside positive effects, negative effects are even more important for a model developer when presenting the results.
Negative and positive coefficients are part of any regression analysis. Think of it like this: the more time you spend on Facebook, the lower your marks will be, so time spent on Facebook has an impact, and it shows up as a negative coefficient in the regression model. If you remove it, you will never have a correct regression model. What matters is the p-value. I wrote an article on p-values, very basic; if you like, please go through it:
http://tasbasbangalore.blogspot.com/2015/11/what-is-p-value-alien-exist-example.html
Super. I think this is one of the best examples or use cases you could have given me.
"While discussing with a co-worker, he suggested that I cannot use interaction variables in the model"
"The reason I ask is that in a few places I have read that a significant IV is only one that impacts the model positively."
I don't know who is giving you this advice, but you might want to stop listening to this person.
Don't mean to bump back to the top but a couple of thoughts:
As part of EDA you might want to look at these groupings. This could also be useful if you are hand-coding a decision tree to impute missing values (using mean in this instance). Note this only looks at the mean value of Carat across different cuts. One way could be:
* Carat by Cut;
proc means data = dataset nmiss mean;
  class CUT;
  var CARAT;
run;
I echo what others say on interaction variables. Spend some time thinking about design: when I read dummy variables, I take that as levels of the categorical variable Cut (e.g. Cut_1, Cut_2, ... Cut_n) populated with 0s and 1s. But that doesn't quite get you to your interaction. If you then multiply each Cut_n by Carat and store the results in a single new variable, there is no weighting for better or worse cuts, as they are all being multiplied by 1. Also, if multicollinearity is an issue, it will show up in your VIFs. Remember to leave at least one degree of freedom in the categorical variables. For instance, if there are five levels of cut, your model should use no more than four dummies (otherwise SAS will give a warning in the log and output from PROC REG). You could also explore binning of cuts.
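As a rough sketch of the design described above, the dummies and the carat-by-cut products could be built by hand and passed to PROC REG. Everything named here is an assumption for illustration: the dataset names diamonds_demo and diamonds_model, the variables price, carat and cut, the three cut levels, and the data values themselves, which are made up and are not the poster's data.

```sas
/* Hypothetical toy data: values are invented purely for illustration. */
data diamonds_demo;
  input cut :$8. carat price;
  datalines;
FAIR 0.40 900
FAIR 0.60 1500
FAIR 0.90 2600
FAIR 1.10 3400
GOOD 0.40 1100
GOOD 0.60 1900
GOOD 0.90 3300
GOOD 1.10 4300
VGOOD 0.40 1300
VGOOD 0.60 2300
VGOOD 0.90 4000
VGOOD 1.10 5200
;
run;

/* One 0/1 dummy per cut level except the reference level (FAIR here), */
/* and one separate carat*dummy product per dummy for the interaction. */
data diamonds_model;
  set diamonds_demo;
  cut_good    = (cut = 'GOOD');    /* 1 if GOOD, else 0  */
  cut_vgood   = (cut = 'VGOOD');   /* 1 if VGOOD, else 0 */
  carat_good  = carat * cut_good;  /* slope shift for GOOD  */
  carat_vgood = carat * cut_vgood; /* slope shift for VGOOD */
run;

/* Carat alone plus the dummies and interactions; VIF and TOL request */
/* the collinearity diagnostics mentioned in this thread.             */
proc reg data=diamonds_model;
  model price = carat cut_good cut_vgood carat_good carat_vgood / vif tol;
run;
quit;
```

Each interaction gets its own variable rather than being collapsed into one, so every non-reference cut can have its own slope. Another option, if hand-coding is not required, is to let SAS build the design with a CLASS statement, for example PROC GLM with `class cut; model price = carat cut carat*cut / solution;`.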
Lastly, how you consider p-values and measures of GOF really, really, really depends on the purpose of your model.
Michael