Shivi82
Quartz | Level 8

Dear Team,

 

I am running a linear regression model for one of my clients. I am predicting the price of a commodity (diamonds) based on some independent variables (IVs).

Carat (weight) is one of the significant variables. Along with weight, we have some categorical variables, such as:

Cut: this has 4 categories

Color: this has 5 categories

 

Now I need to build a model using the interaction between carat and cut as IVs. The model ran successfully, and some of the variables turned out to be insignificant based on their significance values.

 

While discussing this with a co-worker, he suggested that I cannot use interaction variables (i.e. categories of a variable) in the model, for example the interaction between carat and the best cut <V GOOD>, the interaction between carat and the medium cut <GOOD>, and carat on its own, all at the same time.

I am sure this is off topic as far as SAS is concerned, but I need some help and light on this from the top experts in the forum.

 

The output looks like this:

Model        Parameter Estimate   Standard Error   Significance Value (P)
Intercept     7.346                0.12             .000
Carat         1.392                .009             .000
V GOOD        -.211                .016             .000
GOOD          -.134                .013             .000

 

Please advise.

9 REPLIES
PaigeMiller
Diamond | Level 26

I don't have any particular problem with doing this.

 

In effect, the interactions represent different slopes. For example, if you include the interaction between carat and best cut, this represents a different slope for the case where you use the best cut (and if you say the interaction is statistically significant, then I would say it belongs in the model).
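
The "different slopes" reading can be sketched in PROC GLM. This is only an illustration under assumed names: the data set DIAMONDS and the variables PRICE, CARAT and CUT are hypothetical, and the sketch also includes the CUT main effect, which the original model may not have had.

```sas
/* Hypothetical names: data set DIAMONDS, response PRICE, predictors CARAT and CUT */
proc glm data=diamonds;
   class cut;                                   /* treat CUT as categorical */
   model price = carat cut carat*cut / solution; /* SOLUTION prints the parameter estimates */
run;
quit;
```

With this parameterization, the CARAT estimate is the slope for the reference level of CUT (the last level in sort order), and each CARAT*CUT estimate is the difference between that level's slope and the reference slope.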

--
Paige Miller
pearsoninst
Pyrite | Level 9
This can be done very easily in SAS. Which procedure are you using for the regression analysis?
Try PROC GLM or PROC GENMOD. Check VIF, multicollinearity, etc. I am sure that you have checked all this ...
Shivi82
Quartz | Level 8

Hi,

I have checked the VIF and tolerance for all these variables (IVs), and they are all within the threshold.

Shivi82
Quartz | Level 8

Thanks for your expert advice. Another question: when we refer to a variable as significant, if the IV impacts the outcome either positively or negatively, can we still consider it significant and keep it in the model?

The reason I ask is that in a few places I have read that a significant IV is only one that impacts the model positively. But I feel that, along with positive effects, negative effects are even more important for a model developer when presenting the results.

pearsoninst
Pyrite | Level 9

Negative and positive values are part of any regression analysis. Think of it this way: the more time you spend on Facebook, the lower your marks will be, so time spent on Facebook has an impact, and it shows up as a negative coefficient in the regression model. If you remove it, you will never have a correct regression model. What matters is the p-value. I wrote an article on p-values, very basic; if you like, please go through it ...

 

http://tasbasbangalore.blogspot.com/2015/11/what-is-p-value-alien-exist-example.html

 

pearsoninst
Pyrite | Level 9
Thanks and All the Best Shivi82. We all together can make this world more meaningful with Analytics. Keep Doing the good work 🙂
Shivi82
Quartz | Level 8

Super. I think this is one of the best examples or use cases you have made for me. 🙂

PaigeMiller
Diamond | Level 26

while discussing with one of the fellow co-worker he has suggested that in model i cannot use interaction variables

The reason why i ask this is because on few instances i have read that a significance IV would only be the one which impacts the model positively.

 

I don't know who is giving you this advice, but you might want to stop listening to this person.

--
Paige Miller
mgilbert
Obsidian | Level 7

Don't mean to bump back to the top but a couple of thoughts:

 

As part of EDA you might want to look at these groupings. This could also be useful if you are hand-coding a decision tree to impute missing values (using mean in this instance). Note this only looks at the mean value of Carat across different cuts. One way could be:

 

* Carat by Cut: missing count and mean of CARAT within each level of CUT;
proc means data = dataset nmiss mean;
class CUT;
var CARAT;
run;

I echo what others say on interaction variables. Spend some time thinking about design. When I read "dummy variables", I take that as levels of the categorical variable Cut (e.g. Cut_1, Cut_2, ... Cut_n) populated with 0's and 1's. But that doesn't quite get you to your interaction. If you then multiply each Cut_n by Carat and store the results in a single new variable, there is no weighting for better or worse cuts, as they're all being multiplied by 1. Also, if multicollinearity is an issue, it will show up in your VIFs. Remember to leave at least one degree of freedom in the categorical variables. For instance, if there are five levels of cut, your model should use no more than four dummies (SAS will give a warning in the log and output from PROC REG). You could also explore binning of cuts.
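
One way to set this up by hand for PROC REG might look like the sketch below. It keeps a separate product variable for each dummy instead of one combined variable. Everything named here is an assumption for illustration: the data set DIAMONDS, the variables PRICE, CARAT and CUT, and the level names 'FAIR' and 'POOR' (only V GOOD and GOOD appear in the output above).

```sas
/* Hypothetical: DIAMONDS has numeric PRICE and CARAT and a character CUT
   with four levels: 'V GOOD', 'GOOD', 'FAIR', 'POOR' (last two are made up). */
data diamonds_dummies;
   set diamonds;
   /* Three dummies for four levels; 'POOR' is the omitted reference level.
      A SAS comparison evaluates to 1 when true and 0 when false. */
   cut_vgood = (cut = 'V GOOD');
   cut_good  = (cut = 'GOOD');
   cut_fair  = (cut = 'FAIR');
   /* One product term per dummy, so each cut gets its own slope adjustment */
   carat_vgood = carat * cut_vgood;
   carat_good  = carat * cut_good;
   carat_fair  = carat * cut_fair;
run;

proc reg data=diamonds_dummies;
   /* VIF and TOL request variance inflation factors and tolerances */
   model price = carat cut_vgood cut_good cut_fair
                 carat_vgood carat_good carat_fair / vif tol;
run;
quit;
```

Here CARAT alone carries the slope for the reference cut, and each product term shifts that slope for its own cut.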

 

Lastly, how you consider p-values and measures of GOF really, really, really depends on the purpose of your model.

  • A model built for statistical inference focuses on hypothesis testing of the sample. Here you want the "best" (remember accuracy is typically defined by your customer) model on the sample that fits within these parameters. You might give significant weight to p-values over other metrics.
  • A model built for predictive accuracy should predict well both in-sample (data you have) and out-of-sample (data you do not have). Personally, when building models for predictive accuracy, I care less about p-values and more about building and deploying a highly accurate model (this is also why I care less about whether the intercept "makes sense"). Overfitting your model is a concern since it will be deployed out-of-sample. Cross-validation is extremely useful. For instance, take the in-sample data and do a random uniform split of it into 70% and 30% (use a seed value for repeatability). Then train (build) your model on the 70% and test (deploy) your model on the 30%. Now judge the accuracy: if the accuracy metrics are similar on both parts, then your model should not be overfit (I say "should" because sampling error might be present, but if your model has sufficient power you should be good). If you use MSE to evaluate the model, the MSE value from PROC REG in the ANOVA table is different from the MSE you'd use in this instance. And on that note, SAS uses the Sawa criterion in PROC REG for BIC, not the Schwarz criterion (different formulas).
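
The 70/30 split described above could be sketched roughly as follows. The names DIAMONDS, PRICE, CARAT and CUT are assumptions, the seed value is arbitrary, and the model statement is just a placeholder for whatever model you are evaluating.

```sas
/* Hypothetical names throughout; SEED= makes the split repeatable */
proc surveyselect data=diamonds out=diamonds_split
                  method=srs samprate=0.70 seed=12345 outall;
run;
/* OUTALL keeps every row and adds Selected = 1 (training) or 0 (test) */

data diamonds_model;
   set diamonds_split;
   /* Leave the response missing for test rows: they are not used in the fit,
      but they still receive predicted values */
   if selected = 1 then price_fit = price;
   else price_fit = .;
run;

proc glm data=diamonds_model noprint;
   class cut;
   model price_fit = carat cut carat*cut;
   output out=scored predicted=pred;
run;
quit;

data scored_err;
   set scored;
   sq_err = (price - pred)**2;   /* squared prediction error per row */
run;

/* Mean squared error separately for test (Selected=0) and training (Selected=1) */
proc means data=scored_err mean;
   class selected;
   var sq_err;
run;
```

The means computed here divide the sum of squared errors by the number of rows, whereas the ANOVA-table MSE divides by the error degrees of freedom, which is one reason the two numbers differ.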

Michael
