07-23-2014 11:48 AM
Hi, I'm trying to verify the significance of some variables (about 25 of them) against a continuous DV using PROC GLM. So I have put the variables in a CLASS statement and included them in the MODEL statement. I want to see if each one individually is significant against the DV. I thought the Type III tests in PROC GLM provided that, but if I run them each separately, I get different p-values than if I run them all together with Type III as explained above. Could someone suggest a way to check each individually (I suppose using a macro) or an explanation of the Type III tests? Much appreciated. Thanks
07-23-2014 12:40 PM
I don't think there is a "unique" way to determine the significance of a factor, in the presence of other factors (unless the factors are orthogonal to one another). This really is an impossible task, because the factors are correlated with one another.
If X1 increases, and X2 also increases, is the change in Y due to X1 or X2 or both or neither? There is no logical way to uniquely determine the answer empirically, so statistics reflects that confusion.
If you are trying to predict using some statistical model, I recommend PROC PLS and keep all the variables in the model. If you are trying to understand the individual importance of each individual predictor, it can't be done.
By the way, Type III finds the significance of a variable after the effects of all other variables in the model have been removed ... which doesn't sound like what you are trying to do.
07-23-2014 12:49 PM
Thanks Paige. From the explanation you gave about Type III, I'm trying to figure out whether it may be exactly what I need. Let me give you my situation. I have a decision tree with 2 variables. Call them Product -> Loan Amount. I need to see if any of the 25 other (or new) variables can add any more to the R-squared of the decision tree. So check Product -> Loan Amount -> New_Var1, and then again Product -> Loan Amount -> New_Var2, and so on. The problem is that I don't want to run this test in my decision trees 25 times; I just want to try it with the most significant variables, those with a p-value < 0.05. Does this make sense, and does it make sense to use Type III in this case?
07-23-2014 03:32 PM
Type III is not what you want. Type III analyses take a model and determine the significance of each variable, having already adjusted for all the other variables in the model. Conceptually, it compares the full model to a model without that one particular variable. In your case, if you specified your model as (product loanAmount new1 new2), the Type III p-value for new1 conceptually compares (product loanAmount new2) to (product loanAmount new2 new1). Likewise, the Type III p-value for new2 compares (product loanAmount new1) to (product loanAmount new1 new2). That is, it asks: if the variable in question had been the very last one added to the model, what effect did it have?
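For concreteness, here is a minimal sketch of what that Type III analysis looks like in PROC GLM (the dataset name mydata and the variable names are illustrative, not from the original post):

```sas
/* Sketch: each Type III p-value is adjusted for every
   other effect listed in the MODEL statement. */
proc glm data=mydata;
   class product new1 new2;
   model dv = product loanAmount new1 new2;
run;
quit;
```

The Type III SS table this produces answers the "last variable in" question for every effect simultaneously, which is why its p-values differ from fitting each variable on its own.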
If I am interpreting you correctly, you actually want to compare (product loanAmount) to (product loanAmount new1) to get a p-value for new1, and then compare (product loanAmount) to (product loanAmount new2) to get a p-value for new2. If p-value1 < p-value2, you add new1 to the model and not new2. That is, after product and loan amount, you want to see which is the next best variable to add to your model. This is what's called forward selection while forcing product and loanAmount to be in every model.
I don't believe PROC GLM can do this automatically. PROC REG can, with the SELECTION=FORWARD and INCLUDE=2 options on the MODEL statement, if you list product and loanAmount first (INCLUDE=2 forces the first two listed variables into all models). GLM does not have a selection procedure. There is a separate procedure, GLMSELECT, that does this; however, honestly, it might be more trouble to learn than it's worth if you are not already familiar with selection criteria (my apologies if you are).
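A minimal sketch of that PROC REG approach (dataset and variable names are assumptions; note that PROC REG has no CLASS statement, so a categorical variable like product would need to be dummy-coded first):

```sas
/* Sketch: forward selection, with the first two listed
   variables (product, loanAmount) forced into every model. */
proc reg data=mydata;
   model dv = product loanAmount new1-new25
         / selection=forward include=2 slentry=0.05;
run;
quit;
```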
Your best bet is to just write a macro running a model for (product loanAmount newX) for each of your 25 newX variables. Gather the p-values from Type III for each run, and then you can sort that dataset to pick the smallest. Then you will have 25 p-values that give you the effect of each new variable on your initial (product loanAmount) model individually (aka, without including the other new variables).
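A rough sketch of such a macro, assuming the 25 candidates are named new1-new25 (all names here are illustrative). It captures each run's Type III results via ODS OUTPUT (PROC GLM writes them to the ModelANOVA table) and stacks the p-values for sorting:

```sas
%macro screen_vars;
   %do i = 1 %to 25;
      ods output ModelANOVA=t3_&i;   /* Type I and Type III rows */
      proc glm data=mydata;
         class product new&i;
         model dv = product loanAmount new&i;
      run;
      quit;
   %end;
   /* keep only each candidate's Type III row */
   data all_pvals;
      set t3_1-t3_25;
      where HypothesisType = 3 and Source =: 'new';
   run;
   proc sort data=all_pvals;
      by ProbF;
   run;
%mend screen_vars;
%screen_vars
```

After the sort, the first observation of all_pvals is the candidate with the smallest p-value when added to the (product loanAmount) base model.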
Hope this helps!
07-23-2014 04:40 PM
While I agree with Kastchei on most of his explanation, I cannot agree with the idea of doing SELECTION=FORWARD. This has known drawbacks, and as I said above, the effect of these 25 variables cannot be uniquely determined ... yes you can run the algorithm just the way Kastchei described, and you will get numerical answers, but I think that's a very misleading result.
For example, if X1 and X2 are highly correlated to each other, and also correlated to your response, how can you tell if it's X1 or X2 that is the cause of the correlation with the response? You can't do this empirically. But the algorithm will pick one or the other (and other algorithms may make different choices). Whatever method you come up with here involving some form of ordinary least squares regression ignores the correlation between the independent variables.
While I haven't thought about how to use PLS in your decision tree, the idea behind PLS is still appropriate: it accounts for the correlation between your independent variables, it performs prediction using all 25 independent variables, and studies have shown that it gives better predictions than ordinary least squares (better predictions meaning lower mean square error of the predicted values) when the independent variables are correlated. And it sounds like you want good predictions ...
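A minimal PROC PLS sketch along those lines (dataset and variable names are illustrative; CV=ONE requests leave-one-out cross-validation, and CVTEST chooses the number of extracted factors by testing the cross-validated residuals):

```sas
proc pls data=mydata cv=one cvtest;
   class product;                       /* categorical predictor */
   model dv = product loanAmount new1-new25;
run;
```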
07-24-2014 12:54 AM
All good points. I think the question really comes down to what podarum's goal is. podarum, do you have data for all 25 variables and will continue to have all this data going forward (e.g. mandatory fields on a form or application)? Is your goal to accurately predict the response variable? Or are you looking to identify a smaller number of variables that matter the most, perhaps because you cannot always get all 25 variables, and want to know which are the most important to not skip?
07-24-2014 12:24 PM
Sadly, this goal is not attainable in this situation, in my opinion. The method you choose may produce one result, but it will be misleading (as explained above) and other methods can produce other results.
07-24-2014 02:28 PM
So, this is a prediction scenario rather than a description scenario. MSPE is what you need to minimize. What sort of validation dataset do you have? If you have none, then PROC PLS with its many cross-validation options is the way to go. If you do have a validation dataset, then GLMSELECT or QUANTSELECT, using LASSO, LAR, or similar methods, would be infinitely better than one-by-one, backward, forward, or stepwise selection.
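For example, a sketch of GLMSELECT with LASSO scored against a holdout set (the dataset names train/valid and the variable names are assumptions):

```sas
/* Sketch: LASSO path, with the final model chosen by
   minimum validation-set error rather than p-values. */
proc glmselect data=train valdata=valid;
   class product;
   model dv = product loanAmount new1-new25
         / selection=lasso(choose=validate stop=none);
run;
```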
07-24-2014 03:46 PM
My impression, from his description, is that while s/he is concerned with prediction, s/he doesn't want to combine variables into factors (perhaps for ease of use). This would rule out PCR or PLS. I think GLMSELECT with LASSO would be better. However, in all those cases, podarum, be aware that you will essentially lose the ability to interpret the model. Especially with PCR and PLS, interpretation is essentially impossible. If you have no need for interpretation (e.g. an increase in 1 of this variable yields a change of X in our DV), then these are fine. If you require (really any) level of interpretation though, you are basically stuck with model selection based off of knowledge of the data and various selection criteria (p-values, AIC, BIC, PRESS, etc).
Would you guys agree with that?
07-25-2014 09:41 AM
I have seen many PLS models that are very interpretable. I have also seen many PLS models that are not easily interpretable.
My impression, from his description, is that while s/he is concerned with prediction, s/he doesn't want to combine variables into factors (perhaps for ease of use).
Yes, that certainly is the impression I get, which is why we are trying to convince podarum that the task he/she is describing is difficult (if not impossible). We are essentially trying to get podarum to change his/her mind about what is needed here and what is possible from a statistical point of view.
07-25-2014 03:35 PM
I understand; I'm just not sure we have enough information on his project yet to tell him what his objective should be. You may be right that the best option is to change methodology and go a hardcore prediction route. But there could be external, business, non-statistical reasons why this is not the case, even if it's as simple/silly as his supervisor telling him to use OLS. The ball's really in podarum's court as to whether s/he wishes to explain the project in more depth.