05-27-2015 02:11 PM
I am using the CLASS statement with the PARAM=REF option in PROC LOGISTIC to include categorical variables. My question: when I run PROC LOGISTIC with SELECTION=STEPWISE, it does not check the significance of the individual levels (groups) of a categorical variable. It only checks whether the categorical variable as a whole is significant. In other words, even if one category of a categorical variable is insignificant, it does not exclude it. But if I create dummy variables with a reference category manually, it removes the dummy variable that is insignificant. I understand that it then treats each dummy as a separate variable. But isn't it statistically incorrect? Any workaround?
05-28-2015 06:55 AM
Everything in STEPWISE is, at best, highly questionable and, at worst, outright wrong.
However, here, you have shown that you can make stepwise behave in either of two ways: Treat the categorical variable as a single variable or treat each level as a single variable. I recommend the first. Perhaps you want to exclude any variable that is insignificant at any level? I think that would be an (added) mistake, but you could certainly do it by hand (e.g. by removing that variable from the list).
05-28-2015 02:16 PM
Do you recommend backward or forward selection? I don't want to remove a variable. I want to remove that level from a variable. It may overestimate or underestimate my predicted probability.
05-28-2015 02:27 PM
It appears that you want to collapse levels within a categorical variable, but I may be misinterpreting.
Why would you want to do that? Please explain.
05-28-2015 02:53 PM
It's a marketing (churn) model. Most of the significant variables are continuous; only two character variables appear, and they make sense in terms of both business logic and statistical significance. To check their significance, I put them in the CLASS statement with the PARAM=REF option and ran stepwise. Some levels come out insignificant at the 5% level, and even at the 10% level, so I thought it better to ignore these categories (levels). But SAS does not check individual levels while selecting variables via STEPWISE or any other selection technique. I guess it's better to ignore these levels and make the model more parsimonious, with fewer degrees of freedom.
05-28-2015 06:13 PM
No, I don't think it's better to delete some levels of a categorical variable. That winds up being an uninterpretable model.
E.g. suppose the variable is race and you have White, Black, Asian, Other. Suppose only White and Asian are significant. Then if you delete the other levels, you are comparing Whites to Asians without controlling for Black or Other. Keep all levels.
Parsimony is often the enemy.
05-29-2015 09:24 AM
Good point, Peter, about parsimony.
Proceeding from the maxim "All models are wrong, but some models are useful," using parsimony as the only tool to select a model is, at least to me, akin to choosing the nearest rock as a weapon when a dragon attacks, while ten feet farther away is a sword designed especially for dragon slaying. It may take a little more work to get to the sword, and it takes some skill to use it, but one is far likelier to be happy with the results.
05-28-2015 03:30 PM
It sounds like you are looking for the SPLIT option, which is supported in the CLASS statement of HPLOGISTIC and HPGENSELECT.
I think most (all?) of the HP regression procedures that support variable selection also support the SPLIT option.
05-28-2015 05:21 PM
Thanks! No, I don't want any interaction between variables. It's a dummy variable with K-1 coding: one value is set as the reference category, and then the significance of each category of the variable is evaluated.
10-21-2015 05:31 PM
It is the same thing. Since your reference level is not part of your regression (it is dropped), removing an insignificant dummy is essentially the same as combining that level with your reference. So you implicitly end up with a new, merged reference level.
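To make that equivalence concrete, here is a minimal sketch in plain Python (not SAS, and not anyone's actual model). The variable with levels A through D, the choice of A as reference, the sample values, and the helper `reference_dummies` are all made up for illustration. It builds K-1 reference-coded dummies, drops the dummy for one level, and compares the result with the dummies obtained after recoding that level into the reference.

```python
# Hypothetical categorical variable with four levels; "A" is the reference.
levels = ["A", "B", "C", "D"]
reference = "A"
values = ["A", "B", "C", "D", "B", "C", "A", "D"]  # made-up observations


def reference_dummies(values, levels, reference):
    """K-1 reference coding: one 0/1 column per non-reference level."""
    cols = [lv for lv in levels if lv != reference]
    rows = [[1 if v == lv else 0 for lv in cols] for v in values]
    return cols, rows


cols, full = reference_dummies(values, levels, reference)

# Approach 1: keep the full coding but drop the dummy column for level "C"
# (what happens when a manually created dummy is removed by selection).
keep = [i for i, c in enumerate(cols) if c != "C"]
dropped = [[row[i] for i in keep] for row in full]

# Approach 2: recode "C" as the reference level first, then build dummies.
merged_values = [reference if v == "C" else v for v in values]
merged_cols, merged = reference_dummies(merged_values, ["A", "B", "D"], reference)

# The two design matrices are identical, so the fitted models are identical.
print(dropped == merged)
```

In both cases the remaining coefficients for B and D are contrasts against the pooled group "A or C", not against A alone, which is why dropping a level changes the interpretation of the levels that stay.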