08-08-2016 02:44 PM
I've been using PROC QUANTSELECT to perform variable selection. After running it on an initial set of variables and arriving at a proposed model, I decided to add a handful of additional variables to the mix and run QUANTSELECT again (the only change was the addition of 9 extra potential explanatory variables). This second run resulted in a different model, but none of the new variables appeared in it. At first I assumed that at least one of the new variables must have entered the model at some point in the selection process and then exited later; however, it turns out that NONE of the new variables entered the model at any step. The new model kept three of the five variables from the initial model, dropped the other two, and added five more that weren't chosen originally but were in the original set of potential effects. The second model ended up being slightly better than the first, but for the life of me I can't figure out why it's not identical to the first model, considering none of the new variables ever entered it.
How is it possible that the unselected variables altered the result of PROC QUANTSELECT without ever appearing in the selection process? Or is the outcome of PROC QUANTSELECT somewhat stochastic by nature, and perhaps it would have generated an alternate model even if I hadn't entered any additional potential effects? If anyone could shed some light on the situation, I would be much obliged.
08-08-2016 04:03 PM
Use one of the DETAILS= options to see the details of candidates and selection steps.
The selection process is not stochastic, but when you add more variables you enlarge the parameter space in which you are performing an optimization. It is possible that the optimization algorithm takes a different path through parameter space.
Perhaps correlation in your variables is responsible for what you are seeing. These models are not unique. Suppose that the true model is
Y = X1 + 2*X2 + noise
Suppose that your data contains explanatory variables X1, X2, and X3, where
X3 = X1 + X2 + noise
is correlated with X1 and X2. If you run the model with (X1-X3), you might select X2 (with estimate 2) during the first step and X1 (with estimate 1) during the second step.
If you now add some additional variables that have no predictive power (essentially noise) and rerun the model with (X1-X10), you might end up selecting X3 with a coefficient of 2, followed by X1 with coefficient -1. You've just obtained seemingly different models. However, due to correlation in the variables, the models are essentially the same. Obviously this example is contrived, but I think something similar could happen with real data.
From what you've said, the two variables that were dropped and the five variables that were added probably have similar predictive power. Although correlation is one explanation, there might be others.
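The contrived example above can be tried directly. Substituting X2 = X3 - X1 (ignoring the noise in X3) into Y = X1 + 2*X2 gives Y = 2*X3 - X1, which is why the two fits describe nearly the same model. The following is a minimal sketch under stated assumptions, not a reproduction of anyone's data: the data set name work.sim, the seed, the sample size of 1000, and the 0.1 noise scale on X3 are all arbitrary choices made for illustration.

```sas
/* Simulate the contrived example: Y depends on X1 and X2,      */
/* X3 is nearly X1 + X2, and X4-X10 are pure noise.             */
/* All names and constants here are illustrative assumptions.   */
data work.sim;
   call streaminit(2016);            /* arbitrary seed for reproducibility */
   array noise[7] x4-x10;            /* candidates with no predictive power */
   do i = 1 to 1000;
      x1 = rand("Normal");
      x2 = rand("Normal");
      x3 = x1 + x2 + 0.1*rand("Normal");
      do j = 1 to 7;
         noise[j] = rand("Normal");
      end;
      y = x1 + 2*x2 + rand("Normal");
      output;
   end;
   drop i j;
run;

/* Median regression on two different pairs of regressors */
proc quantreg data=work.sim;
   model y = x1 x2 / quantile=0.5;
run;

proc quantreg data=work.sim;
   model y = x1 x3 / quantile=0.5;
run;

/* Adaptive LASSO on the small and the enlarged candidate sets */
proc quantselect data=work.sim;
   model y = x1-x3 / quantile=0.5 selection=lasso(adaptive);
run;

proc quantselect data=work.sim;
   model y = x1-x10 / quantile=0.5 selection=lasso(adaptive);
run;
```

The two PROC QUANTREG fits should give estimates close to (1, 2) for X1 and X2, and close to (-1, 2) for X1 and X3, with similar fit quality. Whether the two PROC QUANTSELECT runs actually select different effects depends on the simulated draw, so treat them as something to experiment with rather than a guaranteed demonstration.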
08-08-2016 05:05 PM
Your example certainly does appear to parallel my situation, and I agree that correlation is likely a significant contributor to what I am observing. In looking at the DETAILS= option, I see that one of the specifications allows you to see "entry and removal statistics for the top five candidates for inclusion or exclusion at each step". Does this imply that the new variables I introduced were significant enough to make it into the top five candidates somewhere along the way, just not enough to actually be selected, and that this altered the path of the selection process? While that would shed some light, I'm still not sure I understand the situation completely, as I'd still think that the "best" candidate at a particular step would remain the same regardless of whatever comparable (but inferior) effects made it into the top five list.
To extend this to your mathematical example, my question would be this: why would the alternate model include X3 instead of X2 if X4-X10 never made it into the model anywhere along the way? I understand the divergent models you specified are equivalent in terms of explanatory power, but I'm still having trouble grasping the mechanics of how the selection path diverged if the model selection summary output in SAS doesn't show X4, X5, or any of the other illusory effects at any of the steps prior to the selection of X3 in the model.
Thank you for your patience.
08-08-2016 08:00 PM
No, I was not suggesting that one of the new variables made it into the top five, although it's possible. Run the model to see.
I'm afraid I can't be more specific because I honestly don't know what is happening with your data. You haven't provided any details. What quantiles are you using? How many effects? Are there interaction or higher-order effects? What selection method are you using? Only by looking at your data and the details of the methods can you definitively discover what is happening. I'm just saying that what you see seems plausible.
08-08-2016 10:41 PM
Here's a snippet of code to distill the essence of what I am attempting to model:
PROC QUANTSELECT data=WORK.DATA seed=1;
   model y = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 /
         quantile = .50
         selection = LASSO(select=AICC stop=AICC sh=5 choose=SBC ADAPTIVE)
         details = SUMMARY;
run;
So basically, it's median regression with adaptive lasso. The actual number of effects is greater, but I think I can convey the same situation without enumerating the full list.
If I run the script as is, the following effects are selected: x1, x3, x4, x7, x8. If I introduce additional effects x11-x19 and run the script, the following effects are selected: x1, x3, x7, x2, x5, x6, x9, x10. None of the effects x11-x19 appear in the selection summary output at any step of the selection process. I will try your suggestion of expanding the output details and report back.
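One mechanism worth checking, given the ADAPTIVE option: adaptive LASSO, as it is usually defined, weights each candidate's penalty using coefficient estimates from a preliminary fit on the full candidate set. If PROC QUANTSELECT forms its default weights this way (an assumption to confirm in the documentation for the ADAPTIVE suboption), then adding x11-x19 would change the preliminary estimates for x1-x10, and therefore their penalty weights, without any of the new effects ever entering the model. A rough way to see how much the full-model estimates move is to fit both candidate sets with PROC QUANTREG and compare the estimates for x1-x10. This sketch reuses WORK.DATA and the variable names from the snippet above.

```sas
/* Show only the parameter estimates table from each fit */
ods select ParameterEstimates;
proc quantreg data=WORK.DATA;
   model y = x1-x10 / quantile=0.50;   /* original candidate set */
run;

ods select ParameterEstimates;
proc quantreg data=WORK.DATA;
   model y = x1-x19 / quantile=0.50;   /* enlarged candidate set */
run;
```

If the estimates for x1-x10 differ noticeably between the two fits, different adaptive weights would be a plausible reason for the selection path diverging. If they are nearly identical, the cause probably lies elsewhere.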
08-09-2016 01:59 PM
Just to update, I tried running the same process using DETAILS=ALL and then again with DETAILS=STEPS (with all options enabled), and the only additional information provided was the penalized and unpenalized parameter estimates at each step, along with a lambda range. I was unable to get it to produce a list of alternative candidate effects at each step.