05-21-2015 12:20 PM
I am fitting spline regression models using GLMSELECT. The code looks like this:
PROC GLMSELECT DATA=DATESET;
EFFECT SPL=SPLINE(X/SPLIT DEGREE=3);
MODEL Y = X SPL/SELECTION=STEPWISE(CHOOSE=CV SELECT=SBC) CVMETHOD=INDEX(GROUP);
BY W Z;
RUN;
Therefore, I will get around 200 models through the BY statement.
Now I want to summarize my results in a table that contains information from each model. So my questions are:
1. I would like a table containing all the models and the variables used in each model. Can I produce a table with the model names as columns and the variables as rows?
2. How can I output the RMSE, coefficient of variation, and other statistics in a table for all models?
3. By the way, what is the difference between CHOOSE= and SELECT=? In PROC REG, the SELECTION= option alone is enough to select the best model. How does CHOOSE= work in the GLMSELECT procedure? From my reading, it seems that SELECT= produces a sequence of models rather than a single one. How should I understand this? If I want the best predictive models, should I set both CHOOSE=CV and SELECT=CV?
Thank you very much.
05-22-2015 12:01 PM
You will have to look at the ODS OUTPUT tables (there is a list in the User's Guide right before the examples). Each table in the output corresponds to a different data set that can be saved with the ODS OUTPUT statement. You may need to merge two or more of these in a DATA step after fitting the models. I don't have a specific example here because I haven't used GLMSELECT a lot. With a BY statement, each output data set will contain the stacked results for all groups, with the BY variables identifying each group.
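A minimal sketch of that approach, not taken from the original reply: it saves two ODS tables and reshapes them with PROC TRANSPOSE. The data set and variable names come from the original post; the output data set names (sorted, pe, fit, vars_by_model, fit_by_model) are made up for illustration, and the column names Parameter, Estimate, Label1 and nValue1 are assumptions that should be checked with ODS TRACE ON or PROC CONTENTS before relying on them.

```sas
/* BY-group processing expects the input to be sorted by the BY variables. */
proc sort data=DATESET out=sorted;
  by W Z;
run;

/* Save the parameter estimates and fit statistics for every BY group. */
ods output ParameterEstimates=pe FitStatistics=fit;
proc glmselect data=sorted;
  effect SPL = spline(X / split degree=3);
  model Y = X SPL / selection=stepwise(choose=cv select=sbc)
                    cvmethod=index(GROUP);
  by W Z;
run;

/* Question 1: one row per parameter, one column per model (W-Z pair).   */
/* A missing value means the parameter was not selected in that model.   */
/* Assumes the columns are called Parameter and Estimate.                */
proc sort data=pe;
  by Parameter W Z;
run;

proc transpose data=pe out=vars_by_model(drop=_name_)
               prefix=m_ delimiter=_;
  by Parameter;
  id W Z;
  var Estimate;
run;

/* Question 2: one row per model, one column per fit statistic.          */
/* Assumes the columns are called Label1 (name) and nValue1 (value).     */
proc transpose data=fit out=fit_by_model(drop=_name_);
  by W Z;
  id Label1;
  var nValue1;
run;
```

With around 200 BY groups the first table becomes very wide, so it may be easier to read in the long layout that ODS OUTPUT produces directly.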
But I have some serious concerns about the model you are fitting. It looks like you are trying to decide whether to use a linear model in X, a cubic spline, or both for each group (essentially trying to see if there is curvilinearity?). With the SPLIT option, each term (basis function) of the spline is treated as a separate effect that can enter or leave the model on its own. With splines, the individual terms don't mean much (they are arbitrary and exist only to produce predictions that do mean something). Ending up with the second of four (or however many) terms in a spline is pretty meaningless. I would take out SPLIT; that way, the spline will be considered as a single term in the model.
I also don't see the point of treating X as a factor (CLASS statement). This can create very strange spline basis functions (with many, many terms). I didn't even think this would work until I tried it just now on some data. I think the results are meaningless. Plus, with X as a factor in your example, X itself will capture any nonlinearity, leaving nothing for the spline function to represent.
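As a sketch of the suggestion to drop SPLIT, the call from the original post would look like this; everything except the removed option is unchanged, and the names are the poster's own.

```sas
/* Without SPLIT, the spline columns enter or leave the model together */
/* as the single effect SPL.                                            */
proc glmselect data=DATESET;
  effect SPL = spline(X / degree=3);
  model Y = X SPL / selection=stepwise(choose=cv select=sbc)
                    cvmethod=index(GROUP);
  by W Z;
run;
```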
05-22-2015 01:04 PM
Thank you very much. I was also wondering whether SPLIT is really needed in my case. Do you mean I can use SEPARATE if I have more than one variable in the spline, as in SPLINE(X Y Z)?
By the way, the code I posted is not exactly the same as what I use. I did not use X as both an indicator (CLASS) variable and a numeric variable at the same time.
05-22-2015 11:25 AM
One more question:
If I use KNOTMETHOD=MULTISCALE together with SPLIT in my code, I get spl_Temperature_S0:5 in my model. I understand it to mean the 5th basis function at scale 0, but I am confused. According to the online material on the SAS website, there should be 2^i basis functions at scale i. Therefore, at scale 0 there should be only 1 basis function. What does the 5 in spl_Temperature_S0:5 mean?