Hi,

How do I perform model selection for logistic regression based on validation data?

Here is the problem: we need to build a model for prediction purposes, that is, a model that generalizes well to unseen data. The conventional way of obtaining a model with the best out-of-sample performance is to fit (or train) several models on training data, see which works best on validation data (not used in fitting), and further evaluate generalizability on a third unseen data set called the test set, right?

The response is binary, and I have dichotomous and continuous explanatory variables. Now, proc logistic allows me to perform automated variable selection with the forward, backward, or stepwise selection methods, but all of them select variables based on a training set criterion (as the training set is the input dataset).

I thought I had found a solution in proc hpgenselect, which allows partitioning the input data into train, validation and test sets. There are two problems:

One, the selection on the validation set can only be done with MAE as the criterion, while I would like to select based on accuracy.

Two, our data has 1.6 million rows, meaning that hpgenselect takes forever and reserves a high percentage of our CPU resources (other scheduled runs are queued, and this is unwanted). Furthermore, there seem to be no statements or options that would save the model, such as outmodel in proc logistic. If I ever manage to fit the hpgenselect model, I don't want to do it again very soon!

I don't have a strong background in SAS programming and so far have spent hours reading the documentation. We are using SAS Enterprise Guide 8.3 with SAS 9.4 (M6). We don't have access to any of the procedures that use the CAS engine.

How do I solve this? Right now I have just selected with proc logistic stepwise and manually checked validation set performance.

This is my first post, please let me know if I missed something crucial.

Termpu
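
For reference, the current workaround (stepwise selection on the training set, then a manual accuracy check on the validation set) could be sketched roughly as below. This is only an illustration under assumptions: the dataset names work.train and work.valid, the response y coded 0/1, and the predictors x1-x5 are all hypothetical placeholders, not the real project names, and the entry/stay levels are arbitrary.

```sas
/* Hypothetical names: work.train, work.valid, response y (0/1), predictors x1-x5 */
proc logistic data=work.train outmodel=work.step_model;
   /* Stepwise selection uses the training data only */
   model y(event='1') = x1-x5 / selection=stepwise slentry=0.05 slstay=0.05;
   /* Score the held-out validation set with the selected model */
   score data=work.valid out=work.valid_scored;
run;

/* In the SCORE output, F_y is the observed level and I_y the predicted level */
proc sql;
   select mean(F_y = I_y) as accuracy format=6.4
   from work.valid_scored;
quit;
```

Repeating this with selection=forward and selection=backward gives one validation accuracy per candidate model to compare by hand. The outmodel= dataset can later be reused for scoring through inmodel= without refitting.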