Termpu
Fluorite | Level 6

Hi,

How do I perform model selection for logistic regression based on validation data?

Here is the problem: we need to build a model for prediction purposes, that is, a model that generalizes well to unseen data. The conventional way of obtaining a model with the best out-of-sample performance is to fit (or train) several models on training data, see which works best on validation data (not used in fitting), and then evaluate generalizability on a third, unseen data set called the test set, right? The response is binary, and I have dichotomous and continuous explanatory variables.

Now, proc logistic allows me to perform automated variable selection with forward, backward, or stepwise selection methods, all of which select variables based on a training-set criterion (as the training set is the input dataset). I thought that I had found a solution in proc hpgenselect, which allows partitioning the input data into train, validation and test sets. There are two problems. One, the validation can be done on the validation set only with MAE as a criterion, while I would like to select based on accuracy. Two, our data has 1.6 million rows, meaning that hpgenselect takes forever and reserves a high percentage of our CPU resources (other scheduled runs are queued, and this is unwanted). Furthermore, there seem to be no statements or options that would save the model, such as outmodel in proc logistic. If I ever manage to fit the hpgenselect model, I don't want to do it again very soon!

I don't have a strong background in SAS programming and so far have spent hours reading the documentation. We are using SAS Enterprise Guide 8.3 with SAS 9.4 (M6). We don't have access to any of the procedures that use the CAS engine.

How do I solve this? Right now I have just selected variables with stepwise selection in proc logistic and manually checked performance on the validation set.

This is my first post, please let me know if I missed something crucial.

Termpu

1 REPLY 1
SteveDenham
Jade | Level 19

Some of the issues with HPGENSELECT might be solvable if you can run in distributed mode. However, I haven't the experience to offer help in that area.

The one thing I could suggest for HPGENSELECT is to use the ODS ParameterEstimates dataset.  This is pretty handy if you use method=lasso, but also can be adapted for method=stepwise.  It should list all of the selected parameter estimates included in the final model, and can be used for scoring new datasets.
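To make the scoring idea concrete, here is a minimal sketch of the arithmetic, written in Python rather than SAS purely for illustration. It assumes a binary response with a logit link and numeric effects only. The estimate values and the variable names `x1` and `x2` are hypothetical; in practice they would come from the saved ParameterEstimates table, and any classification effects would need the same coding that was used in the fit.

```python
import math

def score_logistic(estimates, row):
    """Return the predicted event probability for one observation.

    estimates: mapping of effect name -> coefficient, with the intercept
               stored under the key "Intercept" (an assumed convention).
    row:       mapping of effect name -> numeric value for the observation.
    """
    # Linear predictor: intercept plus sum of coefficient * value.
    eta = estimates.get("Intercept", 0.0)
    for name, beta in estimates.items():
        if name != "Intercept":
            eta += beta * row[name]
    # Inverse logit link turns the linear predictor into a probability.
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical estimates, standing in for a saved parameter-estimate table.
estimates = {"Intercept": -1.0, "x1": 0.5, "x2": 2.0}

# Hypothetical new observations to score.
new_rows = [{"x1": 2.0, "x2": 0.0}, {"x1": 0.0, "x2": 1.0}]
probs = [score_logistic(estimates, r) for r in new_rows]
```

Because the selected model is fully described by its coefficients, saving that table once means the expensive fit does not have to be repeated just to score another data set.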

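One more thing: it appears that average squared error (ASE) is the only measure available in HPGENSELECT for assessing the final model on the validation and test data. For a binary response it is the mean squared difference between the observed 0/1 outcome and the predicted probability, so it is a fair measure of predictive accuracy, although it is not the same thing as classification accuracy and it is sensitive to "outlier" points. The information criteria are essentially worthless for comparing the partitions, as they are calculated on different data.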
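As a small illustration of how ASE and classification accuracy can rank candidate models differently on the same validation data, here is a sketch in Python (again, not SAS). The validation labels, the predicted probabilities, the model names, and the 0.5 cutoff are all made up for the example.

```python
def average_squared_error(y, p):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def accuracy(y, p, cutoff=0.5):
    """Share of observations classified correctly at the given cutoff."""
    return sum((pi >= cutoff) == bool(yi) for yi, pi in zip(y, p)) / len(y)

# Hypothetical validation outcomes (1 = event, 0 = non-event).
y_valid = [1, 0, 1, 0]

# Hypothetical predicted probabilities from two candidate models.
candidates = {
    "model_a": [0.55, 0.45, 0.55, 0.45],  # always right, but never confident
    "model_b": [0.99, 0.01, 0.45, 0.01],  # very confident, one misclassification
}

# Pick the best candidate under each criterion.
best_by_accuracy = max(candidates, key=lambda m: accuracy(y_valid, candidates[m]))
best_by_ase = min(candidates, key=lambda m: average_squared_error(y_valid, candidates[m]))
```

With these made-up numbers, accuracy at a 0.5 cutoff favors `model_a` while ASE favors `model_b`. If accuracy is the criterion you care about, one option is to score each candidate on the validation set yourself and compare on that measure.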

SteveDenham
