This is the second tip in my series on penalized regression. It focuses on two approaches for tuning the popular penalized regression method LASSO: cross validation and validation data. You can find my first post here: Tip: Top five reasons for using penalized regression for modeling your high-dimensional data.
For linear models that have a continuous target, the solution to a LASSO problem is the set of regression coefficients that minimizes the residual sum of squared errors with respect to a constraint on the size of the regression coefficients:
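In conventional notation, this constrained problem is usually written as shown below. The symbols are assumptions, since the post does not define them: y_i is the target for observation i, x_ij is the value of predictor j, β_0 is the intercept, β_j is the coefficient of predictor j, and t is the tuning parameter.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  \quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```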
In LASSO selection, the sizes of the regression coefficients are controlled by the tuning parameter t. As t decreases, the regression coefficients continuously shrink. If the shrinkage is big enough, some regression coefficients are set exactly to zero, enabling LASSO to perform variable selection and estimation at the same time.
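The shrinkage-to-zero behavior is easiest to see in a simplified setting. The following is a minimal sketch, assuming orthonormal predictors and the equivalent penalized form of the problem, in which a penalty lambda (an assumed symbol, not used in this post) plays the role of t: a larger lambda corresponds to a smaller t. Under those assumptions, each LASSO coefficient is the least squares coefficient soft-thresholded by lambda. The coefficient values are hypothetical.

```python
def soft_threshold(b, lam):
    """Shrink b toward zero by lam; return exactly 0.0 when |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def lasso_path(coefs, penalties):
    """Soft-threshold every coefficient at each penalty value."""
    return [[soft_threshold(b, lam) for b in coefs] for lam in penalties]


# Hypothetical least squares coefficients for three orthonormal predictors.
ols_coefs = [3.0, -1.5, 0.4]
penalties = [0.0, 0.5, 1.0, 2.0]

path = lasso_path(ols_coefs, penalties)
for lam, row in zip(penalties, path):
    # t is the size of the coefficients: the sum of their absolute values.
    t = sum(abs(b) for b in row)
    print(f"lambda={lam:.1f}  t={t:.1f}  coefficients={row}")
```

As the penalty grows, the smallest coefficient is the first to become exactly zero, which is the variable selection effect described above.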
The LARS node on the Model tab of SAS Enterprise Miner uses the LARS (least angle regression) algorithm to efficiently produce the entire sequence of candidate models for LASSO: M(t0), M(t1), ..., M(tk).
Each candidate model in the sequence corresponds to the solution of the LASSO problem for a specific tuning parameter value. For example, you can think of M(t0) as the simplest candidate model, with no predictors, when t = t0, and you can think of M(tk) as the most complex candidate model (the maximum likelihood model with all the predictors) when t = tk.
For the LARS node to choose one model as the preferred model, you need to specify a tuning method. Tuning methods estimate the prediction error of each candidate model and choose the model that yields the minimum estimated prediction error. For example, if you use validation data as the tuning method, the average squared error on the validation data is calculated for each candidate model, and the model that yields the smallest error is selected. Similarly, if you use cross validation as the tuning method, the cross validation error of each candidate model is calculated, and the model that yields the minimum error is selected.
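The validation data rule can be sketched in a few lines of plain Python. This is only an illustration of the selection logic, not what the LARS node runs internally; the candidate models, the validation rows, and all function names are hypothetical.

```python
def predict(model, x):
    """model = (intercept, [coefficients]); x = list of input values."""
    intercept, coefs = model
    return intercept + sum(c * v for c, v in zip(coefs, x))


def average_squared_error(model, rows):
    """rows = list of (inputs, target) pairs."""
    return sum((y - predict(model, x)) ** 2 for x, y in rows) / len(rows)


def select_by_validation(candidates, validation_rows):
    """Return the index of the candidate with the smallest validation error."""
    errors = [average_squared_error(m, validation_rows) for m in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return best, errors


# Hypothetical candidate sequence, from intercept-only to two predictors.
candidates = [
    (2.0, [0.0, 0.0]),
    (1.0, [0.5, 0.0]),
    (0.0, [1.0, 0.0]),
    (0.0, [1.0, 0.8]),
]

# Hypothetical validation data: (inputs, target).
validation_rows = [
    ([1.0, 0.0], 1.1),
    ([2.0, 1.0], 1.9),
    ([3.0, -1.0], 3.2),
    ([4.0, 0.0], 3.9),
]

best_index, errors = select_by_validation(candidates, validation_rows)
print("validation errors:", errors)
print("selected candidate:", best_index, candidates[best_index])
```

In this toy data the most complex candidate is not the winner: adding the second predictor raises the validation error, so the third candidate is selected.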
Prostate Data Example
Let’s see how tuning LASSO selection via validation data and via cross validation works when you analyze the commonly used Prostate data set. This set contains observations from 97 prostate cancer patients (Stamey et al. 1989), where the target is the logarithm of the level of prostate-specific antigen (lpsa) and the input variables are eight clinical predictors: logarithm of the cancer volume (lcavol), logarithm of prostate weight (lweight), age (age), logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), logarithm of capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores of 4 or 5 (pgg45). All the variables (including the target) in the Prostate data are interval variables.
Tuning via validation data requires you to set aside an additional validation data set, whereas tuning via cross validation does not. Cross validation uses part of the training data to fit the model and a different part to estimate the prediction error. The following SAS Enterprise Miner diagram uses the Data Partition node available on the Sample tab to partition the data set as described below:
Two LASSO models are fit by using the LARS node on the Model tab. The model selection criterion for the LASSO_Validation node is specified as validation, and the model selection criterion for the LASSO_CV node is specified as cross validation. The resulting models are compared by using the Model Comparison node.
The “Fit Statistics” table of the Model Comparison node shows that the LASSO model tuned via cross validation generalizes to new data better than the LASSO model tuned via validation data.
If you want to reproduce the results of this analysis, you can find the prostate data here.
Although best practices often recommend using validation data for tuning, this example shows that tuning via cross validation can actually provide a better predictive model. This can be explained by the loss of training data: setting aside an extra 30% of the data as a validation set causes the training size to drop from 77 to 48, a substantial reduction when you have eight input variables. In situations where you do not have sufficient data to create both a sizable training set and a validation set that represents the predictive population well, cross validation is a powerful tuning method.
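The arithmetic behind those training sizes can be checked with a short sketch. The training fractions used here (80% without a validation set, 50% with one) are assumptions chosen because they reproduce the sizes quoted above; the exact partition settings and the rounding rule that the Data Partition node applies are not confirmed here.

```python
n_total = 97    # observations in the Prostate data
n_inputs = 8    # clinical predictors


def partition_size(n, fraction):
    """Assumed rule: truncate n * fraction to a whole number of rows."""
    return int(n * fraction)


# Assumed fractions: 80% training when no validation set is held out,
# 50% training once an extra 30% is set aside for validation.
train_cv = partition_size(n_total, 0.80)
train_validation = partition_size(n_total, 0.50)

print("training rows, cross validation:", train_cv)
print("training rows, validation data: ", train_validation)
print("rows per input, cross validation:", train_cv / n_inputs)
print("rows per input, validation data: ", train_validation / n_inputs)
```

Under these assumptions, the number of training rows per input variable falls from roughly ten to six, which gives a sense of how much less information the validation-tuned model has to work with.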
For linear and logistic regression models, in addition to the cross validation and validation data approaches, you can also use likelihood-based criteria such as AIC or SBC to tune your penalized regression method. These criteria mathematically adjust the likelihood function by adding a penalty based on the sample size and the model complexity.
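For reference, the general textbook forms of the two criteria are shown below. The notation is an assumption, since the post does not define it: L is the maximized likelihood, p is the number of estimated parameters (for LASSO, commonly taken as the number of nonzero coefficients), and n is the sample size. The exact variants that the software computes may differ.

```latex
\mathrm{AIC} = -2\log L + 2p,
\qquad
\mathrm{SBC} = -2\log L + p\,\log n
```

Both criteria start from the same fit term. SBC penalizes complexity more heavily than AIC whenever log n > 2, that is, for sample sizes of about 8 or more.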
Although this post focuses on tuning LASSO, the cross validation and validation data approaches are general tuning approaches that can be used to tune other modern regression methods and machine learning algorithms. For example, the tuning parameter is the number of neighbors k in the k-nearest-neighbor algorithm, and it determines when to stop pruning back in a decision tree model. The idea is always the same: the validation error or the cross validation error is calculated for each candidate model that is obtained from a set of possible tuning parameter values, and the model with the minimum error is selected as the preferred model.
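That shared recipe can be sketched in plain Python. The example below tunes k for a one-dimensional k-nearest-neighbor regression by cross validation. It is a minimal illustration under stated assumptions: the data are hypothetical, the folds are assigned by interleaving row indices rather than at random, and none of the function names come from SAS Enterprise Miner.

```python
def k_fold_splits(n_rows, n_folds):
    """Yield (train_indices, holdout_indices) with interleaved folds."""
    for fold in range(n_folds):
        holdout = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, holdout


def cross_validation_error(fit, rows, n_folds):
    """Average squared error over all held-out observations."""
    total = 0.0
    for train, holdout in k_fold_splits(len(rows), n_folds):
        predict = fit([rows[i] for i in train])  # fit on the other folds
        total += sum((rows[i][1] - predict(rows[i][0])) ** 2 for i in holdout)
    return total / len(rows)


def make_knn_fitter(k):
    """Return a fit function for 1-D k-nearest-neighbor regression."""
    def fit(train_rows):
        def predict(x):
            nearest = sorted(train_rows, key=lambda r: abs(r[0] - x))[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit


# Hypothetical data: a linear trend plus small fixed perturbations.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 0.3]
rows = [(float(x), 2.0 * x + e) for x, e in enumerate(noise)]

# One candidate model per tuning parameter value.
candidate_ks = [1, 2, 3, 5]
cv_errors = {k: cross_validation_error(make_knn_fitter(k), rows, n_folds=4)
             for k in candidate_ks}
best_k = min(cv_errors, key=cv_errors.get)

for k in candidate_ks:
    print(f"k={k}  cross validation error={cv_errors[k]:.4f}")
print("selected k:", best_k)
```

Swapping make_knn_fitter for any other function that returns a fit function, such as one that fits a LASSO model for a given tuning parameter value, leaves the selection logic unchanged.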
My next post will focus on cross validation in greater detail. Stay tuned if you want to learn more about cross validation.
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N. 1989. "Prostate Specific Antigen in the Diagnosis and Treatment of Adenocarcinoma of the Prostate II Radical Prostatectomy Treated Patients." Journal of Urology.