The least absolute shrinkage and selection operator (LASSO) was developed by Tibshirani (1996) as an alternative to the ordinary least squares (OLS) method with two objectives in mind: to improve prediction accuracy, and to improve model interpretation by identifying a smaller subset of regressors that exhibit the strongest effects. This example presents a fully Bayesian interpretation and implementation of the LASSO that provides estimates of the regression parameters and their variances, along with Bayesian credible intervals that can guide variable selection.
The LASSO is commonly used to estimate the parameters in the linear regression model

$$y = \mu 1_n + X\beta + \epsilon$$

where y is the n × 1 vector of responses, μ is the overall mean, 1_n is an n × 1 vector of ones, X is the n × p matrix of standardized regressors, β is the p × 1 vector of regression coefficients, and ε is the n × 1 vector of independent and identically distributed normal errors with mean 0 and unknown variance σ². The LASSO estimates of Tibshirani (1996) are the solution to the minimization problem

$$\min_{\beta}\; (\tilde{y} - X\beta)^{T}(\tilde{y} - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$

for some λ ≥ 0, where ỹ = y − ȳ1_n is the centered response vector.
Tibshirani (1996) suggested that the LASSO estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace priors. Park and Casella (2008) consider a fully Bayesian analysis by using a conditional Laplace prior specification of the form

$$\pi(\beta \mid \sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sqrt{\sigma^2}}\, e^{-\lambda |\beta_j| / \sqrt{\sigma^2}}$$

and the noninformative scale-invariant marginal prior π(σ²) = 1/σ² on σ². Conditioning on σ² is important because it guarantees a unimodal full posterior. Park and Casella (2008) also note that any inverse-gamma prior for σ² maintains conjugacy.
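To make the connection between the Laplace prior and the LASSO penalty explicit (a brief added derivation, not part of the original text): for fixed σ², and with the response centered to remove μ, combining the normal likelihood with the conditional Laplace prior above gives a negative log-posterior for β that is, up to an additive constant,

$$-\log \pi(\beta \mid \tilde{y}, \sigma^2) = \frac{1}{2\sigma^2}(\tilde{y} - X\beta)^{T}(\tilde{y} - X\beta) + \frac{\lambda}{\sqrt{\sigma^2}} \sum_{j=1}^{p} |\beta_j| + \text{constant}$$

so the posterior mode is the LASSO solution with penalty parameter 2λ√σ².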
Exploiting the fact that the Laplace distribution can be represented as a scale mixture of normal densities with an exponential mixing density, Park and Casella (2008) propose the following hierarchical representation of the full model:
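Written out, following Park and Casella (2008), the hierarchy is

$$
\begin{aligned}
y \mid \mu, X, \beta, \sigma^2 &\sim \mathrm{N}_n\!\left(\mu 1_n + X\beta,\; \sigma^2 I_n\right) \\
\beta \mid \sigma^2, \tau_1^2, \ldots, \tau_p^2 &\sim \mathrm{N}_p\!\left(0_p,\; \sigma^2 D_\tau\right), \qquad D_\tau = \mathrm{diag}\!\left(\tau_1^2, \ldots, \tau_p^2\right) \\
\sigma^2, \tau_1^2, \ldots, \tau_p^2 &\sim \pi(\sigma^2)\, d\sigma^2 \prod_{j=1}^{p} \frac{\lambda^2}{2}\, e^{-\lambda^2 \tau_j^2 / 2}\, d\tau_j^2, \qquad \sigma^2, \tau_1^2, \ldots, \tau_p^2 > 0
\end{aligned}
$$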
The parameter μ can be given an independent, flat prior. After you integrate out τ₁², ..., τₚ², the conditional prior on β has the desired conditional Laplace distribution.
The Bayesian LASSO parameter λ can be chosen by using marginal maximum likelihood or an appropriate hyperprior. The example in the next section demonstrates the latter and considers, as suggested by Park and Casella (2008), the class of gamma priors on λ²,

$$\pi(\lambda^2) = \frac{\delta^{r}}{\Gamma(r)}\, (\lambda^2)^{r-1} e^{-\delta \lambda^2}, \qquad \lambda^2 > 0,\; r > 0,\; \delta > 0$$
This example from Park and Casella (2008) fits a Bayesian LASSO model to the diabetes data from Efron et al. (2004). In the original study, statisticians were asked to construct a model that predicted the response variable, Y, a quantitative measure of disease progression one year after baseline, from 10 covariates: Age, Sex, BMI, MAP, TC, LDL, HDL, TCH, LTG, and GLU. It was hoped that the model would produce accurate baseline predictions of response for future patients and that the form of the model would suggest which covariates were important factors in disease progression. The following SAS statements read the data and create the SAS data set Diabetes:
data diabetes;
   input age sex bmi map tc ldl hdl tch ltg glu y;
   sex=ifn(sex=2,1,0);   /* recode Sex (coded 1, 2) to a 0/1 indicator */
   datalines;
59 2 32.1 101.00 157 93.2 38.0 4.00 4.8598 87.000 151
48 1 21.6 87.00 183 103.2 70.0 3.00 3.8918 69.000 75
72 2 30.5 93.00 156 93.6 41.0 4.00 4.6728 85.000 141
... more lines ...
60 2 24.9 99.67 162 106.6 43.0 3.77 4.1271 95.000 132
36 1 30.0 95.00 201 125.2 42.0 4.79 5.1299 85.000 220
36 1 19.6 71.00 250 133.2 97.0 3.00 4.5951 92.000 57
;
Before specifying the model in the MCMC procedure, you need to standardize the model’s regressors, excluding the indicator variable Sex. You can use the STDIZE procedure as follows to perform this task:
proc stdize data=diabetes out=std_diabetes;
   var age bmi map tc ldl hdl tch ltg glu;
run;
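By default, PROC STDIZE centers and scales each variable to have mean 0 and standard deviation 1. As an optional check (not part of the original program), you can confirm the standardization with PROC MEANS:

proc means data=std_diabetes n mean std;
   var age bmi map tc ldl hdl tch ltg glu;
run;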
The following statements specify the Bayesian LASSO in PROC MCMC:
ods graphics on;
ods output postintervals=intervals;
proc mcmc data=std_diabetes seed=45678 nmc=50000 propcov=quanew
          monitor=(b0 beta1-beta10 tau1-tau10 sigma2 lasso)
          outpost=posterior;
   array D[10,10];
   array beta[10] beta1-beta10;
   array mu0[10];
   array data[10] age sex bmi map tc ldl hdl tch ltg glu;

   begincnst;
      call identity(D);
      call zeromatrix(mu0);
   endcnst;

   beginnodata;
      lasso=sqrt(lambda);
      b=lambda/2;
      %macro loop;
         %do k = 1 %to 10;
            tau&k = exp(omega&k);
            D[&k,&k]=sigma2*tau&k;
         %end;
      %mend loop;
      %loop;
   endnodata;

   call mult(beta, data, xb);

   parms lambda;
   prior lambda ~ gamma(1,scale=.1);
   parms omega1-omega10;
   prior omega: ~ expexpon(iscale=b);
   parms sigma2 1;
   prior sigma2 ~ igamma(shape = .1, iscale = .1);
   parms b0 0;
   prior b0 ~ general(0);
   parms beta;
   prior beta ~ mvn(mu0,D);

   model y ~ normal(b0 + xb,var=sigma2);
run;
The ODS OUTPUT statement saves the posterior credible intervals in the SAS data set Intervals. The NMC= option in the PROC MCMC statement requests 50,000 MCMC iterations, excluding the burn-in iterations. A large sample is used because the posterior samples are highly autocorrelated. The PROPCOV= option in the PROC MCMC statement requests that the quasi-Newton method be used in constructing the initial covariance matrix for the Metropolis-Hastings algorithm. The OUTPOST= option saves the posterior samples in the data set Posterior.
The next four statements create arrays that are used in the model. The array D is the covariance matrix for the regression parameters Beta1–Beta10. The array Beta is the vector of the regression parameters Beta1–Beta10. The array Mu0 is the mean vector for the prior distribution of the regression parameters Beta1–Beta10. The array Data is the vector of regressors, excluding the intercept.
The BEGINCNST and ENDCNST statements define a statement block within which PROC MCMC processes the programming statements only during the setup stage of the simulation. You can use the BEGINCNST and ENDCNST statement block to initialize the matrices D and Mu0. D is initially set to an identity matrix, and Mu0 is initialized as a zero vector.
The BEGINNODATA and ENDNODATA statements define a block within which PROC MCMC processes the programming statements without stepping through the entire data set; the statements are executed only twice, at the first and last observations. Within this block, the quantity Lasso (which represents λ, the square root of Lambda) and the quantity b (which represents λ²/2, the rate of the exponential mixing density) are defined. The macro %LOOP repopulates the diagonal of the matrix D. The purpose of the parameters Omega1–Omega10 and their relationship to the parameters Tau1–Tau10 are explained later.
The next statement uses the MULT CALL routine to compute XB, the product of the regressors and the regression parameters Beta1–Beta10. That is, XB is the linear predictor, excluding the intercept.
The following block of statements declares the model parameters and assigns prior distributions to them. The parameter Lambda, which represents λ², is specified to have a gamma distribution. The parameters Omega1–Omega10 are specified to have exponential-exponential distributions. The parameters τ₁², ..., τ₁₀² have exponential distributions, but modeling these parameters directly can cause convergence problems. Instead, the parameters Omega1–Omega10 are modeled directly, and within the macro %LOOP the quantities Tau1–Tau10, which represent τ₁², ..., τ₁₀², are defined as the exponentials of Omega1–Omega10, respectively. The parameter Sigma2, which represents σ², is specified to have an inverse-gamma distribution. The parameter B0, which represents μ, is specified to have an improper uniform prior. The parameter vector Beta, which represents β, is specified to have a multivariate normal prior with mean vector 0 and covariance matrix D.
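To see why this reparameterization works, consider a single mixing variance τ² that has an exponential prior with rate b = λ²/2, and let ω = log τ². A change of variables (an added illustration, not part of the original example) gives the density of ω:

$$p(\omega) = p(\tau^2)\left|\frac{d\tau^2}{d\omega}\right| = b\, \exp\!\left(\omega - b e^{\omega}\right), \qquad -\infty < \omega < \infty$$

Because ω is unbounded, the sampler can move freely on the whole real line rather than near the boundary at τ² = 0, which often improves mixing; the statement tau&k = exp(omega&k) then recovers τ² for use in the covariance matrix D.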
Finally, the MODEL statement specifies that the response variable Y has a normal distribution with mean B0 + XB and variance Sigma2.
Output 1 shows that the Monte Carlo standard errors (MCSE) of each parameter are small relative to the posterior standard deviations (SD). A small MCSE/SD ratio indicates that the Markov chain has stabilized and that the mean estimates do not vary much over time.
Output 1: Monte Carlo Standard Errors
The MCMC Procedure
Monte Carlo Standard Errors

| Parameter | MCSE | Standard Deviation | MCSE/SD |
|---|---|---|---|
| b0 | 0.1105 | 3.7685 | 0.0293 |
| beta1 | 0.0769 | 2.8187 | 0.0273 |
| beta2 | 0.2072 | 5.8461 | 0.0354 |
| beta3 | 0.1029 | 3.1602 | 0.0326 |
| beta4 | 0.1093 | 3.3479 | 0.0326 |
| beta5 | 0.3238 | 11.1592 | 0.0290 |
| beta6 | 0.3241 | 11.4013 | 0.0284 |
| beta7 | 0.1018 | 3.5868 | 0.0284 |
| beta8 | 0.2234 | 7.4085 | 0.0302 |
| beta9 | 0.1765 | 6.2293 | 0.0283 |
| beta10 | 0.0992 | 3.3421 | 0.0297 |
| tau1 | 0.0875 | 3.7978 | 0.0230 |
| tau2 | 0.0925 | 4.0235 | 0.0230 |
| tau3 | 0.1082 | 3.9813 | 0.0272 |
| tau4 | 0.1124 | 3.8682 | 0.0291 |
| tau5 | 0.0934 | 3.7605 | 0.0248 |
| tau6 | 0.1107 | 3.8139 | 0.0290 |
| tau7 | 0.1018 | 3.5868 | 0.0284 |
| tau8 | 0.1004 | 3.6653 | 0.0274 |
| tau9 | 0.1105 | 4.1032 | 0.0269 |
| tau10 | 0.0969 | 3.7154 | 0.0261 |
| sigma2 | 1.1459 | 202.7 | 0.00565 |
| lasso | 0.00400 | 0.1500 | 0.0266 |
Output 2 shows the “Effective Sample Sizes” table. The autocorrelation times for the parameters range from 1.59 to 62.83, and most of the efficiency rates are low. These results account for the relatively small effective sample sizes, given a nominal sample size of 50,000.
Output 2: Effective Sample Sizes
Effective Sample Sizes

| Parameter | ESS | Autocorrelation Time | Efficiency |
|---|---|---|---|
| b0 | 1164.0 | 42.9555 | 0.0233 |
| beta1 | 1343.4 | 37.2178 | 0.0269 |
| beta2 | 795.7 | 62.8345 | 0.0159 |
| beta3 | 942.3 | 53.0600 | 0.0188 |
| beta4 | 938.5 | 53.2742 | 0.0188 |
| beta5 | 1188.0 | 42.0892 | 0.0238 |
| beta6 | 1237.5 | 40.4033 | 0.0248 |
| beta7 | 1241.1 | 40.2856 | 0.0248 |
| beta8 | 1100.0 | 45.4557 | 0.0220 |
| beta9 | 1245.4 | 40.1475 | 0.0249 |
| beta10 | 1134.0 | 44.0920 | 0.0227 |
| tau1 | 1883.1 | 26.5514 | 0.0377 |
| tau2 | 1892.7 | 26.4173 | 0.0379 |
| tau3 | 1353.3 | 36.9456 | 0.0271 |
| tau4 | 1183.6 | 42.2445 | 0.0237 |
| tau5 | 1619.6 | 30.8725 | 0.0324 |
| tau6 | 1186.1 | 42.1539 | 0.0237 |
| tau7 | 1529.5 | 32.6909 | 0.0306 |
| tau8 | 1333.9 | 37.4850 | 0.0267 |
| tau9 | 1378.2 | 36.2783 | 0.0276 |
| tau10 | 1468.9 | 34.0388 | 0.0294 |
| sigma2 | 31287.3 | 1.5981 | 0.6257 |
| lasso | 1409.3 | 35.4777 | 0.0282 |
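If larger effective sample sizes are needed, one option (an optional modification, not part of the original example) is simply to run the chain longer by increasing the NMC= option, possibly with a longer burn-in via the NBI= option. For example, the following PROC MCMC statement requests 200,000 iterations after a 5,000-iteration burn-in; the remaining statements are unchanged:

proc mcmc data=std_diabetes seed=45678 nmc=200000 nbi=5000 propcov=quanew
          monitor=(b0 beta1-beta10 tau1-tau10 sigma2 lasso)
          outpost=posterior;
   /* same ARRAY, BEGINCNST, BEGINNODATA, PARMS, PRIOR, and MODEL
      statements as in the program above */
run;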
The following SAS statements use the OUTPOST data set Posterior and the ODS OUTPUT data set Intervals to generate a table of the Bayesian LASSO parameter estimates, which are the modes of the posterior samples for B0 and Beta1–Beta10, and their respective 95% HPD intervals:
proc means data=posterior mode;
   var b0 beta1-beta10;
   output out=parameters(drop=_TYPE_ _FREQ_) mode(b0 beta1-beta10)=b0 beta1-beta10;
run;

proc transpose data=parameters out=parameters;
run;

data parameters;
   length parameter $ 6;
   set parameters(rename=(col1=mode _NAME_=Parameter));
   label Parameter=;
   index=_N_;
run;

proc sort data=parameters out=parameters;
   by parameter;
run;

proc sort data=intervals out=intervals;
   by parameter;
run;

data parameters(where=(~missing(mode)));
   merge parameters intervals;
   by parameter;
   label parameter="Parameter" mode="Mode";
run;

proc sort data=parameters out=parameters;
   by index;
run;

proc print data=parameters noobs label;
   var parameter mode hpdlower hpdupper;
run;
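The printed table can also be summarized graphically. The following sketch (an addition, not part of the original example) uses PROC SGPLOT to plot each posterior mode with its 95% HPD interval and a reference line at 0:

proc sgplot data=parameters noautolegend;
   scatter y=parameter x=mode / xerrorlower=hpdlower xerrorupper=hpdupper;
   refline 0 / axis=x lineattrs=(pattern=shortdash);
   xaxis label="Posterior Mode and 95% HPD Interval";
   yaxis discreteorder=data label="Parameter";
run;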
Output 3 shows that the 95% HPD intervals for the parameters Beta1, Beta5, Beta6, Beta8, and Beta10 all contain 0. Unlike the frequentist LASSO, the Bayesian LASSO does not set regression parameters exactly to 0, so the inclusion of 0 in the HPD interval is the only indication that a variable is a candidate for exclusion from the model. Based on this criterion, the variables Age, TC, LDL, TCH, and GLU are the leading candidates for exclusion from the model.
Output 3: Bayesian LASSO Parameter Estimates and 95% HPD Intervals
| Parameter | Mode | HPDLower | HPDUpper |
|---|---|---|---|
| b0 | 162.2 | 155.2 | 170.1 |
| beta1 | -1.4406 | -5.8311 | 5.2856 |
| beta2 | -21.2417 | -33.7811 | -11.0211 |
| beta3 | 26.5303 | 18.3849 | 30.6794 |
| beta4 | 13.6935 | 9.4601 | 22.4973 |
| beta5 | -10.1466 | -33.9135 | 9.7807 |
| beta6 | 6.9991 | -17.1198 | 27.7032 |
| beta7 | -13.3526 | -17.2501 | -3.3104 |
| beta8 | 1.7061 | -14.5072 | 14.3417 |
| beta9 | 25.8343 | 15.8578 | 40.0418 |
| beta10 | 0.3158 | -2.5716 | 10.4408 |
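As a convenience (an added step, not part of the original program), a short DATA step can flag the parameters whose 95% HPD intervals contain 0, which identifies the same candidates for exclusion:

data candidates;
   set parameters;
   /* keep only parameters whose 95% HPD interval contains 0 */
   if hpdlower <= 0 and hpdupper >= 0;
run;

proc print data=candidates noobs label;
   var parameter mode hpdlower hpdupper;
run;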
Efron, B., Hastie, T. J., Johnstone, I. M., and Tibshirani, R. (2004), “Least Angle Regression (with Discussion),” Annals of Statistics, 32, 407–499.
Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010), “Penalized Regression, Standard Errors, and Bayesian Lassos,” Bayesian Analysis, 5, 369–412.
Park, T. and Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103, 681–686.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.