How to Use Popular PROCs in SAS/STAT

How to Use Popular PROCs in SAS/STAT®: Q&A, Slides, Popular PROCs Course Notes and On-Demand Recording

Watch this Ask the Expert session to learn about the comprehensive set of tools in SAS/STAT, which offers more than 100 procedures for statistical analysis, and how it scales to meet your needs.

Watch the webinar

Join Mike Patetta as he demonstrates Bayesian analysis, high-dimensional variable selection and survival analysis. During this webinar you will learn:

How to perform Bayesian analysis using PROC MCMC.
The high-dimensional variable selection methods that are in SAS/STAT.
How to fit survival models in PROC PHREG.
How these methods can help solve your research or business problems.

If you’d like to learn more about the three topics covered, please consider the following SAS Training courses:

Bayesian Analyses Using SAS®

Survival Analysis Using the Proportional Hazards Model

Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio

The questions from the Q&A segment held at the end of the webinar are listed below. The slides from the webinar are attached along with course notes on the three topics covered.

Q&A

Do the goodness-of-fit statistics in PROC MCMC include MDL (minimum description length)?

The goodness-of-fit statistic in PROC MCMC is the deviance information criterion (DIC), which is the Bayesian alternative to the AIC (Akaike information criterion) and BIC (Bayesian information criterion). Smaller values indicate a better fit to the data. The DIC can be applied to non-nested models and to models that have random effects. Minimum description length (MDL) is a model selection principle under which the model that gives the shortest description of the data is the best model. There is an MDL learning algorithm that uses the statistical notion of information rather than algorithmic information. PROC MCMC does not incorporate the MDL learning algorithm.
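As a rough illustration of what the DIC measures, the sketch below computes it by hand from posterior draws using the usual definition: the deviance at the posterior mean plus twice the effective number of parameters (pD). This is plain Python rather than SAS, the data are simulated, the normal model with a known standard deviation is an assumption made for simplicity, and the function names are hypothetical. It is not how PROC MCMC is implemented.

```python
import math
import random
import statistics

def normal_deviance(data, mu, sigma):
    """Deviance D(mu) = -2 * log-likelihood for a Normal(mu, sigma) model."""
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)
        for y in data
    )
    return -2.0 * log_lik

def dic(data, posterior_draws, sigma):
    """Return (DIC, pD) computed from posterior draws of mu."""
    mean_deviance = statistics.fmean(
        normal_deviance(data, mu, sigma) for mu in posterior_draws
    )
    deviance_at_mean = normal_deviance(data, statistics.fmean(posterior_draws), sigma)
    p_d = mean_deviance - deviance_at_mean  # effective number of parameters
    return deviance_at_mean + 2.0 * p_d, p_d

rng = random.Random(1)
data = [rng.gauss(5.0, 1.0) for _ in range(50)]  # simulated data, illustration only
sigma = 1.0                                      # assumed known
# With a flat prior on mu, the posterior is Normal(ybar, sigma / sqrt(n)).
ybar = statistics.fmean(data)
draws = [rng.gauss(ybar, sigma / math.sqrt(len(data))) for _ in range(5000)]
dic_value, p_d = dic(data, draws, sigma)
print(f"DIC = {dic_value:.2f}, pD = {p_d:.2f}")
```

Because this toy model has a single free parameter, pD should come out close to 1. When two candidate models are fit to the same data, the one with the smaller DIC is preferred.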

Is it important to specify the prior for the variances in PROC MCMC?

Yes, but it depends on the model. If you have a variance parameter in the model, you need to specify the prior for the variance. You would need to do this in a mixed model for example.

Do categorical variables have to be indicator variables to use them in PROC MCMC?

Yes. There is no CLASS statement in PROC MCMC, so you have to dummy code them in a DATA step or create an array.
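The dummy coding described above amounts to creating one 0/1 indicator column per non-reference level. The sketch below shows the idea in plain Python rather than in a DATA step; the function name, the `is_` column prefix, and the example values are all hypothetical.

```python
def indicator_columns(values, reference):
    """Build 0/1 indicator columns for every level except the reference level."""
    levels = sorted(set(values))
    kept = [level for level in levels if level != reference]
    return {
        f"is_{level}": [1 if value == level else 0 for value in values]
        for level in kept
    }

# Hypothetical categorical predictor with three levels.
race = ["white", "black", "other", "white", "black"]
dummies = indicator_columns(race, reference="white")
for name, column in dummies.items():
    print(name, column)
```

A variable with k levels yields k - 1 indicators. The omitted level acts as the reference category, so each indicator's coefficient is interpreted relative to it.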

What do the Type 3 statistics you requested provide that the default output does not? The p-values looked remarkably similar.

There is some evidence that the likelihood ratio statistics (requested using the TYPE3(LR) option) might more closely approximate a chi-square distribution in small to moderate sample sizes. Therefore, the likelihood ratio test is preferred over the default Wald tests, especially for small sample sizes.
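To see that the two kinds of test really are different statistics, the sketch below computes a Wald and a likelihood ratio chi-square for the simplest possible case: testing a single binomial proportion against a null value in a small sample. This is a one-parameter stand-in chosen for illustration, not the Type 3 tests of a survival model, and the counts are made up.

```python
import math

def wald_chi2(successes, n, p0):
    """Wald statistic: squared distance from p0, scaled by the estimated variance."""
    p_hat = successes / n
    return (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)

def lr_chi2(successes, n, p0):
    """Likelihood ratio statistic: twice the gain in log-likelihood over p0."""
    p_hat = successes / n
    return 2.0 * (
        successes * math.log(p_hat / p0)
        + (n - successes) * math.log((1 - p_hat) / (1 - p0))
    )

def chi2_1df_pvalue(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Made-up small sample: 3 events in 10 trials, null value p0 = 0.5.
wald = wald_chi2(3, 10, 0.5)
lr = lr_chi2(3, 10, 0.5)
print(f"Wald chi-square = {wald:.3f}, p = {chi2_1df_pvalue(wald):.3f}")
print(f"LR chi-square   = {lr:.3f}, p = {chi2_1df_pvalue(lr):.3f}")
```

With only 10 observations the two statistics, and therefore the two p-values, differ noticeably. In larger samples, when the null hypothesis is close to true, they tend to agree, which is consistent with the p-values looking similar in the webinar example.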

Can you share your code examples?

The code examples are included in the course notes pdf attached to this post.

Is the HPGENSELECT equivalent to supervised machine learning in SAS Viya?

The supervised machine learning procedures in SAS Viya are a more modern counterpart of the HP (High-Performance) procedures. They are similar but not equivalent. Viya has more power and flexibility.

In LASSO (Least Absolute Shrinkage and Selection Operator) models, can you define variables that must stay in the model?

Yes, there is an INCLUDE option, and you can use it with LASSO selection in PROC GENSELECT.

Can we do these methods in SAS Enterprise Guide and Enterprise Miner?

The point-and-click interface of Enterprise Miner does not support the Bayesian analysis shown in PROC MCMC. It does support high-dimensional variable selection (sequential and penalized regression methods) and discrete-time survival analysis. The point-and-click interface of Enterprise Guide does not support Bayesian analysis either. However, it does support fitting the Cox proportional hazards model and high-dimensional variable selection (sequential and penalized regression methods).

Are models harder to interpret when LASSO selection is used?

Yes. If model interpretability is your primary goal, LASSO is not as useful as the sequential variable selection methods. LASSO is a penalized regression method that places a constraint on the size of the regression coefficients. This constraint causes the coefficient estimates to be biased, but it improves the overall prediction error of the model by decreasing the variance of the coefficient estimates and/or the predictions. Therefore, it is more useful when predictive accuracy is your primary goal.
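The bias described above is easiest to see in a textbook special case: when the predictors are orthonormal, each LASSO coefficient is the ordinary least squares coefficient pulled toward zero by the penalty and set to exactly zero if it is smaller than the penalty (soft thresholding). The sketch below assumes that special case; the coefficient values and the penalty are made up, and real data with correlated predictors will not follow this formula exactly.

```python
import math

def soft_threshold(ols_coefficient, penalty):
    """LASSO estimate under an orthonormal design: shrink toward zero, then clip at zero."""
    shrunk = max(abs(ols_coefficient) - penalty, 0.0)
    return math.copysign(shrunk, ols_coefficient) if shrunk > 0 else 0.0

# Made-up unpenalized estimates and penalty.
ols_estimates = [2.5, -0.25, 0.75, -1.5]
lasso_estimates = [soft_threshold(b, penalty=0.5) for b in ols_estimates]
for ols, lasso in zip(ols_estimates, lasso_estimates):
    print(f"OLS {ols:6.2f} -> LASSO {lasso:6.2f}")
```

Every surviving coefficient is smaller in magnitude than its unpenalized counterpart, which is the bias, and the small coefficient is dropped entirely, which is the variable selection.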

Can I use a Gibbs sampler algorithm in PROC MCMC?

The Gibbs sampler method is an algorithm that sequentially samples from a joint distribution of two or more random variables. PROC MCMC uses the Gibbs sampler when you specify conjugate priors. If the posterior distributions are in the same family as the prior probability distribution, the prior and posterior are then called conjugate distributions and the prior is called a conjugate prior for the likelihood. Conjugate priors are preferred because it is possible to obtain closed-form solutions for the posterior distribution. In the Gibbs sampler, the samples are accepted 100% of the time since the prior and posterior are conjugate distributions and there is a closed-form solution for the posterior distribution.
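A minimal sketch of a Gibbs sampler is shown below for a normal model with unknown mean and variance, assuming a normal prior on the mean and an inverse-gamma prior on the variance so that both full conditional distributions have closed forms. It is written in plain Python for illustration; the prior settings, the simulated data, and the function name are assumptions, and this is not how PROC MCMC is implemented internally.

```python
import random
import statistics

def gibbs_normal(data, n_draws=4000, burn_in=1000, seed=42,
                 prior_mean=0.0, prior_var=1e6,
                 prior_shape=0.01, prior_scale=0.01):
    """Alternate draws of mu and sigma2 from their closed-form full conditionals."""
    rng = random.Random(seed)
    n = len(data)
    total = sum(data)
    sigma2 = statistics.variance(data)  # starting value
    draws = []
    for iteration in range(n_draws + burn_in):
        # mu | sigma2, data is normal (normal prior is conjugate for the mean).
        precision = 1.0 / prior_var + n / sigma2
        cond_mean = (prior_mean / prior_var + total / sigma2) / precision
        mu = rng.gauss(cond_mean, precision ** -0.5)
        # sigma2 | mu, data is inverse gamma (conjugate for the variance).
        shape = prior_shape + n / 2.0
        rate = prior_scale + 0.5 * sum((y - mu) ** 2 for y in data)
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        if iteration >= burn_in:
            draws.append((mu, sigma2))  # every draw is kept: no accept/reject step
    return draws

rng_data = random.Random(7)
data = [rng_data.gauss(10.0, 2.0) for _ in range(100)]  # simulated data
draws = gibbs_normal(data)
post_mean_mu = statistics.fmean(mu for mu, _ in draws)
post_mean_sigma2 = statistics.fmean(s2 for _, s2 in draws)
print(f"posterior mean of mu = {post_mean_mu:.3f}")
print(f"posterior mean of sigma2 = {post_mean_sigma2:.3f}")
```

Note that the loop contains no acceptance test. Each parameter is drawn directly from its conditional posterior, which is why the acceptance rate is 100%.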

I thought we would have an example of mixed modeling?

The Bayesian course has several examples of mixed models. The link for the course is:

https://support.sas.com/edu/schedules.html?crs=STBAY&ctry=US

After LASSO selection, are you supposed to rerun the logistic regression with the variables selected by the LASSO?

No. The LASSO selection method deliberately biases the coefficients in order to decrease the variance of the coefficient estimates. If model interpretability is the primary goal, I would recommend the sequential variable selection methods.

How would the Bayesian and standard logistic regression analysis of low birth weight differ?

If you use a non-informative prior, the inferences should be about the same. However, the Bayesian analysis will show posterior summaries such as the mean, standard deviation, and percentiles of the posterior distribution for each parameter. You also get the 95% equal-tail interval, which runs from the 2.5th to the 97.5th percentile of the posterior distribution, and the 95% highest posterior density (HPD) interval, which is the narrowest interval that contains 95% of the posterior distribution. In Bayesian analysis you can always calculate the probability that the parameter for each variable is greater than zero. This shows the advantage of Bayesian analysis: you can compute the probabilities directly instead of using p-values.
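The posterior summaries mentioned above can all be computed directly from the posterior draws. The sketch below, in plain Python with simulated draws from a deliberately skewed distribution, computes an equal-tail interval, an HPD interval (taken here as the narrowest window containing 95% of the sorted draws), and the probability that the parameter is greater than zero. The function names are hypothetical, and the percentile convention is a simple one that may differ slightly from what SAS reports.

```python
import math
import random

def equal_tail_interval(draws, level=0.95):
    """Interval between the lower and upper tail percentiles of the draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower = ordered[math.floor(tail * (len(ordered) - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper

def hpd_interval(draws, level=0.95):
    """Narrowest interval that contains the requested share of the draws."""
    ordered = sorted(draws)
    n = len(ordered)
    window = math.ceil(level * n)
    start = min(range(n - window + 1),
                key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]

def prob_positive(draws):
    """Share of posterior draws above zero, estimating P(parameter > 0)."""
    return sum(d > 0 for d in draws) / len(draws)

rng = random.Random(3)
# Simulated right-skewed "posterior" draws, for illustration only.
draws = [rng.gammavariate(2.0, 1.0) - 0.5 for _ in range(10000)]
et = equal_tail_interval(draws)
hpd = hpd_interval(draws)
p_pos = prob_positive(draws)
print(f"equal-tail: ({et[0]:.3f}, {et[1]:.3f})")
print(f"HPD:        ({hpd[0]:.3f}, {hpd[1]:.3f})")
print(f"P(parameter > 0) = {p_pos:.3f}")
```

For a symmetric posterior the two intervals nearly coincide. For a skewed posterior like this one, the HPD interval is shorter and shifted toward the mode.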

For high-dimensional variable selection, how many variables does a data set usually need to have for this method to apply?

There is no minimum number of variables required to use these methods. If you have many variables, I recommend first reducing the redundant variables and then using these high-dimensional variable selection methods to reduce the irrelevant variables. If you have very few variables (fewer than 10), I recommend fitting the full model and reducing the number of variables based on the model results.

Can bias be estimated and what about out of sample predictability?

PROC GENSELECT performs out-of-sample validation with the PARTITION statement. The model is fit on a training data partition and then the model is validated on a validation data partition. PROC GENSELECT reports the assessment statistics on both the training and validation data sets.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. PROC MCMC does not automatically estimate bias. However, known bias can be incorporated in the prior distribution. You would need to make one of the following assumptions: the value of the bias is known (this might be obtained through simulations), the bias has a known distribution with a mean of 0, or the bias is in one direction so that the known distribution has a nonzero mean.

I am trying to analyze data on PrEP (pre-exposure prophylaxis) use in different states. I also want that data to include ethnicity. Can I use the Bayesian model or survival analysis?

You can certainly use Bayesian analysis since you can fit any model in that framework. If your response variable is time until an event, then survival analysis would be the preferred model.

I do not understand the difference between STOP and CHOOSE in HPGENSELECT.

The STOP option specifies the criterion that is used to stop the selection process. The CHOOSE option specifies the criterion that is used to pick the final model: from the list of models produced at each step of the selection process, it chooses the one that yields the best value of the criterion.
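The difference between the two options can be pictured with a list of criterion values, one per selection step. The sketch below is a simplified reading of the two rules in plain Python, assuming a criterion where smaller is better and assuming that stopping happens at the first step after which the criterion gets worse. The criterion values and function names are made up, and the details of how SAS applies each option may differ.

```python
def stop_step(criterion_by_step):
    """Step at which selection halts: the last step before the criterion first worsens."""
    for step in range(1, len(criterion_by_step)):
        if criterion_by_step[step] > criterion_by_step[step - 1]:
            return step - 1
    return len(criterion_by_step) - 1

def choose_step(criterion_by_step):
    """Step with the best (smallest) criterion value among all steps that were run."""
    return min(range(len(criterion_by_step)), key=criterion_by_step.__getitem__)

# Made-up criterion values for steps 0 through 5 of a selection process.
criterion_by_step = [120.0, 104.5, 98.2, 99.0, 97.1, 101.3]
stopped_at = stop_step(criterion_by_step)
chosen = choose_step(criterion_by_step)
print(f"stopping rule halts at step {stopped_at}")
print(f"choosing rule picks step {chosen}")
```

In this made-up sequence the stopping rule halts at a local minimum, while the choosing rule looks across every step that was run and finds a better model later in the sequence.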

I did not realize linear mixed models were part of the discussion on Bayesian methods.

PROC MCMC can fit linear mixed models with the use of the RANDOM statement. The Bayesian course shows several examples of linear mixed models.

Can we obtain odds ratio from PROC HPGENSELECT?

There is no option to produce odds ratios in PROC HPGENSELECT. You would have to save the parameter estimates to a data set and then exponentiate them in a DATA step.
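The exponentiation step described above is a one-line calculation per parameter. The sketch below shows it in plain Python rather than in a DATA step; the parameter names and estimates are hypothetical stand-ins for values saved from a logistic model.

```python
import math

def odds_ratios(estimates):
    """Exponentiate each logistic regression coefficient, skipping the intercept."""
    return {
        name: math.exp(beta)
        for name, beta in estimates.items()
        if name != "Intercept"
    }

# Hypothetical saved parameter estimates on the log-odds scale.
estimates = {"Intercept": -1.2, "smoker": 0.693147, "age": -0.05}
ratios = odds_ratios(estimates)
for name, ratio in ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An estimate of about 0.693 on the log-odds scale corresponds to an odds ratio of about 2, and a negative estimate gives an odds ratio below 1.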

I like the use of _numeric_ but you might want to warn the users to make sure that their 0 - 1 outcome was coded as character and not numeric. Otherwise, the standardization will standardize their outcome. Yes/No?

Yes, the outcome needs to be a character variable if you are going to use the _numeric_ keyword in the model to represent the numeric predictor variables.

What procedures do you suggest to reduce redundancy?

I recommend PROC VARCLUS to reduce the redundancy of your numeric predictor variables.

If I specify conjugate priors, will PROC MCMC default to Gibbs?

Yes, but PROC MCMC can detect conjugacy only if the model parameter (not a function or a transformation of the model parameter) is used in the prior and likelihood distributions. If the parameter enters the likelihood function through a symbol or a transformation, then PROC MCMC resorts to the default sampling algorithm even though conjugacy still holds in theory. The sampling algorithm information can be found in the Parameters table, which is part of the analysis output.

How do you limit variables before fitting a logistic regression model using the sequential method?

I recommend reducing redundancy of the predictor variables first before you use any sequential variable selection method. I prefer PROC VARCLUS to reduce the redundancy of your numeric predictor variables.

Recommended Resources

Recent Developments in Survival Analysis with SAS® Software

Survival Analysis Using SAS®: A Practical Guide, Second Edition

A Survey of Methods in Variable Selection and Penalized Regression

Model Selection Using Information Criteria (Made Easy in SAS®)

Introducing the BGLIMM Procedure for Bayesian Generalized Linear Mixed Models

Bayesian Concepts: An Introduction

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow-up Q&A, slides and recordings from other SAS Ask the Expert webinars.
