Is this your first time using statistical procedures within SAS software? Are you new to statistics in general? Has it been a while since your last statistics course? Need a review of the multitude of statistical procedures found in SAS? If you answer yes to any of these questions, then this series is for you. In part 1, we discussed aspects of exploring and describing continuous variables. We investigated PROC SGPLOT, MEANS, UNIVARIATE, and CORR. In part 2, our discussion turned to the modeling aspects of continuous variables. Our focus was on PROC REG, GLM, GLMSELECT, and PLM. In part 3, we took our analysis to categorical variables. Specifically, we discussed procedures that allow us to investigate and explore any categorical variables in our data. In part 4, we will discuss modeling categorical variables using the popular procedure PROC LOGISTIC.
Back in part 2, we discussed modeling when the response variable was continuous. Our situation now is a categorical response. Let's focus on the simplest and most common analysis: binary logistic regression. Imagine that your categorical response variable contains exactly two possible outcomes. These could be Yes/No, 0/1, A/B, or any two levels, where we want to model the probability of one of the two outcomes.
Binary logistic regression is an example of a generalized linear model in which the response variable follows a binary (Bernoulli) distribution. The expected value of the response, which is the probability of the event of interest, is passed through the logit link function and modeled with the linear predictor. The logit converts the probability of the event into the log of the odds that the event occurs. This will be important later when we look at the results of the model.
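In symbols, writing p for the probability of the event and x1 through xk for the predictors, the logit link and its inverse are (standard formulas, not specific to this example):

```latex
\mathrm{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

The second expression is the transformation applied whenever a predicted logit is converted back into a predicted probability.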
When using PROC LOGISTIC, SAS already expects your response variable to be categorical, so it does not need to appear in a CLASS statement. You will still need to declare categorical predictors in the CLASS statement, as before. With a binary outcome, SAS also assumes a binary distribution and a logit link during the analysis. Those who would like to use a probit link instead can make that change using the LINK= option on the MODEL statement.
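As a sketch of where that option goes, the MODEL statement below requests a probit link in place of the default logit; everything else stays the same as in the logit version:

```sas
proc logistic data=sashelp.heart;
   class BP_Status Chol_Status Sex Weight_Status;
   /* LINK=PROBIT replaces the default LINK=LOGIT */
   model Status(event='Alive') = BP_Status Chol_Status Sex
         Weight_Status Height / link=probit;
run;
```

Keep in mind that with a probit link the parameter estimates are no longer log odds, so the odds ratio interpretation discussed later does not apply.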
One very important aspect of the code appears immediately after the response variable on the MODEL statement. Using the EVENT= option, you indicate to SAS which of the two response levels is the focus of the model, and therefore which probability will be predicted. If you omit this, SAS selects the modeled level based on the ordering of the response levels. This can, and will, cause problems if you are not watching for it. For completeness, I always specify EVENT= even when the default ordering would have accomplished the same thing.
Let’s look at an example.
proc logistic data=sashelp.heart plots=all;
   class BP_Status Chol_Status Sex Weight_Status;
   model Status(event='Alive') = BP_Status Chol_Status Sex Weight_Status Height;
run;
In the first table, we get a summary of information about the data set and model setup. We are assured of the response variable by name and how many unique levels our categorical response variable contains. Since there are only two levels, SAS automatically proceeded with a binary logit model. The optimization technique is mentioned because logistic regression uses maximum likelihood to estimate the parameters, which is an iterative process. You can change this technique if there are convergence issues.
Next, we can see how many observations were read from the provided data source and how many were used in the modeling process. A difference between the two numbers indicates that some observations contained missing values for the response variable or at least one of the predictor variables in the model. PROC LOGISTIC, like many SAS procedures, uses complete case analysis.
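If you want to see which variables are driving the drop in usable observations before you model, a quick missing-value count is a simple diagnostic (this is ordinary data exploration, not part of PROC LOGISTIC itself):

```sas
/* Count missing values for the continuous predictor */
proc means data=sashelp.heart n nmiss;
   var Height;
run;

/* Show missing levels for the categorical variables */
proc freq data=sashelp.heart;
   tables Status BP_Status Chol_Status Sex Weight_Status / missing;
run;
```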
The response profile reveals the frequency of each level of the response variable. It also indicates which level was selected for modeling. This is important to check to make sure that SAS is modeling the level of interest.
The class level information table displays the design variables that were generated for each of the variables mentioned in the CLASS statement. By default, SAS uses effect (deviation from the mean) coding for its design structure. You can change this in the options of the CLASS statement. If you are only interested in predictions from your model, you do not have to worry about the design variable structure. The structure does, however, allow you to answer some questions directly from the output, which can be helpful if you have specific questions of interest.
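For example, if you prefer reference-cell (dummy variable) coding to the default effect coding, the PARAM= and REF= options on the CLASS statement make the change; the particular reference levels below are just illustrative choices:

```sas
proc logistic data=sashelp.heart;
   /* PARAM=REF requests reference-cell (dummy) coding;
      REF= picks the reference level for a variable */
   class BP_Status(ref='Normal') Chol_Status(ref='Desirable')
         Sex Weight_Status / param=ref;
   model Status(event='Alive') = BP_Status Chol_Status Sex Weight_Status Height;
run;
```

With reference-cell coding, each parameter estimate compares a level directly to its reference level, which many people find easier to interpret.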
The convergence status note is likely the most important item in the output. It indicates that the modeling process reached convergence, so the results that follow can be viewed and interpreted. If anything other than satisfied appears, you can go back and adjust the estimation options that control the modeling process to achieve convergence or, as mentioned above, try changing the optimization technique.
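As a hedged sketch of what those adjustments look like, the MODEL statement options below raise the iteration limit and switch from the default Fisher scoring to Newton-Raphson; the specific values are arbitrary examples, not recommendations:

```sas
proc logistic data=sashelp.heart;
   class BP_Status Chol_Status Sex Weight_Status;
   /* MAXITER= allows more iterations; TECHNIQUE=NEWTON switches
      the optimization method from the default Fisher scoring */
   model Status(event='Alive') = BP_Status Chol_Status Sex
         Weight_Status Height / maxiter=100 technique=newton;
run;
```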
The Model Fit Statistics table provides two measures of fit, the Akaike Information Criterion (AIC) and the Schwarz Criterion (SC). First, note that there are two columns, one for a model with only an intercept and one that includes the covariates, or predictors. Comparing these two columns shows, through the reduction in the values, that including the covariates is an improvement over an intercept-only model. (We always hope this is true.) Second, when an additional model is fit for this problem and converges, we can compare the Intercept and Covariates columns between the models to see which one appears to fit the data better: the model with the smaller information criterion.
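For reference, both criteria penalize the same -2 log-likelihood by the number of estimated parameters k, with n the number of observations used; SC penalizes extra parameters more heavily once n exceeds about 8:

```latex
\mathrm{AIC} = -2\log L + 2k,
\qquad
\mathrm{SC} = -2\log L + k\,\log(n)
```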
Next is the Global Null Hypothesis table. This is the test of the overall significance of the model in question: is there at least one predictor variable in the model that is deemed significant? In this example, there is… but which?
The Type 3 Analysis of Effects table answers the question of which terms are significant. We can see the significance of each individual predictor variable within the model. Using this, we can decide whether removing non-significant terms might improve the overall model.
The estimated parameters for your model appear in the Analysis of Maximum Likelihood Estimates table. Remember that these parameters are on the logit, or log odds, scale (or the probit scale if you changed the link). That means the prediction from this equation is an estimated logit and must be transformed to obtain a predicted probability. Predictions for new data can be generated within the LOGISTIC procedure using the SCORE statement (not shown), which performs this transformation and produces predicted probabilities, of 'Alive' in this case, for each scored patient.
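A minimal sketch of that scoring step, assuming a hypothetical data set work.new_patients containing the same predictor variables: the SCORE statement applies the inverse link for you and writes the predicted probabilities to the OUT= data set.

```sas
proc logistic data=sashelp.heart;
   class BP_Status Chol_Status Sex Weight_Status;
   model Status(event='Alive') = BP_Status Chol_Status Sex Weight_Status Height;
   /* P_Alive in work.scored holds the predicted probability
      of 'Alive' for each observation in work.new_patients */
   score data=work.new_patients out=work.scored;
run;
```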
Next are the Odds Ratio Estimates table and plot. Odds ratio estimates transform the maximum likelihood estimates into a more interpretable form. They represent the effect of a change in a predictor on the odds (not the log odds) of the event of interest occurring. For example, according to the odds ratio estimates, the odds of 'Alive' are 48.2% higher for patients with borderline cholesterol than for patients with high cholesterol. Odds ratios can be used to describe the strength of the relationship between a predictor variable and the event in question.
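The conversion behind that table is just exponentiation of a parameter estimate. With reference-cell coding, a log odds estimate β for a comparison gives the general relationship below; the numeric line is only an illustration of how a coefficient near 0.393 would correspond to the 48.2% figure mentioned above, not a value read from this output:

```latex
\mathrm{OR} = e^{\beta},
\qquad
\text{e.g. } \beta \approx 0.393 \;\Rightarrow\; \mathrm{OR} \approx e^{0.393} \approx 1.482
```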
There are plenty of other tables and graphs that can be generated based on the statements, options, and sub-options that you provide in the code. If you would like to learn more about these items, check out our Categorical Data Analysis Using Logistic Regression course. Not only will you learn more about PROC LOGISTIC, but you will also learn how to use PROC GENMOD for these types of analyses.
You may have noticed that all the procedures mentioned above are from the SAS 9 platform. If you are utilizing SAS Workbench, each of these procedures is available to you. If you are utilizing SAS Viya, you do not need to worry: all SAS 9 procedures are executable within SAS Viya using the Compute Server. But what if you want to utilize the power of Cloud Analytic Services (CAS)? Are there versions of these statistical procedures that are CAS-enabled? Yes, there are. Visit this link to find a list of SAS 9 procedures and their comparable CAS-enabled procedures.
Regardless of whether you use the SAS 9 PROCs or the CAS-enabled PROCs, in SAS Viya or SAS Workbench, you will have the tools you need to model your categorical variables and be prepared to proceed with scoring or post-analysis. Give some of these procedures a try and let me know which is your favorite. See you in the next installment of this series.
Find more articles from SAS Global Enablement and Learning here.
The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.