
Statistical Procedures in SAS: The Not Scary Overview (Part 1)


 

Is this your first time using statistical procedures within SAS software? Are you new to statistics in general? Has it been a while since your last statistics course? Need a review of the multitude of statistical procedures found in SAS? If you answer yes to any of these questions, then this series is for you. In this series, we will review aspects of continuous data analysis and categorical data analysis. Our discussion includes data exploration as well as response analysis (modeling). Let’s begin with data exploration within continuous data analysis.

 

Let’s start with what many call the most important part of data exploration: plotting your data, which in SAS you can do with PROC SGPLOT. Just as the most important aspect of real estate is location, location, location, the most important part of data exploration is plot your data, plot your data, for the love of everything, plot your data.

 

Plotting your data not only lets you visually inspect it for possible data entry errors but also lets you notice possible patterns or relationships among the variables. PROC SGPLOT offers many plot types, and the right choice depends on your data and the questions you want to answer. Here is a short list of a few of these plots:

 

1. Box and whisker plot (sometimes called a box plot) gives a visual display of the five-number summary (minimum, first quartile, median, third quartile, and maximum). Observations that are considered outliers according to their relative distance from the box are represented by individual markers. Because the line inside the box is the median, this graph is great for comparing the centers and spreads of groups within a sample.

 

2. Scatterplot takes two continuous variables and plots them as ordered pairs against each other. Typically, if one of these variables is the response (target) it is placed on the vertical axis while the predictor (input) variable is placed on the horizontal axis. The scattering of points allows us to determine a possible relationship between these variables by noting any trend (line) within the image. The spread of these points also provides information about the strength of any relationship noticed.

 

3. Histograms are great ways to check the distribution of one variable. Visually, you will be able to see the shape and spread of the values. The balancing point of the histogram will point you towards the mean of the data. You can also see skewness and symmetry within the shape of the data.

 

Let’s look at some examples of PROC SGPLOT. Within each procedure step, you use a plot statement to indicate which type of statistical graphics plot (SGPLOT) you would like to create. Each of these plot statements has its own options that allow you to customize the plot.

 

01_damodl_blog5_sgplot_hist.png

02_damodl_blog5_sgplot_vbox.png

 

03_damodl_blog5_sgplot_scatter.png

 

04_damodl_blog5_sgplot_reg.png
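
For readers who cannot view the images, here is a minimal sketch of the four plot types discussed above. It is an illustration only, not necessarily the code shown in the screenshots: the SASHELP.CARS sample table and the variables chosen (MPG_Highway, Horsepower, Origin) are assumptions made for this example.

```sas
/* Illustrative sketch: SASHELP.CARS and these variables are assumed for the example */

/* Histogram of one continuous variable, with a fitted normal curve overlaid */
proc sgplot data=sashelp.cars;
  histogram mpg_highway;
  density mpg_highway / type=normal;
run;

/* Box plots of the same variable, one box per level of Origin */
proc sgplot data=sashelp.cars;
  vbox mpg_highway / category=origin;
run;

/* Scatterplot: predictor on the horizontal axis, response on the vertical axis */
proc sgplot data=sashelp.cars;
  scatter x=horsepower y=mpg_highway;
run;

/* Scatterplot with a fitted regression line and confidence limits */
proc sgplot data=sashelp.cars;
  reg x=horsepower y=mpg_highway / clm cli;
run;
```

Each step draws a single graph. The options after the slash (such as CATEGORY=, CLM, and CLI) are the kind of plot-specific options mentioned above.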

 


 

PROC MEANS provides us with summary statistics for our data. By default, five summary values are displayed: N (the count of nonmissing observations), Mean (the average), Std Dev (the standard deviation), Minimum, and Maximum. These are not the only summary statistics that PROC MEANS can provide. You can request specific statistics by listing their keywords on the PROC MEANS statement. However, once you take control of what you request, those will be the only summary statistics presented in the output. A CLASS statement is also available within PROC MEANS to produce statistics by group. The summary statistics are then broken down by the levels of the CLASS variables, and, unlike a BY statement, CLASS does not require the data to be sorted first.

 

Let’s look at some examples of PROC MEANS.

 

05_damodl_blog5_means_def.png

 

06_damodl_blog5_means_alt.png

 

07_damodl_blog5_means_mix.png

 

08_damodl_blog5_means_class.png
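
The three usage patterns described above (default statistics, requested statistics, and grouped statistics) might look like the following sketch. The SASHELP.CARS table and the variable choices are assumptions for illustration and may differ from the code in the screenshots.

```sas
/* Illustrative sketch: SASHELP.CARS and these variables are assumed for the example */

/* Default output: N, Mean, Std Dev, Minimum, Maximum */
proc means data=sashelp.cars;
  var msrp horsepower mpg_highway;
run;

/* Requested statistics only: the defaults are replaced, not extended */
proc means data=sashelp.cars n nmiss mean median q1 q3 maxdec=2;
  var msrp horsepower mpg_highway;
run;

/* Grouped statistics: one row per level of Origin, no sorting needed */
proc means data=sashelp.cars mean std min max;
  class origin;
  var mpg_highway;
run;
```

Note that the second step does not print Std Dev, Minimum, or Maximum, because they were not among the requested keywords.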

 

What if you want to dive deeper into the aspects of a single continuous variable? Give PROC UNIVARIATE a try. This procedure will display summary statistics (like PROC MEANS) but it goes even further into details of the variable. From statistical moments (mean, variance, etc.) to quantiles, to extreme observations, to tests of location, PROC UNIVARIATE gives a multitude of information about a single continuous variable.

 

PROC UNIVARIATE can also test whether a variable follows a specified distribution. You can pick from distributions such as Normal, Beta, Gamma, Exponential, etc. For the fitted distribution, the output provides three goodness-of-fit tests (Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling). Please be mindful of sample size when using these tests: with very large samples they tend to reject for departures too small to matter in practice, while with small samples they have little power to detect real departures. Be sure to look at the histogram as well to get a full picture of the possible distribution of the variable in question.

 

Let’s look at an example of PROC UNIVARIATE. Partial output is shown below.

 

09_damodl_blog5_univar_code.png

 

10_damodl_blog5_univar_moments.png

 

11_damodl_blog5_univar_quantiles.png

 

12_damodl_blog5_univar_extreme.png

 

13_damodl_blog5_univar_hist.png

 

14_damodl_blog5_univar_gof.png
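
As a rough sketch of how such output can be requested, the step below asks PROC UNIVARIATE for its standard tables plus a histogram with a fitted normal curve. The SASHELP.CARS table and the MPG_Highway variable are assumptions for illustration, not necessarily what the screenshots use.

```sas
/* Illustrative sketch: SASHELP.CARS and MPG_Highway are assumed for the example */
proc univariate data=sashelp.cars;
  var mpg_highway;
  /* NORMAL fits a normal curve and adds goodness-of-fit tests for it; */
  /* MU=EST and SIGMA=EST estimate the parameters from the data        */
  histogram mpg_highway / normal(mu=est sigma=est);
  /* Show a few key statistics inside the histogram */
  inset n mean std skewness / position=ne;
run;
```

Replacing NORMAL with another distribution option on the HISTOGRAM statement (for example GAMMA or EXPONENTIAL) requests the fit and tests for that distribution instead.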

 

Are you wondering which continuous predictor variables would be helpful in a model for your continuous target? Is it possible that some or all of your continuous predictors are related to each other? If so, then PROC CORR is for you. This correlation procedure determines the strength and significance of linear relationships. Note the word LINEAR. The Pearson correlation coefficient that PROC CORR reports by default measures only linear association, so it can be close to zero even when a strong non-linear relationship exists. That is why it is very important to inspect graphs like scatterplots before you rely on output from PROC CORR. Predictor variables that show strong, significant relationships with the response could be good candidates for later models you develop.

 

PROC CORR also allows you to look at relationships among the possible predictor variables. This helps you spot possible collinearity and avoid putting closely related predictor variables into the same model, where they can make parameter estimates unstable and inflate their standard errors.

 

Let’s look at some examples of PROC CORR. When checking the correlation of predictor variables with a response variable, the WITH statement is helpful. The response variable is placed in the WITH statement while the predictor variables are placed in the VAR statement. If you remove the WITH statement, PROC CORR computes the correlation for every pair of variables in the VAR statement.

 

15_damodl_blog5_corr_with.png

 

16_damodl_blog5_corr_withmatrix.png

 

17_damodl_blog5_corr_cross.png

 

18_damodl_blog5_cor_crossmatrix.png
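
The two forms described above can be sketched as follows. The SASHELP.CARS table, the response (MPG_Highway), and the predictors are assumptions for illustration and may not match the screenshots.

```sas
/* Illustrative sketch: SASHELP.CARS and these variables are assumed for the example */

/* Response in WITH, predictors in VAR: one row of correlations with the response */
proc corr data=sashelp.cars;
  var horsepower weight enginesize;
  with mpg_highway;
run;

/* No WITH statement: correlations for every pair of VAR variables, */
/* plus a scatterplot matrix to check whether the relationships look linear */
proc corr data=sashelp.cars plots=matrix(histogram);
  var horsepower weight enginesize;
run;
```

The first form helps screen predictors against the target. The second helps reveal predictors that are strongly related to each other.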

 

You may have noticed that all the procedures mentioned above are from the SAS 9 platform. If you are using SAS Workbench, each of these procedures is available to you. If you are using SAS Viya, you do not need to worry, as all SAS 9 procedures can be run within SAS Viya using the Compute Server. But what if you wanted to use the power of Cloud Analytic Services (CAS)? Are there versions of these statistical procedures that are CAS enabled? Yes, there are. Visit this link to find a list of SAS 9 procedures and their comparable CAS-enabled procedures.

 

From the provided link, you can see that PROC CORR becomes PROC CORRELATION, PROC MEANS becomes PROC MDSUMMARY, and PROC UNIVARIATE becomes PROC CARDINALITY and PROC MDSUMMARY.
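
As a hedged sketch of what the CAS-enabled counterparts might look like, the code below loads a table into CAS and runs PROC MDSUMMARY and PROC CORRELATION on it. It assumes a SAS Viya environment where you can start a CAS session. The session name (mysess), the libref (mycas), the CASUSER caslib, and the SASHELP.CARS table are all assumptions for illustration, so check the documentation for your release before relying on the exact syntax.

```sas
/* Illustrative sketch: session name, libref, caslib, and table are assumed */
cas mysess;                         /* start a CAS session */
libname mycas cas caslib=casuser;   /* libref that points to in-memory CAS tables */

/* Load a copy of the sample data into CAS */
data mycas.cars;
  set sashelp.cars;
run;

/* Summary statistics by group, written to a CAS output table */
proc mdsummary data=mycas.cars;
  var msrp horsepower mpg_highway;
  groupby origin / out=mycas.cars_summary;
run;

proc print data=mycas.cars_summary;
run;

/* Correlations of the predictors with the response, computed in CAS */
proc correlation data=mycas.cars;
  var horsepower weight enginesize;
  with mpg_highway;
run;

cas mysess terminate;               /* end the CAS session */
```

Unlike PROC MEANS, PROC MDSUMMARY writes its results to an output table rather than printing a report, which is why the sketch follows it with PROC PRINT.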

 

Regardless of your use of the SAS 9 PROCs or the CAS-enabled PROCs, in SAS Viya or SAS Workbench, you will have the tools you need to explore your continuous variables and be prepared to proceed with your modeling. Give some of these procedures a try and let me know which is your favorite. See you in the next installment of this series.

 

 

Find more articles from SAS Global Enablement and Learning here.
