Is this your first time using statistical procedures within SAS software? Are you new to statistics in general? Has it been a while since your last statistics course? Need a review of the multitude of statistical procedures found in SAS? If you answer yes to any of these questions, then this series is for you. In part 1, we discussed aspects of exploring and describing continuous variables. We investigated PROC SGPLOT, MEANS, UNIVARIATE, and CORR. In part 2, our discussion turned to the modeling aspects of continuous variables. Our focus was on PROC REG, GLM, GLMSELECT, and PLM. In part 3, we take our analysis to categorical variables. Specifically, we will discuss procedures that allow us to investigate and explore any categorical variables in our data.
Back in part 1, we saw that several procedures helped us understand what was going on with continuous variables. In the categorical realm, one procedure stands out as the go-to whenever someone has categorical data: PROC FREQ.
PROC FREQ can perform a wide range of exploration and analysis. Let’s start with the one-way frequency table. When given a new data set, it is important to check for typographical errors within the data. For categorical data, these could include inconsistent use of capital and lowercase letters, misspelled level names, and extra levels that should not exist.
One-way frequency tables list each distinct level of the categorical variable along with its frequency, percent, cumulative frequency, and cumulative percent. Frequency is the count of observations in the data that have that level. Percent is that count expressed as a percentage of the overall sample size. Cumulative frequency and cumulative percent are running totals of frequency and percent as you progress down the table. These cumulative values are meaningful only when the levels of the categorical variable are ordinal and appear in their logical order. (Remember that SAS, by default, sorts these tables in alphanumeric order.)
proc freq data=sashelp.heart;
table status BP_status Chol_status sex Weight_status;
run;
For a one-way frequency table, you simply list the variables that you wish to view in the TABLE statement.
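As a small variation on the step above, the following sketch shows two options that can help with a nominal variable, where cumulative statistics have no natural meaning. ORDER=FREQ on the PROC statement sorts the levels by descending frequency, and NOCUM on the TABLE statement suppresses the cumulative columns. The choice of sex as the variable is only for illustration.

```sas
/* Sort levels by descending frequency instead of alphanumeric order */
proc freq data=sashelp.heart order=freq;
   /* NOCUM drops cumulative frequency and cumulative percent */
   tables sex / nocum;
run;
```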
Note that if the categorical variable in question has any missing observations, the count of missing values is reported at the bottom of the frequency table.
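If you would rather see missing values treated as a level of their own, so that they appear in the table and count toward the percentages, the MISSING option on the TABLE statement is one way to do it. This is a minimal sketch; it assumes that Chol_status has some missing values in your copy of sashelp.heart.

```sas
proc freq data=sashelp.heart;
   /* MISSING treats missing values as a valid level of the variable */
   tables Chol_status / missing;
run;
```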
Where are the images? I thought that SAS loved to share images from its procedures. Do not fear! Our friend, the PLOTS= option, is still available to help us create images from PROC FREQ. However, PROC FREQ uses the PLOTS= option a little differently. Placing the PLOTS= option on the PROC statement within FREQ will not produce any images, because in PROC FREQ the PLOTS= option belongs to the TABLE statement. I like to say that this is how PROC FREQ is a little freaky. (Please forgive the pun.) An example of this will follow.
But what if you want to check for a potential relationship between two categorical variables? Can we return to our friend PROC CORR? Sadly, no; PROC CORR is not available for categorical variables. Fortunately, PROC FREQ has us covered here as well. In this situation, we will request a cross-tabulation table.
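For a one-way table, a minimal sketch of the placement looks like this. PLOTS=FREQPLOT requests a frequency plot, and ODS Graphics must be enabled for any plot to appear; the choice of BP_status is only for illustration.

```sas
/* ODS Graphics must be on for PROC FREQ to produce plots */
ods graphics on;

proc freq data=sashelp.heart;
   /* PLOTS= goes after the slash on the TABLE statement, not on the PROC statement */
   tables BP_status / plots=freqplot;
run;
```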
In a cross-tabulation table, one categorical variable is deemed the row variable and the other the column variable. These rows and columns create cells at their intersections. Within these cells, we are provided by default with frequency, percent, row percent, and column percent. Frequency is the count of observations in the data that have the specific row level and column level that define that cell. Percent is that cell’s frequency expressed as a percentage of the overall sample size. Row percent is the cell’s frequency expressed as a percentage of the total for the row to which the cell belongs. Column percent is the cell’s frequency expressed as a percentage of the total for the column to which the cell belongs.
Row and column percentages let you look for a potential relationship visually. However, this amounts to using PROC EYEBALL, and that procedure can be easily misled. It is typical practice to place a potential predictor variable in the row and the potential response variable in the column. In this setup, we can look at the row percentages and see if they differ across the levels of the row variable. But how different is different? Again, this is why we would like a more substantial statistical element to assist us. Thus, we have the chi-square test of association.
The chi-square test of association compares the observed count in each cell with the count expected under the assumption of no association. Each cell makes its own contribution to the final statistic, and the chi-square statistic is the sum of these contributions across all cells in the table. The test then produces a p-value that allows us to determine the significance of the association between the two categorical variables.
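In symbols, the cell statistics can be written as follows. The notation is introduced here only for illustration: $n_{ij}$ is the frequency of the cell in row $i$ and column $j$, $n_{i\cdot}$ is the total for row $i$, $n_{\cdot j}$ is the total for column $j$, and $n$ is the overall sample size.

```latex
\begin{aligned}
\text{Percent}_{ij} &= 100 \times \frac{n_{ij}}{n}, \\
\text{Row percent}_{ij} &= 100 \times \frac{n_{ij}}{n_{i\cdot}}, \\
\text{Column percent}_{ij} &= 100 \times \frac{n_{ij}}{n_{\cdot j}}.
\end{aligned}
```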
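Written out, the Pearson chi-square statistic takes the following form. The notation is an assumption made for this sketch: $n_{ij}$ is the observed count in row $i$ and column $j$, $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, $n$ is the overall sample size, and the table has $r$ rows and $c$ columns.

```latex
\begin{aligned}
E_{ij} &= \frac{n_{i\cdot}\, n_{\cdot j}}{n}, \\
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(n_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad \text{with } (r-1)(c-1) \text{ degrees of freedom.}
\end{aligned}
```

Each term in the double sum is one cell's contribution, so a cell whose observed count sits far from its expected count adds more to the statistic.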
proc freq data=sashelp.heart;
table status*(BP_status Chol_status) / plots=all chisq;
run;
Here is a bonus tip that will save some typing. If you have several predictor variables and a single response variable, or vice versa, you can apply the distributive property and group all your row variables or column variables in a set of parentheses. The cross-tabulation is then repeated for every variable in the set. In the example above, the variable status is placed in the row position, and each variable in the set takes a turn as the column variable. Note that we have also included the PLOTS=ALL option, which produces the images created by PROC FREQ. The CHISQ option requests the chi-square test of association. Its output also includes other statistics, among them Cramer’s V, which is useful for discussing the strength of the association.
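If you want to see the pieces that go into the chi-square statistic, one possible variation is to ask PROC FREQ to print them in each cell. In this sketch, EXPECTED adds the expected counts and CELLCHI2 adds each cell's contribution to the chi-square statistic. The NOROW, NOCOL, and NOPERCENT options are included only to keep the cells uncluttered.

```sas
proc freq data=sashelp.heart;
   /* EXPECTED: expected count under no association          */
   /* CELLCHI2: each cell's contribution to the chi-square    */
   tables status*BP_status / chisq expected cellchi2
                             norow nocol nopercent;
run;
```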
Before we complete this discussion, here is a bonus procedure that can assist you with your exploration of categorical variables. Assume that you have a significant association between two categorical variables, as determined by the PROC FREQ output, but now you are tasked with describing the relationship that is present. In this case, you can use correspondence analysis.
Correspondence analysis, PROC CORRESP, is an exploratory multivariate technique that converts a data matrix into a low-dimensional graphical display that shows the relationships between the rows and columns of the matrix. Each row and each column is represented by a point in a Euclidean space determined from the cell frequencies.
In this example, we will examine the nature of the association in a two-way table whose rows are defined by combinations of the values of gender and age and whose columns are defined by movie. First, we will determine whether there is a statistically significant association. This example involves three variables (movie, age, and gender). One analysis strategy is to concatenate age and gender and analyze a two-way table in PROC FREQ where the columns correspond to the levels of movie and the rows correspond to combinations of the levels of age and gender.
data moviesc;
set mult.movies;
group = gender||age;
run;
proc freq data=moviesc;
tables group*movie / chisq;
weight count;
run;
The chi-square statistic for the contingency table of movie by the combinations of gender and age is 6386.864, with a p-value less than 0.0001. This reflects the large sample size rather than the strength of the association. Cramer’s V, a measure of the strength of the association, is relatively small at 0.1380.
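To see why a very large chi-square statistic can go together with a small Cramer's V, it helps to look at how V is computed. In the sketch below, $n$ is the total (weighted) sample size and $r$ and $c$ are the numbers of rows and columns in the table; this notation is introduced here for illustration.

```latex
V = \sqrt{\frac{\chi^2}{n \,\min(r-1,\; c-1)}}
```

Because the chi-square statistic is divided by $n$, V does not grow simply because the sample is large. For tables larger than 2×2 it ranges from 0 (no association) to 1 (perfect association).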
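/* CROSS=ROWS crosses gender and age to form the rows of the table. */
/* OBSERVED prints the contingency table, RP prints the row         */
/* profiles, and SHORT trims the remaining statistical output.      */
proc corresp data=mult.movies cross=rows observed rp short;
tables gender age, movie; /* row variables, then the column variable */
weight count;
run;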
Compare the points for movie to the points corresponding to the age and gender combinations. Points that are in the same direction away from the origin are associated with each other. The plot shows that:
For more information about PROC CORRESP, consider taking our multivariate analysis course.
You may have noticed that all the procedures mentioned above are from the SAS 9 Platform. If you are utilizing SAS Workbench, each of these procedures is available to you. If you are utilizing SAS Viya, you do not need to worry, as all SAS 9 procedures are executable within SAS Viya using the Compute Server. But what if you wanted to utilize the power of Cloud Analytic Services (CAS)? Are there versions of these statistical procedures that are CAS enabled? Yes, there are. Visit this link to find a list of SAS 9 procedures and their comparable CAS-enabled procedures.
Regardless of your use of the SAS 9 PROCs or the CAS-enabled PROCs, in SAS Viya or SAS Workbench, you will have the tools you need to explore your categorical variables. Give some of these procedures a try and let me know which is your favorite. See you in the next installment of this series.