One of the most useful capabilities in any data analysis is taking some data and grouping it into logical categories. Whether its numeric, date, or character, being able to take ABC, DEF, and GHI and group them to “Category 1” makes life a lot easier, and may potentially tease out relationships previously unseen.
There are any number of ways of doing this in SAS University Edition, but the one I wanted to focus on today is PROC FORMAT.
Get the Data
The data can be downloaded from the province of Nova Scotia’s Open Data site, here. Multiple files formats of the data are available; I chose CSV for Excel. I did clean up the data prior to importing it, specifically removing the 90th percentile columns and deleting the “Rolling totals” rows. Otherwise, the data was imported as it is.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
The data was already in a format that I could use, as it was in standard CSV format.
Here’s a few rows of the data as it was imported:
You’ll note the procedures are highly varied, and the Consult_Median column seems to have a wide range of values (including missing). I’d like to be able to have a little more logic applied to these for graphing and analysis.
The first step is to build the groups, and because there are over 100 distinct procedures, I’m going to pick on two – Knee Scope and Knee Scope with ACL Repair.
proc format; value $procedure 'Knee Scope' = 'Ortho' 'Knee Scope with ACL repair' = 'Ortho' other = 'Other'; run;
You’ll notice that I explicitly state the values I want grouped, and then I indicate other = ‘Other’. SAS University Edition is smart enough to know that by specifying other, I want all remaining values to be in the one group.
The next step is to apply my new format to my data using PROC PRINT. The noobs suppresses the column with the Row Number, and label just uses the variable names as the column headers.
I’m also specifying I want to see Year and Consult_Median in my output, without any specific formatting applied.
proc print data=work.import noobs label; label procedure='Procedure Type'; format procedure $procedure.; run;
Here is the new output, with “ORTHO” highlighted:
So that’s the first step – what about the numeric data? I’ll use the Consult_Median column, and I want to bucket it into weeks / months; I’m keeping the format for the Procedure in the code.
proc format; value $procedure 'Knee Scope' = 'Ortho' 'Knee Scope with ACL repair' = 'Ortho’ other = 'Other'; value consult 0 - 14 = '1-2weeks’ 15 - 30 = '2-4weeks' 30 - 60 = '1-2months' other = '2+months'; run;
Here I am specifying numeric ranges (notice without the ‘ around the actual values) and then
providing the value I want created.
In the PROC PRINT statement, I only have to add the new format to the consult_median column. Because the original values are numeric, I don’t need the $ like I do with the procedure (which is character).
proc print data=work.import noobs label; label procedure='Procedure Type'; format procedure $procedure. consult_median consult.; run;
Here’s my output with the brand new Consult_Median groups. You can obviously carry this through to Year (‘Before 2015’ and ‘After 2015’) Zones (‘Zones1_2’,’Zones3_4’) or any other variable.
As with everything, it is extremely important to validate, cross-check and revalidate your data. As an example, if you typed ‘Znoe 1’ for the Zone1_2 group, and had an “Other” bucket, all your Zone 1 data would end up in the wrong category and seriously throw your numbers off.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.