Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
One of the most useful capabilities in any data analysis is taking some data and grouping it into logical categories. Whether its numeric, date, or character, being able to take ABC, DEF, and GHI and group them to “Category 1” makes life a lot easier, and may potentially tease out relationships previously unseen.
There are any number of ways of doing this in SAS, but the one I wanted to focus on today is PROC FORMAT.
The data can be downloaded from the province of Nova Scotia’s Open Data site, here. Multiple files formats of the data are available; I chose CSV for Excel. I did clean up the data prior to importing it, specifically removing the 90th percentile columns and deleting the “Rolling totals” rows. Otherwise, the data was imported as it is.
The data was already in a format that I could use, as it was in standard CSV format.
Here’s a few rows of the data as it was imported:
You’ll note the procedures are highly varied, and the Consult_Median column seems to have a wide range of values (including missing). I’d like to be able to have a little more logic applied to these for graphing and analysis.
The first step is to build the groups, and because there are over 100 distinct procedures, I’m going to pick on two – Knee Scope and Knee Scope with ACL Repair.
proc format;
value $procedure 'Knee Scope' = 'Ortho'
'Knee Scope with ACL repair' = 'Ortho'
other = 'Other';
run;
You’ll notice that I explicitly state the values I want grouped, and then I indicate other = ‘Other’. SAS University Edition is smart enough to know that by specifying other, I want all remaining values to be in the one group.
The next step is to apply my new format to my data using PROC PRINT. The noobs suppresses the column with the Row Number, and label just uses the variable names as the column headers.
I’m also specifying I want to see Year and Consult_Median in my output, without any specific formatting applied.
proc print data=work.import noobs label;
label procedure='Procedure Type';
format procedure $procedure.;
run;
Here is the new output, with “ORTHO” highlighted:
So that’s the first step – what about the numeric data? I’ll use the Consult_Median column, and I want to bucket it into weeks / months; I’m keeping the format for the Procedure in the code.
proc format;
value $procedure 'Knee Scope' = 'Ortho'
'Knee Scope with ACL repair' = 'Ortho’
other = 'Other';
value consult 0 - 14 = '1-2weeks’
15 - 30 = '2-4weeks'
30 - 60 = '1-2months'
other = '2+months';
run;
Here I am specifying numeric ranges (notice without the ‘ around the actual values) and then
providing the value I want created.
In the PROC PRINT statement, I only have to add the new format to the consult_median column. Because the original values are numeric, I don’t need the $ like I do with the procedure (which is character).
proc print data=work.import noobs label;
label procedure='Procedure Type';
format procedure $procedure. consult_median consult.;
run;
Here’s my output with the brand new Consult_Median groups. You can obviously carry this through to Year (‘Before 2015’ and ‘After 2015’) Zones (‘Zones1_2’,’Zones3_4’) or any other variable.
As with everything, it is extremely important to validate, cross-check and revalidate your data. As an example, if you typed ‘Znoe 1’ for the Zone1_2 group, and had an “Other” bucket, all your Zone 1 data would end up in the wrong category and seriously throw your numbers off.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
It is always good to remember that you can use formats to assign categories. You can do the same thing for numeric data, where the process is called "binning."
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.