I follow two rules when using self-defined formats:
One benefit of using formats: if a text has to be changed, you change it in one place (the format definition), and no dataset that uses the format has to be updated.
Reducing the stored size of datasets matters less because of the disk space used than because of I/O throughput. Most of the time, SAS servers are I/O bound, not CPU bound. Once you move data to in-memory servers, the effect becomes less pronounced.
But simply comparing
if var1 = 'myocardial infarction' then var2 = 'cardiovascular disease';
with
if var1 = 3 then var2 = 'cardiovascular disease';
from a coder's POV makes the question moot. You'll have more problems with the first, as it provides much more opportunity for typos.
Replacing the whole statement (and the associated if-then-else avalanche) with the simple use of a format reduces the code to a mere 10% or less of what it was before, and makes it failsafe. Maintaining a format from a lookup dataset automatically will reduce code complexity even further.
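A minimal sketch of what that could look like, assuming a lookup dataset with the hypothetical columns code and description. The dataset names (lookup, cntl, have, want), the format name diaggrp and the second code/label pair are invented for illustration; only code 3 and its label come from the comparison above.

```sas
/* Hypothetical lookup table: one row per code */
data lookup;
  infile datalines dsd;
  input code description :$40.;
  datalines;
3,cardiovascular disease
4,respiratory disease
;
run;

/* Turn the lookup table into a control dataset for PROC FORMAT */
data cntl;
  set lookup(rename=(code=start description=label));
  retain fmtname 'diaggrp' type 'N';   /* numeric format named DIAGGRP */
run;

/* Build the format from the control dataset */
proc format cntlin=cntl;
run;

/* Hypothetical input data with the numeric code in var1 */
data have;
  input var1;
  datalines;
3
4
;
run;

/* One assignment replaces the whole IF-THEN-ELSE chain */
data want;
  set have;
  length var2 $ 40;
  var2 = put(var1, diaggrp.);
run;
```

When a row is added to or changed in the lookup table, rerunning the two middle steps refreshes the format; the final data step does not change.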
A mixed environment will always introduce levels of complexity caused by different ways of thinking about data. To remedy that, you need clear guidelines for how to jump the barrier, and those need to be communicated and followed.
E.g. "when data is prepared for R, replace categorical variables with the long texts by using put()".
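Such a guideline could be followed roughly like this. This is only a sketch: the format diaggrp, the datasets have and for_r, the variable diagnosis_text and the output file name for_r.csv are all assumptions made up for the example.

```sas
/* Hypothetical format and input data */
proc format;
  value diaggrp
    3     = 'cardiovascular disease'
    other = 'other';
run;

data have;
  input id var1;
  datalines;
1 3
2 5
;
run;

/* Replace the coded variable with its long text before handing data to R */
data for_r;
  set have;
  length diagnosis_text $ 40;
  diagnosis_text = put(var1, diaggrp.);
  drop var1;
run;

/* Export as CSV; the file name is a placeholder */
proc export data=for_r outfile='for_r.csv' dbms=csv replace;
run;
```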
If users repeatedly cause havoc by not following those guidelines, then their competence or willingness to do work must be questioned, with the usual consequences.
As you can guess, in the pure SAS environment that I am responsible for, your problems do not exist, and formats are simply the way to do it. If a custom format is used in a dataset, and the format is not created and stored centrally, then the code that creates the format MUST be included in the package if data is handed over to someone else.
@Kurt_Bremser wrote:
A mixed environment will always introduce levels of complexity caused by different ways of thinking about data. To remedy that, you need clear guidelines for how to jump the barrier, and those need to be communicated and followed.
Very much an issue to maintain these two bits.
I hated working in an SPSS-only shop because every release of SPSS meant rewriting many "how to" documents: SPSS would drop features, and existing code had to be modified. It often took days just to find the workaround before being able to document the process. Changing one of the reports a client liked was such a headache that we kept a version of SPSS that was 6 versions out of date loaded on a couple of machines, just to create that one report.
You make many valid and well thought out points. As a lifelong SAS bigot, I notice a few considerations that should be, well, considered.
For some variables, the character equivalent has to be invented. For example, survey responses might use a scale of:
1 = strongly agree
10 = strongly disagree
If you have to make up translations for all 10 possible values, that's work. And it makes life difficult down the road if you want to consider a range of values such as 1 to 3, or if you want to take the average of the responses.
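With the numeric scale kept as is, both of those tasks stay simple. A small sketch, assuming invented response data and an invented three-way grouping (the format name agreegrp and the cut points 4-7 and 8-10 are not from the survey described here):

```sas
/* Hypothetical responses on the 1-10 scale */
data responses;
  input id q1;
  datalines;
1 2
2 9
3 5
4 1
;
run;

/* Group ranges of the scale without creating a new variable */
proc format;
  value agreegrp
    1 - 3  = 'agree (1-3)'
    4 - 7  = 'middle (4-7)'
    8 - 10 = 'disagree (8-10)';
run;

/* Counts by range: the format does the grouping */
proc freq data=responses;
  tables q1;
  format q1 agreegrp.;
run;

/* The average still works because q1 is numeric */
proc means data=responses mean;
  var q1;
run;
```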
This points out another consideration ... who is going to check all the translations? In the case of a survey, the same translations might apply to dozens of variables. Who is volunteering to make sure all the dozens of variables were coded correctly?
For some variables, you will need multiple translations. You might need something like:
if var1=1 then do;
   var2='myocardial infarction';
   var3='cardiovascular disease';
   var4='circulatory system';
   var5='heart';
end;
Keeping track of which variable is which involves two tasks. First, you have to know the list of available character strings and select the right one. And second, when creating the character strings (assuming many values might categorize as "heart"), you have to verify that the spelling is identical for all relevant variables (for example, always "heart" and never "Heart"). I assume you would handle this by creating formats anyway, and using the formats instead of IF THEN statements to create the translations.
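Handled with formats, the multiple translations might look like the sketch below. The format names (diag, dgroup, dsys, dorgan) and the dataset names are hypothetical; the labels for code 1 are the ones from the example above.

```sas
/* One format per level of translation; each spelling exists in exactly one place */
proc format;
  value diag   1 = 'myocardial infarction';
  value dgroup 1 = 'cardiovascular disease';
  value dsys   1 = 'circulatory system';
  value dorgan 1 = 'heart';
run;

/* Hypothetical input data */
data have;
  input var1;
  datalines;
1
;
run;

/* Create the character versions only where they are really needed */
data want;
  set have;
  length var2-var5 $ 30;
  var2 = put(var1, diag.);
  var3 = put(var1, dgroup.);
  var4 = put(var1, dsys.);
  var5 = put(var1, dorgan.);
run;
```

In many cases the extra variables are not needed at all, because a procedure can apply the appropriate format to var1 directly with a FORMAT statement.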
Also, consider the possibilities of typos ... not just for a single value as was mentioned, but for a list of values. For example:
if var1 in (1, 3, 7) then var2='cardiovascular disease';
If you have to type out the character versions (instead of 1, 3, 7) the risk of typos increases, and with long values there is an impact on convenience. Furthermore, a typo might not be noticed. The counts for 'cardiovascular disease' might be a bit low, but it's possible they are not noticeably low and nobody checks the spelling.
Finally, formats let you update many programs automatically. If you had a format that contained:
1, 3, 7 = 'cardiovascular disease'
what would happen if you came across another value that should be added:
1, 3, 7, 8 = 'cardiovascular disease'
If you use character strings, you need to update every program that refers to 'cardiovascular disease'. Hope you can find them all. If you use a central format instead, you just update the format definition. All the programs that refer to the format are automatically updated.
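A sketch of such a central format, under some assumptions: the libref fmtlib, the format name cvgrp and the datasets have and cv_cases are invented, and the library path simply reuses the WORK directory as a stand-in for a real shared location.

```sas
/* Stand-in for a shared, permanent format library */
libname fmtlib "%sysfunc(pathname(work))";

/* The single place where the value list is maintained (8 has been added) */
proc format library=fmtlib;
  value cvgrp
    1, 3, 7, 8 = 'cardiovascular disease'
    other      = 'other';
run;

/* Each program only needs to know where to look for formats */
options fmtsearch=(fmtlib work);

/* Hypothetical data and a program that relies on the format */
data have;
  input var1;
  datalines;
1
8
5
;
run;

data cv_cases;
  set have;
  where put(var1, cvgrp.) = 'cardiovascular disease';
run;
```

The programs never mention the list of codes, so adding value 8 to the format definition is the only change required.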
Anyway, food for thought.
@Astounding wrote:
You make many valid and well thought out points. As a lifelong SAS bigot, I notice a few considerations that should be, well, considered.
For some variables, the character equivalent has to be invented. For example, survey responses might use a scale of:
1 = strongly agree
10 = strongly disagree
If you have to make up translations for all 10 possible values, that's work. And it makes life difficult down the road if you want to consider a range of values such as 1 to 3, or if you want to take the average of the responses.
I like the survey response example. In one set of surveys I worked with, about 70% of the questions had the same Yes, No, Don't Know, Refused answer set. One format covered the responses for literally hundreds of questions (similar survey over 30+ years), so we didn't need to recreate look-up lists for all of those.
And the corresponding INFORMATS for reading the initial data sets allowed setting Don't Know and Refused to special missing values in the first place. The remaining 1/0 coding for Yes/No then allowed going directly to analysis without having to create yet another recoded variable.
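Roughly, that approach could be set up as follows. The raw codes Y/N/D/R, the informat name ynin, the format name yn and the sample data are assumptions for illustration, not the actual survey layout.

```sas
proc format;
  /* Informat: read raw codes into 1/0 and special missing values */
  invalue ynin
    'Y' = 1
    'N' = 0
    'D' = .D      /* Don't Know */
    'R' = .R;     /* Refused    */

  /* Format: display the stored values as text */
  value yn
    1  = 'Yes'
    0  = 'No'
    .D = "Don't Know"
    .R = 'Refused';
run;

/* The same informat and format serve every question with this answer set */
data survey;
  input id q1 :ynin. q2 :ynin.;
  format q1 q2 yn.;
  datalines;
1 Y N
2 D Y
3 R N
;
run;

/* Don't Know and Refused are missing, so the mean is the proportion of Yes */
proc means data=survey n nmiss mean;
  var q1 q2;
run;
```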
And don't forget that some procedures really want numeric values, Proc Corr anyone?
In addition to the good points that have already been made: