BookmarkSubscribeRSS Feed
rasmuslarsen
Obsidian | Level 7
 
 
 
What are the advantages to using formats as opposed to storing categorical- or text-information in character columns instead? 
 
At my workplace we use SAS, R, STATA and SPSS and other software. We sometimes have problems using SAS formats between different projects. The problems include:
  • Problems opening a SAS dataset created by other SAS users, because the formats could not be loaded. 
    • Granted, it is still possible to open that SAS dataset by specifying the noformat option, but then the information stored in the formats are lost. 
    • Granted, SAS users should always store a format catalogue together with a SAS dataset, but this is routinely forgotten in at my workplace, because we have a steady influx of new SAS users. 
    • Although storing a SAS format catalogue together with the dataset solves the problem, it adds complexity in storing and managing different versions of SAS data sets because two files are required to use the data, instead of just one file. 
  • Problems opening SAS datasets in R because the packages (the haven-package) does not always read the formats correctly. It reads character variables correctly however. 
  • Using a number to represent another value adds another layer of complexity, which would be avoided if the information was stored as a string. For example, if find it easier to read: if var1 = 'myocardial infarction' then var2 = 'cardiovascular disease'; than to read: if var1 = 3 then var2 = 'cardiovascular disease';
 
I can see that it is advantageous to use a format for dates, as opposed to having a string with the date, because it makes arithmetic and comparisons operations possible. 
Further, I can see that in the past, using formats would save disk space because numeric columns take less space than character columns. 
I am aware that he special-missing types (.A, .B .C etc) would not be possible using character columns. However the special missing only work for numeric-variables anyway, which detracts from their usability anyway. 
Further, it is my experience that using special missing variables confuses people more than they help people, because they add implicit exception handling to the proc freq procedure for example. 
However, at least in my experience, this is of no practical value, because disk space is so plentiful and cheap today (2019). 
 
I would like to hear if you agree or disagree with the critique of SAS formats described above, and the proposed solution (just storing categorical information in character-variables)?
I would also like to hear what you do, or would do in a similar situation?
Does using SAS formats work for you? What are the If, so what are your tactics for avoiding the above mentioned problems?
What do YOU see as the advantages and disadvantages of using SAS formats as opposed to storing categorical/text-like information in character-columns instead of numerical columns.

6 REPLIES 6
andreas_lds
Jade | Level 19

I follow two rules, when using self-defined formats:

  1.  Never attach self-defined formats to non-temporary datasets that are used by others if the format is not stored in a common catalog.
  2. Formats are applied as late as possible to preserve original values in data as long as possible.

One benefit, when using formats is: if a text has to be changed, you change it in one place (the format-definition) no using-dataset has to be updated.

 

Kurt_Bremser
Super User

Reducing the stored size of datasets is not so much important because of used disk space, but because of I/O throughput. Most of the time, SAS servers are I/O bound, but not CPU bound. Once you move data to in-memory servers, the effect becomes less pronounced.

 

But simply comparing

if var1 = 'myocardial infarction' then var2 = 'cardiovascular disease';

with

if var1 = 3 then var2 = 'cardiovascular disease';

from a coder's POV makes the question moot. You'll have more problems with the first, as it provides much more opportunity for typos.

Replacing the whole statement (and the associated if-then-else avalanche) with the simple use of a format reduces the code to a mere 10% or less from what it was before, and makes it failsafe. Maintaining a format from a lookup dataset automatically will reduce code complexity even further.

 

A mixed environment will always introduce levels of complexity caused by different ways of thinking about data. To remedy that, you need clear guidelines for how to jump the barrier, and those need to be communicated and followed.

EG "when data is prepared for R, replace categorical variables with the long texts by using put()".

If users repeatedly cause havoc by not following those guidelines, then their competence or willingness to do work must be questioned, with the usual consequences.

 

As you can guess, in the pure SAS environment here where I am responsible for, your problems do not exist, and formats are simply the way to do it. If a custom format is used in a dataset, and the format is not created and stored centrally, then the code that creates the format MUST be included in the package if data is handed over to someone else.

ballardw
Super User

@Kurt_Bremser wrote:

A mixed environment will always introduce levels of complexity caused by different ways of thinking about data. To remedy that, you need clear guidelines for how to jump the barrier, and those need to be communicated and followed.

 

Very much an issue to maintain these two bits.

 

I hated working in an SPSS only shop because every release of SPSS meant rewriting many "how to" documents because SPSS would drop features and existing code had to be modified. Often taking days just to find the work around before being able to document the process. This was such a headache changing one of the reports a client liked that we kept a version of SPSS loaded on a couple of machines that was 6 versions out of date just to create that one report.

Astounding
PROC Star

You make many valid and well thought out points  As a lifelong SAS bigot, I notice a few considerations that should be, well, considered.

 

For some variables, the character equivalent has to be invented.  For example, survey responses might use a scale of:

 

1 = strongly agree

10 = strongly disagree

 

If you have to make up translations for all 10 possible values, that's work.  And it makes life difficult down the road if you want to consider a range of values such as 1 to 3, or if you want to take the average of the responses.

 

This points out another consideration ... who is going to check all the translations?  In the case of a survey, the same translations might apply to dozens of variables.  Who is volunteering to make sure all the dozens of variables were coded correctly?

 

For some variables, you will need multiple translations.  You might need something like:

 

if var1=1 then do;
   var2='myocardial infarction';
   var3='cardiovascular disease';
   var4='circulatory system';
   var5='heart';
end;

Keeping track of which variable is which involves two tasks.  First, you have to know the list of available character strings and select the right one.  And second, when creating the character strings (assuming many values might categorize as "heart"), you have to verify that the spelling is identical for all relevant variables (for example, always "heart" and never "Heart"). I assume you would  handle this by creating formats anyway, and using the formats instead of IF THEN statements to create the translations.

 

Finally, consider the possibilities of typos ... not just for a single value as was mentioned, but for a list of values.  For example:

 

if var1 in (1, 3, 7) then var2='cardiovascular disease';

If you have to type out the character versions (instead of 1, 3, 7) the risk of typos increases, and with long values there is an impact on convenience  Furthermore, a typo might not be noticed.  The counts for 'cardiovascular disease' might be a bit low, but it's possible they are not noticeably low and nobody checks the spelling.

 

Finally, formats let you update many programs automatically.  If you had a format that contained:

 

1, 3, 7 = 'cardiovascular disease'

 

what would happen if you came across another value that should be added:

 

1, 3, 7, 8 = 'cardiovascular disease'

 

If you use character strings, you need to update every program that refers to 'cardiovascular disease'.  Hope you can find them all.  If you use a central format instead, you just update the format definition.  All the programs that refer to the format are automatically updated.

 

Anyway, food for thought.

ballardw
Super User

@Astounding wrote:

You make many valid and well thought out points  As a lifelong SAS bigot, I notice a few considerations that should be, well, considered.

 

For some variables, the character equivalent has to be invented.  For example, survey responses might use a scale of:

 

1 = strongly agree

10 = strongly disagree

 

If you have to make up translations for all 10 possible values, that's work.  And it makes life difficult down the road if you want to consider a range of values such as 1 to 3, or if you want to take the average of the responses.

 


I like the survey response example. In one set of surveys I worked with we had about 70% of the questions with the same Yes, No, Don't Know, Refused answer set. One Format covered the responses for literally hundreds of questions (similar survey over 30+ years). So we didn't need to recreate look-up lists for all of those.

And the corresponding INFORMATS for reading the initial data sets allowed setting the Don't Know and Refused to special missing in the first place so the remaining 1/0 coding for Yes/No allowed going directly to analysis without having to create yet another recoded variable.

And don't forget that some procedures really want numeric values, Proc Corr anyone?

FreelanceReinh
Jade | Level 19

In addition to the good points that have already been made:

 

  • Possibly the numeric codes provide a more appropriate sort order than the alphabetical order of the corresponding character values. Ordinal categorical variables such as Astounding's example of survey responses are a case in point. And if the alphabetical order is preferable, you can still use the option ORDER=FORMATTED, which is available in various procedures.
  • The same data can be summarized at different levels by means of different formats (e.g. survey responses 1-5='positive', 6-10='negative') or reported with abbreviated/more detailed labels, depending on the intended use.
  • Character variables involve the risk of truncation if new values exceed the defined length or if different lengths were defined in different datasets.
  • Numeric values (preferably integers) also avoid problems with leading blanks, invisible characters, encoding, etc.
  • Formats can be used as lookup tables ("merging without sorting").
  • Sometimes data need to be loaded into memory (cf. SAS hash object or SASFILE statement), which is often more limited than disk space.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1108 views
  • 11 likes
  • 6 in conversation