What are the advantages of using formats, as opposed to storing categorical or text information in character columns?
At my workplace we use SAS, R, Stata, SPSS and other software. We sometimes have problems using SAS formats across different projects. The problems include:
Problems opening a SAS dataset created by other SAS users, because the formats could not be loaded.
Granted, it is still possible to open that SAS dataset by specifying the NOFMTERR system option (options nofmterr;), but then the information stored in the formats is lost.
Granted, SAS users should always store a format catalogue together with a SAS dataset, but this is routinely forgotten at my workplace, because we have a steady influx of new SAS users.
Although storing a SAS format catalogue together with the dataset solves the problem, it adds complexity in storing and managing different versions of SAS data sets because two files are required to use the data, instead of just one file.
Problems opening SAS datasets in R, because the package we use (haven) does not always read the formats correctly. It does read character variables correctly, however.
Using a number to represent another value adds another layer of complexity, which would be avoided if the information was stored as a string. For example, I find it easier to read: if var1 = 'myocardial infarction' then var2 = 'cardiovascular disease'; than to read: if var1 = 3 then var2 = 'cardiovascular disease';
I can see that it is advantageous to use a format for dates, as opposed to having a string with the date, because it makes arithmetic and comparison operations possible.
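To make the comparison concrete, here is a minimal sketch of the two styles side by side. The format name diagfmt, the codes 1 to 3, their labels and the dataset names are all made up for illustration; they are not taken from any real project.

```sas
/* Hypothetical format: codes and labels are invented for this example */
proc format;
  value diagfmt
    1 = 'angina pectoris'
    2 = 'stroke'
    3 = 'myocardial infarction';
run;

/* Style 1: numeric code plus format.
   The reader has to know (or look up) that 3 means myocardial infarction. */
data coded;
  input var1;
  format var1 diagfmt.;
  length var2 $ 30;
  if var1 = 3 then var2 = 'cardiovascular disease';
  datalines;
1
3
;
run;

/* Style 2: the value is stored as text, so the condition documents itself
   and the dataset can be read without any format catalogue. */
data as_text;
  infile datalines truncover;
  length var1 $ 30 var2 $ 30;
  input var1 $30.;
  if var1 = 'myocardial infarction' then var2 = 'cardiovascular disease';
  datalines;
angina pectoris
myocardial infarction
;
run;
```

In the first data step the dataset only displays readable labels if diagfmt is available in the session; in the second, the labels travel with the data itself.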
Further, I can see that in the past, using formats would save disk space, because numeric columns take less space than character columns. However, at least in my experience, this is of no practical value today (2019), because disk space is so plentiful and cheap.
I am aware that the special missing values (.A, .B, .C etc.) would not be possible using character columns. However, special missing values only work for numeric variables, which limits their usefulness.
Further, it is my experience that special missing values confuse people more than they help, because they add implicit exception handling, for example in proc freq.
I would like to hear if you agree or disagree with the critique of SAS formats described above, and the proposed solution (just storing categorical information in character-variables)?
I would also like to hear what you do, or would do in a similar situation?
Does using SAS formats work for you? If so, what are your tactics for avoiding the above-mentioned problems?
What do YOU see as the advantages and disadvantages of using SAS formats, as opposed to storing categorical or text-like information in character columns?