Hi all,
This is not really a ‘pure SAS’ question but hopefully some might find it interesting regardless. Moderators please feel free to delete if you feel it does not belong here.
Many of us receive data to analyse in either an excel, csv or other format, which is then imported to SAS for analysis. The content of these files is usually messy (to say the least) and we end up spending some time cleaning the data file before importing it to SAS, which is an unnecessary waste of time (I hope I am not the only one having this problem?). It is quite common that untidy data result in problems in coding execution and sometimes even errors in the results of analysis.
I thought it might therefore be good to create a document or a list of instructions to the person delivering the data to avoid this problem. I am hoping to draw from the wisdom of this society to make such instructions as complete and good as possible. Please feel free to change, edit or add suggestions:
It would be great to have input from the society, do you recommend different formats for variables names or contents? How do you communicate with those delivering the data to you to ensure that you do not end up with unnecessary file cleaning.
File format:
NEVER Excel. NEVER. Unless you like to wade in stinking manure. Excel files force you to use guessing tools on the SAS side, which also means cleaning up the **** afterwards. On top of that, serious limitations that can cause data loss (see my recent article in the Communities Library).
Instead:
Variable names: have to follow SAS V7 standards, period. The only thing I tolerate are the # characters used in DB/2, as I can easily and consistently replace them with underlines. This means
Good idea of documenting encodings (F=female, M=male), so it's easy to create formats for them; such encodings (if larger than a few entries, or dynamic) should be provided as separate tables for easy import into SAS
Dates: use ISO 8601 compliant dates/times/datetimes
File format:
NEVER Excel. NEVER. Unless you like to wade in stinking manure. Excel files force you to use guessing tools on the SAS side, which also means cleaning up the **** afterwards. On top of that, serious limitations that can cause data loss (see my recent article in the Communities Library).
Instead:
Variable names: have to follow SAS V7 standards, period. The only thing I tolerate are the # characters used in DB/2, as I can easily and consistently replace them with underlines. This means
Good idea of documenting encodings (F=female, M=male), so it's easy to create formats for them; such encodings (if larger than a few entries, or dynamic) should be provided as separate tables for easy import into SAS
Dates: use ISO 8601 compliant dates/times/datetimes
In addition to @Kurt_Bremser about documentation:
The ORDER of the variables in the source data set needs to stay the same in all files.
For character variables the maximum length that will be used/processed.
If at all possible try to arrange something to avoid use of " or ' characters in the body of text. Example: 8"x 10" (here these are measurements of an object in inches) , or Container "B" . The embedded quotes can cause serious headaches and are often not needed such as the quotes around the B.
The number of header rows (if any) should be agreed on. Yes SAS can read data starting from row X. But you shouldn't have to change that row from file to file.
NO summary rows (though this is more typical with Excel some people will send you a "report" type file with such).
If a value for data not present is to be included in the file, agree what it should be and use only one. Nothing more fun than tracing down mixtures of UNK, unknown, UK, N/A, NULL,
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.