Hi all,
This is not really a ‘pure SAS’ question but hopefully some might find it interesting regardless. Moderators please feel free to delete if you feel it does not belong here.
Many of us receive data to analyse in either an excel, csv or other format, which is then imported to SAS for analysis. The content of these files is usually messy (to say the least) and we end up spending some time cleaning the data file before importing it to SAS, which is an unnecessary waste of time (I hope I am not the only one having this problem?). It is quite common that untidy data result in problems in coding execution and sometimes even errors in the results of analysis.
I thought it might therefore be good to create a document or a list of instructions to the person delivering the data to avoid this problem. I am hoping to draw from the wisdom of this society to make such instructions as complete and good as possible. Please feel free to change, edit or add suggestions:
File format: should be either .csv or xlsx.
Data sheets
The first data sheet should be called ‘data’ and should contain the data (see below)
The second datasheet should be called ‘dictionary’ and should contain the labels for the variables, e.g if there are two variables in the datasheet, gender(1,0) and colour(1,2,3) then the dictionary sheet should indicate that gender(1=female, 0=male), colour(1=red, 2=green, 3=blue) and so forth.
The third datasheet should be called ‘analysis’, and should list all the required analysis and indicate which variables to use, such as:
Univariate analysis
List of variables to analyse
Strata variable
Logistic regression
Dependant variable=’variable name’
Independent variables to include =’list of variable’
Survival analysis
Time variable=’variable name’
Entry time variable =’variable name’
Exit variable=’variable name’
Censor variable=’variable name’
independent variables for adjustment=’variable list’
Variable names:
Should be short and concise (maximum of two words). A few months ago I received an excel file with some variables as long as seven words! Not fun writing a long variable name in code.
All variable names should be low case. It is sometimes hard to keep track of combinations of capitals and small letters leading to errors in the execution.
No spaces in variable names (I really hate this), ideally use underscore instead.
Data contents:
Do not mix numbers and text in the same variable
Use 1 and 0 instead of yes/ no (thoughts?)
Do not enter any text or symbols for missing values, leave the cells empty.
Dates (this is another killer): use a uniform format dd-mm-yyyy across the entire file. If time is also needed add it as hh:mm:ss
Avoid adding colours in the sheet (for the love of God I do not want an excel sheet with 10 different colours).
It would be great to have input from the society, do you recommend different formats for variables names or contents? How do you communicate with those delivering the data to you to ensure that you do not end up with unnecessary file cleaning.
... View more