BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ammarhm
Lapis Lazuli | Level 10

Hi all,

This is not really a ‘pure SAS’ question but hopefully some might find it interesting regardless. Moderators please feel free to delete if you feel it does not belong here. 

Many of us receive data to analyse in either an excel, csv or other format, which is then imported to SAS for analysis. The content of these files is usually messy (to say the least) and we end up spending some time cleaning the data file before importing it to SAS, which is an unnecessary waste of time (I hope I am not the only one having this problem?). It is quite common that untidy data result in problems in coding execution and sometimes even errors in the results of analysis. 

I thought it might therefore be good to create a document or a list of instructions to the person delivering the data to avoid this problem. I am hoping to draw from the wisdom of this society to make such instructions as complete and good as possible. Please feel free to change, edit or add suggestions:

  1. File format: should be either .csv or xlsx.
  2. Data sheets
    1. The first data sheet should be called ‘data’ and should contain the data (see below)
    2. The second datasheet should be called ‘dictionary’ and should contain the labels for the variables, e.g if there are two variables in the datasheet, gender(1,0) and colour(1,2,3) then the dictionary sheet should indicate that gender(1=female, 0=male), colour(1=red, 2=green, 3=blue) and so forth.
    3. The third datasheet should be called ‘analysis’, and should list all the required analysis and indicate which variables to use, such as:
      1.      Univariate analysis
        1. List of variables to analyse
        2. Strata variable
      2.      Logistic regression
        1. Dependant variable=’variable name’
        2. Independent variables to include =’list of variable’
  •      Survival analysis
    1. Time variable=’variable name’
    2. Entry time variable =’variable name’
    3. Exit variable=’variable name’
    4. Censor variable=’variable name’
    5. independent variables for adjustment=’variable list’

 

  1. Variable names:
    1. Should be short and concise (maximum of two words). A few months ago I received an excel file with some variables as long as seven words! Not fun writing a long variable name in code.
    2. All variable names should be low case. It is sometimes hard to keep track of combinations of capitals and small letters leading to errors in the execution.
    3. No spaces in variable names (I really hate this), ideally use underscore instead.
  2. Data contents:
    1. Do not mix numbers and text in the same variable
    2. Use 1 and 0 instead of yes/ no (thoughts?)
    3. Do not enter any text or symbols for missing values, leave the cells empty.
    4. Dates (this is another killer): use a uniform format dd-mm-yyyy across the entire file. If time is also needed add it as hh:mm:ss
  3. Avoid adding colours in the sheet (for the love of God I do not want an excel sheet with 10 different colours).

 

 

It would be great to have input from the society, do you recommend different formats for variables names or contents? How do you communicate with those delivering the data to you to ensure that you do not end up with unnecessary file cleaning.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

File format:

NEVER Excel. NEVER. Unless you like to wade in stinking manure. Excel files force you to use guessing tools on the SAS side, which also means cleaning up the **** afterwards. On top of that, serious limitations that can cause data loss (see my recent article in the Communities Library).

Instead:

  • CSV (can be separated with standard commas or "Excel" semicolons)
  • Fixed-column text; this is (e.g.) delivered by DB/2 unloads
  • tab-separated text
  • any other delimiter that makes sense (| - pipe character - comes to mind)

Variable names: have to follow SAS V7 standards, period. The only thing I tolerate are the # characters used in DB/2, as I can easily and consistently replace them with underlines. This means

  • max 32 characters; you don't want Vogon poetry in variable names
  • can contain digits, letters, underlines
  • start with either a letter or underline
  • NO blanks (therefore no "multiple words")

Good idea of documenting encodings (F=female, M=male), so it's easy to create formats for them; such encodings (if larger than a few entries, or dynamic) should be provided as separate tables for easy import into SAS

 

Dates: use ISO 8601 compliant dates/times/datetimes

  • dates must be provided in YYYY-MM-DD format
  • datetimes as YYYY-MM-DDTHH:MM:SS.ssssss (T can be replaced by blank, as the E8601DT informat provided by SAS tolerates this)

 

View solution in original post

2 REPLIES 2
Kurt_Bremser
Super User

File format:

NEVER Excel. NEVER. Unless you like to wade in stinking manure. Excel files force you to use guessing tools on the SAS side, which also means cleaning up the **** afterwards. On top of that, serious limitations that can cause data loss (see my recent article in the Communities Library).

Instead:

  • CSV (can be separated with standard commas or "Excel" semicolons)
  • Fixed-column text; this is (e.g.) delivered by DB/2 unloads
  • tab-separated text
  • any other delimiter that makes sense (| - pipe character - comes to mind)

Variable names: have to follow SAS V7 standards, period. The only thing I tolerate are the # characters used in DB/2, as I can easily and consistently replace them with underlines. This means

  • max 32 characters; you don't want Vogon poetry in variable names
  • can contain digits, letters, underlines
  • start with either a letter or underline
  • NO blanks (therefore no "multiple words")

Good idea of documenting encodings (F=female, M=male), so it's easy to create formats for them; such encodings (if larger than a few entries, or dynamic) should be provided as separate tables for easy import into SAS

 

Dates: use ISO 8601 compliant dates/times/datetimes

  • dates must be provided in YYYY-MM-DD format
  • datetimes as YYYY-MM-DDTHH:MM:SS.ssssss (T can be replaced by blank, as the E8601DT informat provided by SAS tolerates this)

 

ballardw
Super User

In addition to @Kurt_Bremser about documentation:

 

The ORDER of the variables in the source data set needs to stay the same in all files.

 

 

For character variables the maximum length that will be used/processed.

If at all possible try to arrange something to avoid use of " or ' characters in the body of text. Example: 8"x 10" (here these are measurements of an object in inches) , or Container "B"   . The embedded quotes can cause serious headaches and are often not needed such as the quotes around the B.

 

The number of header rows (if any) should be agreed on. Yes SAS can read data starting from row X. But you shouldn't have to change that row from file to file.

 

NO summary rows (though this is more typical with Excel some people will send you a "report" type file with such).

 

If a value for data not present is to be included in the file, agree what it should be and use only one. Nothing more fun than tracing down mixtures of UNK, unknown, UK, N/A, NULL,

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1088 views
  • 1 like
  • 3 in conversation