3 common messy data problems and how to tidy them in SAS

7 Likes

You have messy data, data which cannot be analyzed in SAS® Enterprise Miner™ due to its structure. How do you manipulate and restructure this data to make it tidy? How do you bring it into a form to which analytical models in SAS Enterprise Miner can be applied successfully? Hadley Wickham describes in great detail the attributes of "tidy data" and the process for making messy data tidy (Wickham, 2014). In short, a tidy data set has the following two attributes:

Each column forms a variable
Each row forms an observation

In a messy data set, typically a measure variable will be broken out by one or more dimension values into multiple measure columns, or one column will include two or more dimensions.

In this tip, I discuss three common messy data problems described by Wickham, and introduce the SAS code that is used to tidy this data. I also include an example messy data set and how this SAS code is used to transform and make this data set tidy. With this tip, I address Sections 3.1, 3.2, and 3.3 from Wickham’s article.

Messy Data Scenario 1 – Dimension values stored as column names

This data is stored in a presentation/data collection style format, where multiple Measure columns are described by the Dimension values in the column names. In this example, Measure column names W, X, Y, and Z comprise the values for a Dimension variable:

To make this data tidy, I melt, or stack, the dimension measure columns into one measure variable, and generate the dimension variable (Dim):

SAS Code used to accomplish this restructuring:

/*Generate the _CASE_ column. Clear-out any labels on the measure columns.*/
Data _temp_indata /view=_temp_indata;
 Set indata1;
 _CASE_=_n_;
 Label W=;
 Label X=;
 Label Y=;
 Label Z=;
Run;

/*Transpose columns W,X,Y,Z to generate the MEASURE and DIM columns*/
Proc transpose data=_temp_indata out=outdata (rename=(col1=MEASURE) drop=_CASE_) name=DIM;
 By _CASE_ YEAR NOTSORTED;
 Var W X Y Z;
Run;

Messy Data Scenario 2 – Multiple dimension variables stored in one column

In the below data example, a column contains values for multiple Dimension variables: X and Y in column Dim are the values for one dimension, and 1 and 2 are the values for another dimension.

To make this data tidy, I parse the Dim column to generate Variables Dim1 and Dim2:

SAS Code used to accomplish this restructuring:

/*Parse DIM to extract DIM1 and DIM2.*/
Data outdata;
 Set indata;
 Length DIM1 $1 DIM2 $1;
 DIM1=ksubstr(DIM,1,1);
 DIM2=ksubstr(DIM,2,1);
 Drop DIM;
Run;

Messy Data Scenario 3 – Dimension values stored as column names, and Measure variables stored in rows

This is data stored in a presentation/data collection style format, where multiple Measure columns are described by the Dimension values in their Names, and multiple Measure variables are stored in rows. In this example, Measure column names W, X, Y, and Z comprise the values of a Dimension variable, and Measure variables A and B are stored in rows. Column Type contains the names of the Measure variables (A and B):

To make this data tidy, I melt, or stack, the data to generate the dimension variable (Dim), and Measure variables A and B are rotated out as columns:

SAS Code used to accomplish this restructuring:

/*Clear-out any labels on the W,X,Y,Z measure columns.*/
Data _temp_indata3;
 Set indata;
 Label W=;
 Label X=;
 Label Y=;
 Label Z=;
Run;

/*Sort by the Identifier variables.*/
Proc sort data=_temp_indata3;
 By year;
Run;

/*Transpose the W,X,Y,Z measure columns to generate the A and B measure variables 
  and the dimension variable (DIM).*/
Proc transpose data=_temp_indata3 out=outdata name=DIM;
 By year;
 Var W X Y Z;
 Id type;
Run;

How can I run one of these programs myself?

The above described messy data problems are addressed specifically by three SAS programs:

1) tidy1.sas (Messy Data Scenario 1)
2) tidy2.sas (Messy Data Scenario 2)
3) tidy3.sas (Messy Data Scenario 3)

These SAS programs are housed in, and can be accessed from, the following GitHub repository created by Patrick Hall:

Download the Files (GitHub)

Before running one of these programs to address a messy data scenario, the program must be edited to provide property settings that are specific to the User’s data. I provide a data example below to illustrate the usage of two of these programs.

My mocked-up data for this example contains the number of cases for two strains of a disease aggregated by year and by two dimensions: Sex and Ethnicity.

The STRAIN1 and STRAIN2 measure variables are broken out into multiple measure columns by the Sex and Ethnicity dimensional values. The names of these measure columns contain the Sex and Ethnicity values separated by a colon. For example: In 2000 there were nine Female Native American (f:n) persons who had Strain 2 of the disease.

Both Messy Data Scenarios 2 and 3 are incorporated in this data set. To restructure this data set and make it tidy, program tidy3.sas needs to be run, and then tidy2.sas needs to be run.

First, tidy3.sas is edited to make the following property settings that are specific to this data:

/*Name of the input dataset.  Must include the Libref if the dataset is not in WORK.*/
%let indata3=tidy.mock_strain;

/*Name of the generated output dataset.  Include a Libref as needed.  Must conform to
  SAS Name rules.  If blank, WORK._TIDY3_ will be generated.*/
%let outdata3=tidy.mockstrain_tidy3;

/*Character column whose values are the names of the Measure Variables.  Specify one column
  name within the single quotes.  Required.*/
/*This column must be populated for every row.*/
%let measNamesCol='strain';

/*Identifier (Fixed) Variables*/
/*Specify these variables by inserting them on lines immediately after the datalines4
  statement below, one per line.*/
/*Do not include the column specified for &measNamesCol (above) in this list.*/
Data idvars;
input @1 name $char32.;
datalines4;
year
;;;;
Run;

Next, tidy3.sas is run. Output data set MOCKSTRAIN_TIDY3 is now restructured such that it contains the _DIM_ column and the STRAIN1 and STRAIN2 measure variable columns. _DIM_ includes the Sex and Ethnicity dimension values.

Now tidy2.sas is used to break-out Sex and Ethnicity into two columns. First, tidy2.sas is edited to make the following property settings:

/*Name of the input dataset.  Must include the Libref if the dataset is not in WORK.
  Required.*/
%let indata2=tidy.mockstrain_tidy3;

/*Name of the generated output dataset.  Include a Libref as needed.  If blank, WORK._TIDY2_
  will be generated.*/
%let outdata2=tidy.mockstrain_tidy2;

/*Name of the column containing values for the dimension variables.  Specify the Column name
  within the single quotes.*/
%let dimcol='_DIM_';

/*Space or comma delimited list of the names of the Dimension Variables.  The specified Names
  must conform to SAS Name rules.*/
%let dimnames=SEX ETHNICITY;

/*Space or comma delimited list of the character lengths of the Dimension Variables.  Specify
  integer values.*/
/*The number of values specified must match the number of values specified for &dimnames
  above.*/
%let dimlengths=1 1;

/*Method for how the Dimension variable values are stored in the Column.  Required.*/
/*1=Delimited by one or more specified characters.  
  2=Fixed Start Character position, counting from the left.  The Start position for a
    Dimension variable needs to be the same on every Column value.*/
%let dimstoremethod=1;  /*Valid value: 1 or 2.*/

/*Delimiter character list for parsing the column values. Specify the characters, including
  a space as needed, within the single quotes.*/
/*Used if &dimstoremethod=1.*/
/*The first delimited value is assigned to the first Dimension Variable in &dimnames,
  the second delimited value is assigned to the second Dimension Variable in &dimnames, etc.*/
/*If no characters are specified, the following default list of characters will be used:
  blank . < ( + & ! $ * ) ; ^ – / , % |
*/
%let parsedelim=':';

Next, tidy2.sas is run. Output table MOCKSTRAIN_TIDY2 now contains columns SEX and ETHNICITY, which have been generated by parsing the _DIM_ values. This data set is now normalized and tidy. Analytical and statistical models can now be applied successfully to this data set.

I’ve discussed the attributes of tidy data, and shown three typical scenarios that make data messy. Additionally, I’ve presented the SAS code that is used to address the three messy data scenarios, and presented a messy data example along with the process that’s used to run the SAS code against that data to make it tidy.