
Cutting-Edge Techniques and SAS® Macros to Clean, Prepare and Manage Data

Paper 1177-2021
Authors 

Kim Chantala, RTI International

Abstract

This presentation provides an overview of a new SAS book, The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data, which helps you keep data preparation on schedule and within budget. You will discover efficient and low-cost ways to automate the creation of project documentation, such as codebooks describing your variables, catalogs describing your data sets, and crosswalks showing the relationship of variables across data sets. These macros also automatically create quality control reports identifying anomalies in your data, including out-of-range values, incomplete format definitions, violations of skip patterns, variables with no variation in their values, and variables missing labels or assigned formats. The only requirement for producing these documents and reports is having SAS data sets with formats and labels. Traditionally, these documents are produced at the end of the project with a great deal of programming and manual effort, but these tools allow the programmer to seamlessly create them at any time during the data preparation task. Producing these documents early in data collection improves data quality as well as communication between the data collection team and the client. The SAS macros covered in this presentation are provided as open-source code and can be downloaded from the SAS author page for this book.

Watch the presentation

Watch Cutting-Edge Techniques and SAS® Macros to Clean, Prepare and Manage Data as presented by the author on the SAS Users YouTube channel.

 

Introduction

You will find the right data tools in my book, The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data, for creating project data that is ready for exploration and analysis. Using these tools will reduce the amount of time needed to clean, edit, validate, and document your data. Advantages of using the techniques in my book include:

  • Accomplishing more while doing less by automating and modernizing the typical data preparation activities
  • Beginning at the end by creating research-ready data sets and documentation early in the project with continual updates and improvements throughout collection and preparation
  • Keeping the sponsor or lead research investigators engaged by providing codebooks, crosswalks, and data catalogs for review early in the project, thus including them as part of quality control surveillance for the data

My book includes a set of SAS macro programs that automate many of the labor-intensive tasks that you perform during data preparation. Using these macro programs will help guard against compromising quality control and documentation efforts due to rigid project budgets and timelines. You will be able to automate producing codebooks, crosswalks, and data catalogs. Innovative logic built into these macro programs computerizes monitoring the quality of your data using information from the formats and labels created for the variables in your data set. You will receive concise reports identifying invalid data, such as out-of-range values, missing data, and redundant or contradictory values.

You only need to create a SAS data set with labels and formats assigned to each variable to use these macro programs. Not only are these SAS macros and techniques helpful for preparing your own research-friendly data sets, but they can also be used on data sets you receive that are not well documented.
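
If you have not assigned these attributes before, the minimal sketch below shows the only preparation the toolkit requires. The folder path, format values, labels, and data values are hypothetical; the variable names CASEID and DEM2 follow the examples later in this paper:

libname library "/Data_Detective/Formats";   /* hypothetical folder for permanent formats;
                                                the LIBRARY libref is searched automatically */
proc format library=library;
    value sexf 1='1=Female' 2='2=Male';      /* hypothetical format */
run;

data work.StudyA_prelim;
    input CASEID DEM2;
    label CASEID = 'Participant identifier'  /* hypothetical labels */
          DEM2   = 'Sex of participant';
    format DEM2 sexf.;
    datalines;
101 1
102 2
;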

 

It could not be easier or faster to create data that you can trust. The SAS macro programs accompanying my book are available at no charge and can be downloaded from my author page at http://support.sas.com/publishing/authors.

 

In my book you will learn how to use these macro programs to make your job easier and create higher quality data. This paper introduces you to the macro programs accompanying my book and highlights how they can help solve many of the problems that you face in data preparation.

 

AN OVERVIEW OF THE DATA DETECTIVE'S TOOLKIT

Data preparation is a heroic task: raw data often contain inconsistencies and anomalies that you must resolve to make the data usable. Your job will include:

  • Investigating unexpected or missing values
  • Resolving conflicting information across variables
  • Mitigating incorrect flow through skip patterns
  • Examining incomplete data
  • Combining multiple data sets with different attributes
  • Documenting changes in data collection methods or instruments during collection

Reconciling these issues requires careful investigation and remediation during data cleaning and preparation. Rapid advancement in software for both data collection and analysis has encouraged the collection of more complex data, creating greater challenges for you as the programmer responsible for turning it into high-quality, research-friendly data. Advances in software to help you solve these issues have progressed at a slower pace than advances in software for analyzing or collecting data. This lag in the development of computerized tools for data preparation motivated the development of the macro programs included with my book.

 

These macro programs have been developed to help you work more efficiently when preparing data and automate much of the tedious work in identifying and correcting problems in your data. Table 1 lists the macro programs provided with my book and what they will do for you.

 

[Table 1. Macro programs provided with The Data Detective's Toolkit]

The only requirement for using these data tools is creating SAS data sets with formats and labels assigned to each variable. Once you have the SAS data set created, you will only need a simple line of SAS code to invoke each of the data tools. The first three macro programs listed in Table 1 create useful documentation for your data sets. You can create them at the beginning of the project and benefit by having them available for everyone in your team.

%TK_CODEBOOK

The first tool, %TK_codebook, creates a codebook. This macro uses one statement requiring only that you provide macro parameters defining the following information about your SAS data set and codebook:

  • Libref identifying the location of your SAS data set (lib=)
  • Name of your SAS data set (file1=)
  • Library for the formats assigned to the variable (fmtlib=)
  • Type of file to create for storing the codebook (cb_type=)
  • Name for your codebook (cb_file=) 

These five macro parameters are required and appear in the SAS code below:

 

Title "Master Codebook for Study A Preliminary Data";
%TK_codebook(lib = work,
        file1 = StudyA_prelim,
        fmtlib = library,
        cb_type = XLSX,
        cb_file = &WorkFolder./Test_CodeBook.xlsx,
        var_order = internal,
        cb_output = my_codebook,
        cb_size = BRIEF,
        organization = One record per CASEID,
        include_warn = YES);

The last five macro parameters can be used to do the following:

  • Control the ordering of variables in the codebook (var_order=)
  • Obtain an output data set containing variables for the information listed in the codebook (cb_output=)
  • Display brief or full information on variables  (cb_size=)
  • Show the organization of the data set (organization=)
  • Create reports highlighting potential problems in the data (include_warn=)

The %TK_codebook program creates a magnificent codebook as shown in Table 2 below. 

 

Kim_Chantala_1-1617908935272.png

Without %TK_codebook, you would need to examine output from PROC CONTENTS, PROC FREQ, and PROC FORMAT to gather all of this information about your data set.
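
For comparison, the manual equivalent would look something like this sketch, using the data set and format library from the earlier example:

proc contents data=work.StudyA_prelim;   /* variable attributes */
run;

proc freq data=work.StudyA_prelim;       /* one-way tables for every variable */
run;

proc format library=library fmtlib;      /* format definitions */
run;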

But the best feature is yet to come! %TK_codebook will also examine each variable and print informative reports about potential problems. These reports are printed after the codebook. Using information from the label and format assigned to each variable, the %TK_codebook macro warns your data team about variables having the problems shown in Tables 3 through 7.

 

[Tables 3 through 7. Reports of potential problems identified by %TK_codebook]

On the call to %TK_codebook, we specified INCLUDE_WARN=YES, which produces these reports. This is the default for %TK_codebook. If you want to suppress these reports, use INCLUDE_WARN=NO.

 

For each variable automatically examined in these reports, you would otherwise have to write several SAS statements and examine multiple tables to figure out which variables need further examination. If your data set has 1,000 variables, you would write SAS statements creating over 2,000 tables, examine each table manually to identify problems, and then summarize the problems that need investigation. With the reports from %TK_codebook, you are presented with a concise summary of only those variables needing close examination and why they need it. You will spend your time correcting problems rather than writing repetitive SAS code and examining piles of SAS output. Chapter 3 teaches you how to use %TK_codebook to create a codebook and potential-problem reports.

 

If you would like your codebook to have a different appearance than the codebook shown in Table 2, you can obtain an output data set from %TK_codebook and use SAS to customize the appearance. Chapter 4 teaches you how to do this, and how to add additional information about variables to the data set used to create your codebook.
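
As a rough illustration only (this is not the book's Chapter 4 code), the output data set requested earlier with cb_output=my_codebook could be printed with your own layout. The column names below are hypothetical placeholders, so inspect my_codebook first to see the variables %TK_codebook actually provides:

proc contents data=my_codebook;   /* discover the columns %TK_codebook provides */
run;

proc print data=my_codebook label noobs;
    var name label type format;   /* hypothetical column names */
run;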

 

%TK_INVENTORY

A catalog of all the SAS data sets for your project can be created at any time during the data life cycle with the %TK_inventory macro. Simply provide the full path of the folder where your data sets reside in a LIBNAME statement, then pass the libref you specified as a macro parameter:

libname SAS_data "/Data_Detective/Book/SAS_Datasets";
%TK_inventory(libref=SAS_data);
[Table 8. Data set inventory created by %TK_inventory]

The inventory catalog shown in Table 8 provides a concise summary of the data sets and their locations, making it an ideal document for communicating what data are available. It also makes it easier for you and your team to track the progression of developing your data sets. Chapter 5 teaches you how to use the %TK_inventory macro tool.

%TK_XWALK

The %TK_xwalk tool creates a data crosswalk to help you identify equivalent variables in multiple data sets, as well as differences in the attributes of variables having the same name in more than one data set. You only need one short statement with a list of data sets for %TK_xwalk to create your crosswalk, as shown below:

%TK_xwalk(SetList = SAS_Data.studya SAS_Data.demography SAS_Data.health);

This statement creates a mapping of variables across two or more distinct data sets. Reviewing the crosswalk will help you identify variables with the same name that can potentially be used to merge the data sets, as well as avoid truncating values when merging or concatenating data sets. You get a comprehensive list of ALL variables and their attributes in a group of data sets. Table 9 shows a crosswalk created by %TK_xwalk for three data sets. Inspecting the table, you will find:

  • The first column on the left contains the names of variables in all three data sets.  
  • The row labelled Total indicates there are 25 variables in total across the three data sets, with 8 in the demography data set, 5 in the health data set, and 12 in the study A data set.
  • An entry of 1 in any of the last three columns under the header “Data Set” indicates that the variable is present in the data set.  Note that the only variable present in all three data sets is CASEID.
  • If the label, type, or format differ for a variable that is in multiple sets, then a cell is split so that all values occurring in your list of data sets will be displayed.
[Table 9. Crosswalk created by %TK_xwalk for three data sets]

The anomalies in the data that you will discover are:

  • The CHG_WEIGHT variable has a FORMAT assigned in the STUDYA.sas7bdat data set, but not in the HEALTH.sas7bdat data set. 
  • The same is also true for the WEIGHT variable. 
  • The HEALTH variable differs in three attributes between the HEALTH and STUDYA data sets:
    • Labels differ in the two data sets
    • Data types differ – numeric in HEALTH vs character in STUDYA
    • A FORMAT is assigned in the HEALTH data set, but not in STUDYA

You will learn to use %TK_xwalk in Chapter 5.

%TK_FIND_DUPS

You will need to examine each data set to verify that the variables intended to uniquely identify an observation occur on only one observation. You will need to do this for every data set that is created, possibly each time changes are made to the program creating your data set.
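
A sketch of the manual check this replaces might look like the following, using the data set and identification variables from the example below:

proc sql;
    /* list key values appearing on more than one observation */
    select CASEID, WAVE, count(*) as copies
        from work.STUDY
        group by CASEID, WAVE
        having count(*) > 1;
quit;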

 

With just a few strokes of the keyboard, %TK_find_dups will easily do this for you:

%TK_find_dups(dataset=work.STUDY, one_rec_per=CASEID*WAVE, dup_output=STUDY_DUPS);

%TK_find_dups summarizes the results of the inspection for duplicates as shown in Table 10.  This table shows you that there are:

  • Ten unique values of CASEID*WAVE in the 13 observations in data set STUDY
  • Eight observations that have unique values of CASEID*WAVE (COPIES=1)
  • Two observations that have identical values of CASEID*WAVE (COPIES=2)
  • Three observations that have identical values of CASEID*WAVE (COPIES=3)
[Table 10. Summary of duplicate CASEID*WAVE values found by %TK_find_dups]

Table 11 shows the actual values of CASEID*WAVE occurring multiple times in the data set.  You will find:

  • CASEID=200 has three observations with values of WAVE=1
  • CASEID=400 has two observations with values of WAVE=2
[Table 11. Values of CASEID*WAVE occurring multiple times in data set STUDY]

An output data set can optionally be requested with the dup_output= option. It will contain the values of the duplicated identification variables from Table 11, which you can use to extract the duplicated observations from your data set. Chapter 6 of my book shows you the SAS statements you need to combine this output data set with your original data set to create Tables 12 and 13, showing the values of CASEID*WAVE that occur multiple times in your data set. This will allow you to examine in detail the observations that have duplicate values of the identification variables.
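
A minimal sketch of that combination step is shown below; it is not the book's Chapter 6 code, and it assumes the STUDY_DUPS output data set contains the key variables CASEID and WAVE:

proc sort data=work.STUDY out=study_sorted;
    by CASEID WAVE;
run;

proc sort data=STUDY_DUPS out=dups_sorted (keep=CASEID WAVE);
    by CASEID WAVE;
run;

data dup_review;
    merge study_sorted (in=in_study) dups_sorted (in=in_dups);
    by CASEID WAVE;
    if in_study and in_dups;   /* keep only observations whose key value repeats */
run;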

 

[Tables 12 and 13. Observations in data set STUDY with duplicate values of CASEID*WAVE]

%TK_HARMONY

The %TK_harmony macro can identify possible problems with merging or concatenating two data sets. It is very simple to use, requiring only one statement that provides the names of the data sets being harmonized and a nickname for each data set, used to annotate the Harmony report created by %TK_harmony:

%TK_harmony(set1= SAS_data.demography_a1,
	set1_id=Web,
	set2= SAS_data.demography_a2,
	set2_id=Paper,
	out=harmony_results);

Two tables are created by %TK_harmony showing the results of the examination. Table 14 is an overall summary report showing the harmony measures for the variables from the two data sets, to which we have assigned the nicknames WEB and PAPER.

 

[Table 14. Summary Harmony report for the WEB and PAPER data sets]

Table 14 shows that combining the two data sets will create a new set having nine variables with unique names. %TK_harmony creates three measures of harmony for the variables in the data set. These harmony measures indicate the following:

  • SOLO shows that two of the variables are found in only one of the data sets.
  • SAME shows that four variables have the same name and label in both sets.
  • DIFF shows that three variables have the same name, but different data types or variable labels.

The Detail Report shown in Table 15 focuses on the variables with different attributes, identified by a harmony measure of DIFF or SOLO.

 

[Table 15. Detail Harmony report for variables with harmony measure DIFF or SOLO]

Examining Table 15 shows:

  • AGE and SEX have different labels in the two data sets
  • MODE has character data types with different storage lengths (CHAR 3 vs CHAR 4) in the two data sets
  • The Web data set has a variable C-T-I-Y, which is likely a misspelling of the variable CITY. We can also see that the character data types have different storage lengths for CITY and C-T-I-Y in the two sets.

In addition to learning how to use %TK_harmony in Chapter 6 of my book, you will learn the right SAS statements to resolve differences identified by %TK_harmony when combining the data sets.
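
As a rough sketch only (this is not the book's code; the lengths, label wording, and the name literal for the misspelled variable are all assumptions), resolving the Table 15 differences before concatenating might look like this:

options validvarname=any;   /* allows referencing the nonstandard name C-T-I-Y */

data demography_all;
    length MODE $4 CITY $25;   /* longest MODE length wins; $25 for CITY is assumed */
    set SAS_data.demography_a1 (rename=('C-T-I-Y'n=CITY))   /* Web data */
        SAS_data.demography_a2;                             /* Paper data */
    label AGE = 'Age of participant'   /* harmonized labels; wording assumed */
          SEX = 'Sex of participant';
run;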

%TK_SKIP_EDIT

Skip patterns are used in data collection to ensure that only relevant questions are asked of each person participating in the survey. For example, your study might have a set of questions that are asked only of female participants. Male participants would have missing values for all of these questions.

The %TK_skip_edit macro can be used to validate skip patterns as follows:

  • Validate that a variable follows the expected pattern of nonmissing/missing values when the variable is part of the skip pattern logic
  • Handle special recoding to correct inconsistencies in skip patterns and help users understand why a variable is missing

For example, suppose question PG1 asks women the number of pregnancies they have had in their lifetime. This would not be asked if the participant was male.

[Figure 1. Survey questions DEM2 and PG1 illustrating a skip pattern]

Question DEM2, shown in Figure 1, asks each participant their sex (1=female, 2=male). %TK_skip_edit uses this information to examine this skip pattern for you and change the value of PG1 to missing if a male responded to that question. You only need to set up a format identifying the values of a variable that cause a SKIP, and then pass this information to %TK_skip_edit:

proc format;
    value SKIP2f 2='2=SKIP';
    /* the svy_mode format referenced below is not shown in the original
       call; this definition is an assumption for illustration */
    value svy_mode 1='1=Web' 2='2=Paper';
run;

%TK_skip_edit(check_var = PG1,
    skip_vars = DEM2,
    skip_fmts = DEM2 skip2f.,
    strata_fmt = svy_mode svy_mode.);

Information you pass to %TK_skip_edit includes:

  • The name of the variable assigned to the question being checked for consistent flow through the skip pattern (check_var=)
  • The names of the SKIP variables controlling the conditions for skipping the question being checked (skip_vars=)
  • The name of the format identifying the value of the skip variable that causes the question to be skipped (skip_fmts=)
  • Optionally, the name of the stratification variable and its assigned format (strata_fmt=)

Because this survey was administered by either paper questionnaires mailed to participants or a web-based questionnaire, you can ask %TK_skip_edit to stratify your data by a variable defining the strata groups (strata_fmt=).  This allows the inspection and editing of your data to be done separately in each group. 

 

%TK_skip_edit produces an annotated table reporting results from analyzing data flow through the skip pattern and any edits that were made to the data to resolve inconsistencies in the data flow. Table 16 shows these results for variable PG1.

[Table 16. Skip pattern analysis and edits for variable PG1]

You will learn more about skip patterns and how to use the %TK_skip_edit macro in Chapter 7.

 

%TK_MAX_LENGTH

SAS prints the following message in your log file to warn you that there is a mismatch in the storage length of variables in the data sets being combined in a DATA step:

WARNING: Multiple lengths were specified for the variable VAR_NAME by input data set(s). This can cause 
truncation of data.

When you see this message, it means that the values stored in VAR_NAME were possibly truncated when the data sets were combined with a MERGE or SET statement. To prevent this from happening, you can use the %TK_max_length macro to create a macro variable named &MAX_LENGTHS that contains information about the variables that are common to the two data sets but have different storage lengths. This list includes the name and the longest defined length of each such variable. Macro variable &MAX_LENGTHS can be used in a LENGTH statement in the DATA step to prevent truncation of data values when the two data sets are combined. The SAS statements below show how easy it is to use %TK_max_length and a LENGTH statement to prevent truncating data values:

%TK_max_length(set1=My_Data.teleform_data, set2=My_data.web_data);

data survey_v2;
    length &max_lengths;
    set My_Data.teleform_data My_Data.web_data;
run;

You will learn more about using %TK_max_length in Chapter 2.

CONCLUSION

This paper explained the benefits of using the Data Detective’s Toolkit for data cleaning, preparation, and management. Using these macro programs reduces the time needed to prepare data that you can trust. You will automate creating documentation for your data by easily creating codebooks, crosswalks, and data catalogs with just a few strokes on the keyboard. The way you clean data will be modernized, enabling you to easily detect, investigate, and correct inaccurate data values in your data set.

 

The strength of using these macro programs to automate cleaning data and creating documentation lies in their general applicability and simplicity of use. The only requirement for you to use them is having a SAS data set with labels and formats assigned to the variables.

 

Incorporating the Data Detective’s Toolkit in your data preparation activities is easy and will improve the bottom line for you and your clients. Instead of spending your time preparing your data, you will be using your data to change the world!

 

E-books can be purchased from my SAS web page (http://support.sas.com/publishing/authors) and printed copies are available from on-line book sellers such as Amazon or Barnes and Noble.

 

REFERENCES

Chantala, K. 2020. The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS® Macros to Clean, Prepare, and Manage Data. Cary, NC: SAS Institute Inc.

Terry, J., and Chantala, K. 2010. “PROC_CODEBOOK, Automating the Review and Documentation of SAS Files.” SESUG 2010. analytics.ncsu.edu/sesug/2010/BB14.Terry.pdf

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Kim Chantala

RTI International

dchantala@rti.org

 

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

 
