BookmarkSubscribeRSS Feed

Data Library Comparison Macro %COMPARE_ALL

Started ‎03-24-2020 by
Modified ‎07-14-2020 by
Views 9,139

This was a paper that was supposed to be presented at PharmaSUG 2020 this year, but due to the cancellation of the conference I still wanted to create a page with the paper and macro for download to share with everyone. This is a macro that has been extremely convenient whenever I need to compare frozen libraries of data together to get a report of what has changed.  It removes the tediousness of having to compare individual datasets and makes the output easier to read.

 

Abstract

Reproducible research and sharing of data with repositories are becoming more standard, and so the freezing of data for specific analyses is more crucial than ever before.  Maintaining multiple data freezes requires knowing what changed within the data from one version to another.  In SAS there is the COMPARE procedure that allows the user to compare two data sets to see potential new variables, lost variables, and changes in values.   Relying on the COMPARE procedure can be tedious and cumbersome when maintaining a database containing several data sets.  The COMPARE_ALL macro was written to ease this burden by generating a Microsoft Excel report of a comparison of two data libraries instead of just two data sets.   The report indicates any new or lost data sets, variables or observations and checks for changed data values within all variables.  Multiple ID variables can be specified and the macro will determine which variables are relevant with each data set for comparison.  The COMPARE_ALL macro is a fantastic tool for managing multiple versions of the same SAS database.

 

Introduction

The research field of clinical oncology has many reasons for keeping copies, or freezes, of study data including: standardized reporting such as to the Data Safety Monitoring Board (DSMB), sharing data to a repository such as Project Data Sphere, or when publishing the analysis in an abstract or manuscript.  Studies typically have multiple data freezes over several years, and keeping a log of the changes is difficult.  The COMPARE Procedure is useful for comparing one data set at a time, but studies often contain many data sets and using the COMPARE Procedure on each one is tedious.  The COMPARE_ALL macro was created to make an easy to read report in an Excel that compares entire SAS libraries to check for various changes such as new or lost data sets, new or lost variables, change to variable attributes, and change to variable values.  The generated report is a powerful tool for summarizing data changes and ensuring that the data changes are expected.

 

Report Features

The COMPARE_ALL macro produces an Excel report with four different types of tables.  The first is an overall summary of the data sets within the base and comparison libraries, the second is a summary of the variables within a specific data set as well as several variable attributes, the third is a summary of the types of data changes within each data set, and the last is a variable level change summary.

 

Table Type 1: Library Summary

This table is only listed once in the report and is always listed first.  The table lists each data set in either the base or comparison library in alphabetical order.  If a data set exists within both libraries several comparisons are highlighted:

  • Date data sets were last updated
  • Number of variables
  • Number of observations
  • Number of variable attributes that changed (e.g. length, type, label)
  • Number of lost observations (when an ID variable is specified)
  • Number of new observations (when an ID variable is specified)
  • Number of data changes (variable value changes for same observation/ID combination)

The table also indicates which ID variables in the ID list are available to be used in comparisons for the given data set.  The ID variables used can be different within each data set.

 

compare_all_figure1.jpg

 

The report automatically uses colors to highlight data changes and to reference the same items across tables.  Attributes for the base library will always be highlighted blue, the compare library will be grey, ID variables will be green, and data differences will be red.  In table 1 any changes between base and compare will be highlighted orange to draw the user’s attention immediately.  The header displays the file paths of both the base and comparison libraries so that the user can ensure the correct libraries are being compared.

 

Table Type 2: Data Set Level Summary

The macro report will include a data set level summary for each data set that exists within both the base and compare libraries.  Variables are listed in the same order as they appear in the data set by default, but can be listed alphabetically as well.  Each variable indicates if it is an ID variable, has the TYPE, LABEL, FORMAT, and LENGTH attributes listed as well as whether the variable is lost (in base but not in compare), new (in compare but not in base), and for how many observations did the variable’s values change for a given ID combination.

 

compare_all_figure2.jpg

 

Variable names or attributes are highlighted in three cases: if they are an ID variable (green), if an attribute such as typing has changed (orange) and if any of the “Any Differences” columns are not equal to “No” (orange) or if the number of data changes are greater than zero (red).  This draws the user’s attention to the variables and attributes that are of interest for the difference report.  The header lists the data set name and gives a list of the ID variables used for that particular data set.

 

Table Type 3: Data changes summary

The third table type is optional and displays a summary of the data changes either within a data set grouping by either the first N (specified by IDSUMTABLE parameter) ID variable(s) available or across all observations.  The summary is useful to see a quick description of all the changes, whether all of the changes are coming from the same study or patient, or if the changes are spread out.  This also gives a chance to see if related variables are also changing.  For example being able to immediately check if the number of patients having death dates has increased the same amount as the number of patients having a survival status changed from alive to dead.

 

compare_all_figure3.jpg

 

The above figure summarizes all of the variable changes within the first ID variable as well as displaying the number of lost observations, new observations, and data changes.  For a meta-study such as the one in this figure, this is valuable to see if the changes are coming from the expected studies or if a programming error could have caused the changes.  The summary for each variable changes depending on criteria:

  • Comparing two character variables: displays a cross tab of the before and after values with frequencies (see arm_strat_text)
  • Comparing two numeric variables having less than or equal to macro parameter CROSSTAB_THRESHOLD’s unique combinations: displays a cross tab of the before and after values with frequencies
  • Comparing two numeric variables having greater than the macro parameter CROSSTAB_THRESHOLD’s unique combinations: displays the number of times values changed from missing to non-missing, non-missing to missing, and non-missing to non-missing.  There will be a frequency and minimum and maximum change values where appropriate.

The header of the table indicates the data set name and ID variable used to be clear to the user.  It is also possible to specify IDSUMTABLE=0 and summarize across all observations:

compare_all_figure4.jpg

 

The above figure summarizes all of the variable changes, lost observations, new observations, and data changes across all observations of the data sets.  This is a quick way to summarize all variable level changes when there are many changes across numerous ID variable levels.

 

Table Type 4: Variable Change Summary

The fourth table summary is similar to the output given by the COMPARE procedure.  Each variable that has data changes will have its own worksheet to show individual observation changes.

 

compare_all_figure5.jpg

 

The listing will include each ID variable, observation number in the data set, the base data set value, the compare data set value, and the absolute and percent changes if both variables are numeric and non-missing.  The header of the table lists the data set, variable name and label.  The listing is straightforward and mimics the output of the COMPARE procedure.

 

Navigating the Report

The report has the potential to create a large amount of worksheets which can make it difficult to navigate quickly to the desired summary.  The COMPARE_ALL macro accounts for this by inserting hyperlinks into each table to allow the user to jump to the appropriate summary and back.  Examples of the hyperlinks are:

  • Within the Type 1 worksheet the user can click the name of any data set to jump to that data set’s Type 2 data set summary page.
  • Within each Type 2 data set summary page any variable that is marked red for data changes can be clicked on to jump to that variable’s type 4 variable listing.
  • Each Type 2, 3, and 4 worksheet includes a link in the header to either jump back to the Type 1 worksheet or to drill back up a level (e.g. Type 4 listing back to the Type 2 data set listing)

These hyperlinks allow the user to navigate the report much more efficiently.

 

Calling the COMPARE_ALL Macro

The COMPARE_ALL macro itself is straightforward available due to the small amount of parameters needed.  Using the macro requires access to ODS EXCEL within SAS.  There are three required parameters and four optional parameters:

  • Required parameters:
    • BASE: Library name to be considered the old data for comparisons.  Note that the libname must already be specified prior to the macro.
    • COMPARE: Library name to be considered the new data for comparisons.  Note that the libname must already be specified prior to the macro.
    • OUTDOC: The destination filepath and filename of the XLSX file that will contain the report
  • Optional parameters:
    • ID: A space delimited list of variable names to be used as ID variables.  Note that not all data sets in the same library have to contain all of the ID variables.  The macro will match any available ID variable in the list to a given data set in the order they are listed.  ID variables not in a given data set are ignored for that comparison.
    • IDSUMTABLE: Determines how many of the available ID variables are used to produce the Table 3 Summary
    • SELECT: Allows the user to specify which data sets in the library are included in the report.  The user can specify a space delimited list of the desired data sets.  If this option is used then a message is shown in the Type 1 worksheet to indicate that not all of the data sets in the libraries are being shown.  The data sets in the SELECT statement must exist in both the BASE and COMPARE libraries.
    • CROSSTAB_THRESHOLD: Determines the threshold for the number of unique variable value changes before summarizing the changes as non-missing and missing to save vertical space.  The value defaults to 15 and must be greater than or equal to 1.

The following is the macro call that leads to the images in figure 1 to figure 4 (file paths and study names have been masked for confidentiality):

libname live '/live_data/';
libname freeze '/frozen_data/';
options fmtsearch=(live work library);

%compare_all(
     base=freeze,
     compare=live,
     id=protnum dcntr_id merged_day visit study_day start_dt end_dt,
     outdoc=database_changes.xlsx,
     select=,
     idsumtable=1);

The libraries are predefined in LIBNAME statements.  The FMTSEARCH option is enabled in order to activate the formats of the library in order to make the values in the comparisons make sense.  The most complicated parameter is the ID option.  There are seven ID variables listed, but a majority of the data sets will only use the first two listed ID variables due to not having the other five.  There are several data sets in these libraries where there are multiple rows per patient and times such as MERGED_DAY and START_DT are necessary to compare the appropriate observations.  The macro will search for all seven listed ID variables in each data set and subset down to only the variables that exist in both the base and compare data sets.

 

Conclusion

The COMPARE_ALL macro is a powerful tool to efficiently compare two SAS libraries versus running multiple COMPARE Procedure calls for each data set in the libraries.  The generated report is clean and straightforward to read with built in navigation for ease of use.  The COMPARE_ALL macro will be useful for any programmer working with clinical trial data and is available to be shared.

 

Contact Information

I can be contacted for questions or issues with using the macro.  I'll be updating this page with the final paper and presentation when they are available.  I normally program in Linux so there is the potential for issues in Windows SAS that I do not normally encounter.

Name: Jeffrey Meyers

Enterprise: Mayo Clinic

E-mail: Meyers.jeffrey@mayo.edu / jpmeyers.spa@gmail.com

Comments

Thank you @JeffMeyers for sharing your work!

Very useful macro. Thank you.

Thank your @JeffMeyers  very useful macro.  I have couple of questions.

1. I would like to change the color of the text where it creates the  hyperlinks for example in " Top summary" sheet. in columns with with dataset name will hyperlink to other sheets in the file. I want to change the color of dataset to blue so we know its as hyperlink. I played around the first proc report column styles but it did not worked for me. Any ideas 

SASuserlot_0-1666719291958.png

 

2.  Is this macro through any notes/ warning to indicate if one of the library missing the dataset  lets say one have 3 datasets and another have 2? . I understand that 'select '  macro variable have the ability to select the datasets.

 

3.  Is it gives any indication or notes or warning about if 'compare' dataset missing the one of the variable in the ID macro variable compare to base dataset? Ex: lets say base have variables 'usubjid studyid siteid' but compare have ' usubjid studyid' only. If we call ID macro variable with 'usubjid studyid siteid', is it gives any notes or warning one of the dataset missing the values or it will just run with the common variables?

Thank you @JeffMeyers  for your macro, it's really powful.

I have one suggestion, if user use UTF-8 SAS,  he need change "A0"x  to 'C2A0'x in PROC REPORT, otherwise SAS will occur "ERROR: Invalid characters were present in the data."

reference link.

another helpful link. 

 

in my side, I just made some modification of your code .

 

   %local nbsp;
   data _null_;
   length nbsp $10;
     if "%SYSFUNC(GETOPTION(encoding))" = "%str(WLATIN1)" then nbsp='A0'x;
	 else if "%SYSFUNC(GETOPTION(encoding))" = "%str(UTF-8)" then nbsp='C2A0'x;
     call symput('nbsp',put(nbsp,$hex8.));
   run; 
 define n6 / across "&nbsp"x style={borderbottomstyle=none borderbottomcolor=white};
       define n5 / across "&nbsp"x style={borderbottomstyle=none borderbottomcolor=white background=lightgreen};

Hello @SASuserlot 

My apologies for not getting back to you sooner.  

1. I would like to change the color of the text where it creates the  hyperlinks for example in " Top summary" sheet. in columns with with dataset name will hyperlink to other sheets in the file. I want to change the color of dataset to blue so we know its as hyperlink. I played around the first proc report column styles but it did not worked for me. Any ideas 

--You can do this within a compute block.  There should be a number of CALL DEFINE statements that are assigning the URL for the hyperlinks.  If you go to where these are made and at the same time add something like CALL DEFINE('column name','style/merge','style={color=blue}'); it should color the font blue.  BACKGROUNDCOLOR will change the fill color.

 

2.  Is this macro through any notes/ warning to indicate if one of the library missing the dataset  lets say one have 3 datasets and another have 2? . I understand that 'select '  macro variable have the ability to select the datasets.

--If a dataset is in one library but not the other it would only be visible in the first sheet summary.  There should be dates for the dataset that exists and nothing for the one that doesn't

 

3.  Is it gives any indication or notes or warning about if 'compare' dataset missing the one of the variable in the ID macro variable compare to base dataset? Ex: lets say base have variables 'usubjid studyid siteid' but compare have ' usubjid studyid' only. If we call ID macro variable with 'usubjid studyid siteid', is it gives any notes or warning one of the dataset missing the values or it will just run with the common variables?

--There are no warnings or notes to indicate which ID variables exist or don't in each dataset.  These are only listed on the first page summary.  This is because each dataset can have a different combination of variables so there was no reason to add warnings in this case.

Hello @NovGetRight .  Thank you for the suggestion.  I have come across situations like this as well and am learning to account for them in future programs.

Thank you @JeffMeyers  for the macro.

Have an error occured and cannot figure out what is the problem:

Error.png

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I have downloaded the file and modified only the %local nbsp part suggested above.  Can you please check it.

Thanks in advance,

Janos

@JanosBecsi since the macro uses PROC and DATA steps, you cannot call it within PROC SQL (which you seem to have done).

@Kurt_Bremser  I called it separately, not in proc sql: 

call.png

 

Then why does your log show

else . end,

immediately after the macro call?

Hello @JanosBecsi ,

  There is a part of the macro that is updating numbers with proc SQL update statements using macro variable counts created in a datastep.  There are situations where no datasets are actually compared and so none of these numbers are created so the statement would look like:
case 

else 

Normally with datasets compared there is a when clause for each dataset.  When no datasets are compared it can't create any when clauses so it creates the syntax error that you're seeing.  I have a newer version that I've made to try to avoid this situation that I will try to upload soon.

 

@Kurt_Bremser The macro internally uses PROC SQL a lot, which is where the error is coming from after the macro is called.

Hello @JeffMeyers ,

 

Did you have time to correct the code? If so, please upload the new version.

 

Thanks,

Janos

Hello @JanosBecsi 

I have been having trouble uploading new versions of macros to the website.  This is where the fixes need to be done in the program if you want to modify yours:
It is basically just adding a when statement that will always be false so it'll never be picked, but if &ndata=0 the case syntax won't resolve in an error.

JeffMeyers_1-1707333094193.png

 

 



Version history
Last update:
‎07-14-2020 12:46 PM
Updated by:
Contributors

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags