In preparations for SAS Global Forum 2019 I am putting together an article reference for my presentation as well as uploading the macro so others can try using it. This macro has been extremely helpful for myself in the various meta and pooled analyses that I work on. I receive data from a large number of oncology research groups and have to be able to see what's in it quickly, sometimes without a data dictionary. The DATA_SPECS macro summarizes all of the datasets and variables contained within a specific library and creates and Excel report in a data dictionary style format.
The field of clinical research often involves sharing data with other research groups and receiving data from other research groups. This creates the need to have a quick and concise way to summarize incoming or outgoing data that allows the user to get a grasp of the number of datasets, number of variables, and number of observations included in the library as well as the specifics of each variable within each dataset. The CONTENTS procedure can fulfill this role to an extent, but the DATA_SPECS macro uses the REPORT procedure along with the Excel Output Delivery System (ODS) destination to create a report that is fine tuned to summarize a library. The macro produces a one page overview of the datasets included in the specified library, and then creates a new worksheet for each dataset that lists all of the variables within that dataset along with their labels, formats, and a short distribution summary that varies depending on variable type. This gives the user an overview of the data that can be used in documents such as data dictionaries, and demonstrates an example of the powerful reports that can be generated with the ODS Excel destination.
The DATA_SPECS macro is designed to give the user a quick but informative look at a library of datasets. The generated report creates three different types of tables.
The first table is a simple listing of the datasets that are contained within the specified library. There is a column for the dataset name (which includes any dataset label in parentheses), number of observations, number of unique index values, and number of variables. Index variables are specified to the macro to try to determine unique subjects and can be variables such as patient ID. Knowing the number of observations vs the number of unique index values indicates to the user if the dataset has one row per subject or multiple observations per subject.
The second table will display any variables that exist across multiple datasets in the library. This information is important to see if any critical variables such as gender have different values for the same subject across datasets. The table lists the variable name, the list of datasets it is contained in, and all unique variable labels that variable has in those datasets. Different labels can indicate the variable has different meanings in the different datasets.
The macro creates one table within a worksheet for each dataset in the library which contains a summary of each variable within that dataset. The variables can be laid out in three different formats: long, condensed, and wide. The meta-data for each variable includes its name, label, format, length, type, and a short distribution that varies depending on the type of variable.
In this version only two columns are used in the worksheet. The first indicates which piece of meta-data on the variable is being listed, and the second has the corresponding value.
In this version there is one row for each variable, and each of the meta-data pieces has their own column.
In this version the variables are listed left to right instead of up to down. There is one row for each piece of meta-data and one column for every variable.
Numeric variables are summarized in one of two ways depending on the macro parameter CAT_THRESHOLD. If there are more unique values than the threshold specified by CAT_THRESHOLD then the variable is considered continuous and is summarized with the number of non-missing observations, number of missing observations, median and range. The format attached to the variable is used for the median and range to accommodate date type variables.
If there are less than or equal to the CAT_THRESHOLD unique items then the variable is considered discrete. Discrete variables are shown with a frequency list of each value in the variable. If the variable is formatted and the formatted value does not match the unformatted value, then the list is shown as UNFORMATTED (FORMATTED): N (%).
Character variables are always considered to be discrete variables by the macro, but the CAT_THRESHOLD parameter is again used to determine if the variable has too many unique levels to effectively be shown. If there are more levels than CAT_THRESHOLD then the macro simply lists how many missing and non-missing values there are.
The DATA_SPECS macro is simple to use and is very thoroughly developed for sharing. The macro has customizability with few parameters, full documentation, and full error checking on the parameters. The program will run without changing the user’s settings, leaves behind no temporary datasets, and does not put any output to the log or results windows.
The DATA_SPECS macro only has two required parameters and five optional parameters. The required parameters are the only options that require inputs when calling the macro as the optional parameters all have default settings.
The LIBN parameter refers to the libname of interest to be summarized by the macro. The library must exist and the libname must be established prior to running the macro.
The OUT parameter refers to the outputted Excel file that is created at the end of the macro that will contain the library summary. The file path should include the .XLSX at the end of the filename and requires the full path.
The INDEX parameter allows the user to specify any potential patient identifying variables that will be used by the macro to determine how many unique patients or subjects are within each dataset if the variable(s) are within the dataset. Multiple variables can be specified in two different ways. If there are multiple index variables, such as patient ID and study center ID, they can be specified in a list separated by spaces. The macro will search through the list and find all of the unique combinations of the listed variable as a new combined INDEX variable. The second scenario is that different datasets have different patient ID variables. This often happens when multiple data transfers happen from the same group, but by different programmers. This can be accommodated by the macro by separating lists with a | (pipe) symbol. The macro will search each dataset for the first of the separated lists that exists and use that to create the index variable.
The CAT_THRESHOLD parameter determines how many unique values a numeric variable can have before it is considered continuous. When the number of distinct values is less than the threshold then a frequency distribution will be shown in the summary. Otherwise continuous statistics such as median and range are shown in the summary.
The WHERE parameter allows the user to pass a WHERE statement into the dataset produced by the CONTENTS procedure within the macro. This can be useful if certain variables or datasets are not wanted within the summary.
The FORMAT parameter determines whether the variables within each dataset’s summary are shown in a long or wide format. The long format will list all of the variables vertically, and the wide format will list all of the variables horizontally.
The ORDER parameter determines whether the variables within each dataset are shown alphabetically or by the variable order within the original dataset.
The DATA_SPECS macro has full error checking programmed in order to improve the user experience. The macro will check the following list of issues:
If any issues are found the macro will do the following:
The DATA_SPECS macro is a powerful tool for quickly summarizing a library of datasets. The ability to quickly determine the number and size of datasets as well as have a quick glance at the content of the variables within each dataset will increase the efficiency of understanding and integrating new or shared data. The DATA_SPECS macro is also an excellent example of the versatile reports that can be created using the Excel output destination. Being able to create tables within separate tabs allows for quick navigation, and not having vertical or horizontal space limits allows for more information to be presented within a single table. The Excel output destination still has the useful options that TAGSETS.EXCELXP did, but with the benefits of smaller file sizes and more flexible customizations.
The following is an example using the SAS library SASHELP which is automatically created when opening SAS. Some datasets are excluded due the sheer number of variables. The outputted document is attached to this article.
%data_specs(LIBN=sashelp, OUT=data_specs_example.xlsx, INDEX=id, WHERE=memname ^in('LEUTEST' 'LEUTRAIN') and substr(memname,1,1)^='V');
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.