Library Datasets Summary Macro %DATA_SPECS

13 Likes

In preparations for SAS Global Forum 2019 I am putting together an article reference for my presentation as well as uploading the macro so others can try using it. This macro has been extremely helpful for myself in the various meta and pooled analyses that I work on. I receive data from a large number of oncology research groups and have to be able to see what's in it quickly, sometimes without a data dictionary. The DATA_SPECS macro summarizes all of the datasets and variables contained within a specific library and creates and Excel report in a data dictionary style format.

Abstract

The field of clinical research often involves sharing data with other research groups and receiving data from other research groups. This creates the need to have a quick and concise way to summarize incoming or outgoing data that allows the user to get a grasp of the number of datasets, number of variables, and number of observations included in the library as well as the specifics of each variable within each dataset. The CONTENTS procedure can fulfill this role to an extent, but the DATA_SPECS macro uses the REPORT procedure along with the Excel Output Delivery System (ODS) destination to create a report that is fine tuned to summarize a library. The macro produces a one page overview of the datasets included in the specified library, and then creates a new worksheet for each dataset that lists all of the variables within that dataset along with their labels, formats, and a short distribution summary that varies depending on variable type. This gives the user an overview of the data that can be used in documents such as data dictionaries, and demonstrates an example of the powerful reports that can be generated with the ODS Excel destination.

What does the macro do?

Summarize a Library of Datasets

The DATA_SPECS macro is designed to give the user a quick but informative look at a library of datasets. The generated report creates three different types of tables.

Table 1: Summary of datasets

The first table is a simple listing of the datasets that are contained within the specified library. There is a column for the dataset name (which includes any dataset label in parentheses), number of observations, number of unique index values, and number of variables. Index variables are specified to the macro to try to determine unique subjects and can be variables such as patient ID. Knowing the number of observations vs the number of unique index values indicates to the user if the dataset has one row per subject or multiple observations per subject.

Table 2: Summary of variables that exist in multiple datasets

The second table will display any variables that exist across multiple datasets in the library. This information is important to see if any critical variables such as gender have different values for the same subject across datasets. The table lists the variable name, the list of datasets it is contained in, and all unique variable labels that variable has in those datasets. Different labels can indicate the variable has different meanings in the different datasets.

Table 3: Summary of variables within each dataset

The macro creates one table within a worksheet for each dataset in the library which contains a summary of each variable within that dataset. The variables can be laid out in three different formats: long, condensed, and wide. The meta-data for each variable includes its name, label, format, length, type, and a short distribution that varies depending on the type of variable.

Long format version:

In this version only two columns are used in the worksheet. The first indicates which piece of meta-data on the variable is being listed, and the second has the corresponding value.

Condensed format version:

In this version there is one row for each variable, and each of the meta-data pieces has their own column.

Wide format version:

In this version the variables are listed left to right instead of up to down. There is one row for each piece of meta-data and one column for every variable.

Numeric Variables:

Numeric variables are summarized in one of two ways depending on the macro parameter CAT_THRESHOLD. If there are more unique values than the threshold specified by CAT_THRESHOLD then the variable is considered continuous and is summarized with the number of non-missing observations, number of missing observations, median and range. The format attached to the variable is used for the median and range to accommodate date type variables.

If there are less than or equal to the CAT_THRESHOLD unique items then the variable is considered discrete. Discrete variables are shown with a frequency list of each value in the variable. If the variable is formatted and the formatted value does not match the unformatted value, then the list is shown as UNFORMATTED (FORMATTED): N (%).

Character Variables:

Character variables are always considered to be discrete variables by the macro, but the CAT_THRESHOLD parameter is again used to determine if the variable has too many unique levels to effectively be shown. If there are more levels than CAT_THRESHOLD then the macro simply lists how many missing and non-missing values there are.

How to use the macro?

The DATA_SPECS macro is simple to use and is very thoroughly developed for sharing. The macro has customizability with few parameters, full documentation, and full error checking on the parameters. The program will run without changing the user’s settings, leaves behind no temporary datasets, and does not put any output to the log or results windows.

Macro Parameters

The DATA_SPECS macro only has two required parameters and five optional parameters. The required parameters are the only options that require inputs when calling the macro as the optional parameters all have default settings.

Required Parameters

LIBN

The LIBN parameter refers to the libname of interest to be summarized by the macro. The library must exist and the libname must be established prior to running the macro.

OUT

The OUT parameter refers to the outputted Excel file that is created at the end of the macro that will contain the library summary. The file path should include the .XLSX at the end of the filename and requires the full path.

Optional Parameters

INDEX

The INDEX parameter allows the user to specify any potential patient identifying variables that will be used by the macro to determine how many unique patients or subjects are within each dataset if the variable(s) are within the dataset. Multiple variables can be specified in two different ways. If there are multiple index variables, such as patient ID and study center ID, they can be specified in a list separated by spaces. The macro will search through the list and find all of the unique combinations of the listed variable as a new combined INDEX variable. The second scenario is that different datasets have different patient ID variables. This often happens when multiple data transfers happen from the same group, but by different programmers. This can be accommodated by the macro by separating lists with a | (pipe) symbol. The macro will search each dataset for the first of the separated lists that exists and use that to create the index variable.

CAT_THRESHOLD

The CAT_THRESHOLD parameter determines how many unique values a numeric variable can have before it is considered continuous. When the number of distinct values is less than the threshold then a frequency distribution will be shown in the summary. Otherwise continuous statistics such as median and range are shown in the summary.

WHERE

The WHERE parameter allows the user to pass a WHERE statement into the dataset produced by the CONTENTS procedure within the macro. This can be useful if certain variables or datasets are not wanted within the summary.

FORMAT

The FORMAT parameter determines whether the variables within each dataset’s summary are shown in a long or wide format. The long format will list all of the variables vertically, and the wide format will list all of the variables horizontally.

ORDER

The ORDER parameter determines whether the variables within each dataset are shown alphabetically or by the variable order within the original dataset.

Error checking

The DATA_SPECS macro has full error checking programmed in order to improve the user experience. The macro will check the following list of issues:

If the library exists and has been assigned
If the OUT and LIBN parameters are missing in the macro call or set to null
If the CAT_THRESHOLD parameter is not a number greater than 0
If the FORMAT parameter is not set to a value of LONG or WIDE
If the ORDER parameter is not set to a value of VARNUM or ALPHA
If the current session’s SAS version is not at least 9.4 or greater

If any issues are found the macro will do the following:

Stop the macro and restore any changed options
Send an error message to the log that describes the issue and gives potential solutions. For example if using the wrong value for the FORMAT parameter the macro will provide the list of acceptable values in the log.

Conclusion

The DATA_SPECS macro is a powerful tool for quickly summarizing a library of datasets. The ability to quickly determine the number and size of datasets as well as have a quick glance at the content of the variables within each dataset will increase the efficiency of understanding and integrating new or shared data. The DATA_SPECS macro is also an excellent example of the versatile reports that can be created using the Excel output destination. Being able to create tables within separate tabs allows for quick navigation, and not having vertical or horizontal space limits allows for more information to be presented within a single table. The Excel output destination still has the useful options that TAGSETS.EXCELXP did, but with the benefits of smaller file sizes and more flexible customizations.

Example

The following is an example using the SAS library SASHELP which is automatically created when opening SAS. Some datasets are excluded due the sheer number of variables. The outputted document is attached to this article.

%data_specs(LIBN=sashelp, OUT=data_specs_example.xlsx,
    INDEX=id, WHERE=memname ^in('LEUTEST' 'LEUTRAIN') 
                               and substr(memname,1,1)^='V');

Norman21 · ‎04-22-2019

Hi Jeff,

Thanks for sharing this macro - I can already see myself using it regularly in future.

I have just one suggestion for improvement. For Table 3 (Summary of variables within each dataset), in addition to LONG and WIDE, is it possible to add a third option so that there is one variable per row, and four columns (one for each specification)? This would make the output consistent with the other tables, and facilitate quickly finding a variable of interest.

Thanks again, and I look forward to seeing your final SAS Global Forum 2019 presentation.

Norman.

JeffMeyers · ‎04-22-2019

Hello @Norman21,

That is a great suggestion and was actually quite simple to implement. Please see the additional option CONDENSED to the parameter FORMAT that outputs like the following screenshot:

I updated the downloadable macro on this webpage to include the option.

Thanks again for the feedback!

Tom · ‎04-22-2019

Why does it report that NAME has a format attached of $18.? When I run PROC CONTENTS on SASHELP.BASEBALL there is no format attached to the variable NAME.

JeffMeyers · ‎04-22-2019

Hello @Tom,

When a character variable does not have an actual format applied to it the default format it has is the length ($length to be exact). You can see this with the VFORMAT function:

data _null_;
    set sashelp.baseball (obs=1);
    x=vformat(name);
    put x=;
run;

tomrvincent · ‎04-25-2019

'library at datasets'? Is that meant to be 'library of datasets' or something else?

Tom · ‎04-25-2019

So that is a limitation of using the VFORMAT() function instead of querying the metadata (via PROC CONTENTS or the DICTIONARY.COLUMNS view). So for numbers without a format attached it returns `best12.` and for character variables it returns `$` format with a width that matches the length of the variable.

JeffMeyers · ‎04-25-2019

@tomrvincent thank you for pointing that out for me. It is supposed to be "library of datasets." I made the update.

@Tom I personally don't consider it a limitation for the following reason. When I'm making a macro variable that saves the format that I later need to use in a PUT function the VFORMAT function creates the exact value that I need including the period. If I pull from the dictionary tables or proc contents output table then I have to add more logic if there's no format listed.

Oleander_ · ‎09-04-2019

When I write "index=sex age" , I can't find these two variables in Table 2. But when I write "index=sex|age" they appeared again.

May I ask about the role of the "|"? Thanks~

JeffMeyers · ‎09-04-2019

@Oleander_ the | symbol is what the macro uses as a delimiter between variable names. I could probably just use spaces but when I wrote the macro I was still fairly new and now the option has been established at my work unit and is hard to change due to existing programs.

VDD · ‎09-26-2019

@JeffMeyers

when I use the macro and run it using this statement I get an error.

%data_specs(libn=sashelp,out=work);

how would I resolve this error?

ERROR: The text expression length (-30995) exceeds maximum length (65534). The text expression has been truncated to 65534
characters.

ERROR: An exception has been encountered.
Please contact technical support and provide them with the following traceback information:

The SAS task name is [Submit]
ERROR: Write Access Violation Submit
Exception occurred at (0599CE8C)
Task Tracebac

JeffMeyers · ‎03-24-2020

Hello @VDD,

I would not recommend running the macro on the entire SASHELP library. There are a few data sets and SQL views that have an enormous number of variables and observations and I believe it is overloading the macro in your case.

CarlosSpranger · ‎03-24-2020

Nice contribution @JeffMeyers.

Will save time in data prep for curat.

Thanks.

Cdmitnick · ‎04-16-2024

Hi @JeffMeyers ,

Thanks for this very useful tool.

I'm trying to use it now:

%macro data_specs(LIBN=derived, OUT=derived.endTB_derived_dict.xlsx,
INDEX=endTB_Case_ID, endTB_Subject_ID, id_baseline_template
CAT_THRESHOLD=15
FORMAT=condensed
ORDER=NAME
);

but no excel file is output.

I get no errors, but do get repeated notes that " NOTE: The quoted string currently being processed has become more than 262 bytes long. You might have unbalanced quotation marks" and

"NOTE 49-169: The meaning of an identifier after a quoted string might change in a future SAS release. Inserting white space between a quoted string and the succeeding identifier is recommended."

thanks,

Carole

Cdmitnick · ‎04-16-2024

Hi @JeffMeyers ,

please ignore my previous post (I have tried to delete/edit it, but was not able to). I solved those issues.

Now, it runs beautifully, but I get this error message:

" ERROR: Insufficient authorization to access /pbr/biconfig/940/Lev1/SASApp/endTB_derived_dict.xlsx." I get the same message whether I try to output the Excel file to a library or to work.

I'm working in the web-based SAS Studio environment, in case that offers insight.

here's the invocation of the macro:

%data_specs(LIBN=derived, OUT=endTB_derived_dict.xlsx,
INDEX=(endTB_Case_ID, endTB_Subject_ID, id_baseline_template),
CAT_THRESHOLD=15,
FORMAT=condensed,
ORDER=NAME

);
%data_specs

thanks,

carole

ballardw · ‎04-16-2024

Not looking into the code for the macro but it looks like your "OUT=" should include a fully qualified path starting at a disk mount point (or drive letter for Windows). The mount should be one that you have write privileges for.

something like (syntax appropriate for your operating system):

%data_specs(LIBN=derived, OUT=mymount/subfolder/endTB_derived_dict.xlsx,
INDEX=(endTB_Case_ID, endTB_Subject_ID, id_baseline_template),
CAT_THRESHOLD=15,
FORMAT=condensed,
ORDER=NAME

);

Without the path portion it is defaulting to attempting to send the created file to the location that SAS is executing from and you would be unlikely to have permissions there.

Cdmitnick · ‎04-17-2024

Thanks. I have previously successfully output to the location where SAS is operating. I just confirmed that I can write to that location by outputting the table resulting from a proc contents, admittedly not as .xlsx. On the contrary, unfortunately, I have not been able to output to my machine. I don’t know what the path would be to do that successfully.

If you have any other suggestions, I’d be grateful.
Thanks,
carole

sassy_seb · ‎04-30-2024

Anyone know how I can use this and not output all of the FREQ of the variables? Thank you!