BookmarkSubscribeRSS Feed

PROC DOC III: Self-generating Codebooks Using SAS®

Started ‎09-18-2020 by
Modified ‎08-02-2019 by
Views 3,960

This article will demonstrate how to use good documentation practices and SAS® to easily produce
attractive, camera-ready data codebooks (and accompanying materials such as label statements, format
assignment statements, etc.) Four primary steps in the codebook production process will be explored:
use of SAS metadata to produce a master documentation spreadsheet for a file; review and modification
of the master documentation spreadsheet; import and manipulation of the metadata in the master
documentation spreadsheet to self-generate code to be included to generate a codebook; and use of the
documentation metadata to self-generate other helpful code such as label statements. Full code for the
example shown (using the SASHELP.HEART data base) is attached. 

 

The most onerous task any SAS programming professional faces is to accurately document files and
processes. The truth is that there are no easy answers to the documentation quandary. It takes hard,
painstaking work! By setting careful standards at the outset of a programming task, documenting your
processes, labelling your data files and variables, providing value labels (formats) for your variables
when appropriate, and using the many tools the SAS® system provides to assist in the documentation
process, producing codebooks can be a piece of cake.

You’ve done a lot of hard work documenting every aspect of your programming project, and now it is time
to reap your rewards. There are a number of ways that you can present information from PROC
CONTENTS and PROC DATASETS covered in many other papers, including some of my own. We are
going to focus on the use of an intermediate spreadsheet to drive creation of a robust codebook with selfgenerating code.


STEP 1
It is important to review and evaluate the metadata associated with the data set to be documented. Data
sets should be labeled accurately. Variables should be labelled accurately. If variables have informats or
formats, that information should be available and accurate. There should be a program available to
create a permanent format library with a two level catalog name, if applicable – and those formats should
be accurate. For our example, we create an age category variable that we wish to format, and write a
program to generate a format in a permanent, two-level format catalog.


Code snippet from 1gen_formats_PHARMASUG_2017_QT07.sas (in zip file attached to this article):

 

TITLE1 "PHARMASUG 2017 QT07";
FOOTNOTE1 "%SYSFUNC(GETOPTION(SYSIN)) - &SYSDATE - &SYSTIME - run by
&SYSUSERID in &SYSPROCESSMODE";
RUN;
LIBNAME dd '.';
LIBNAME library '.';
FILENAME odsout '.';
RUN;
PROC FORMAT LIBRARY=LIBRARY.HEART;
VALUE startage 25 - 34='25 to 34 years'
35 - 44='35 to 44 years'
PROC DOC III: Self-generating Codebooks Using SAS®, continued
2
45 - 54='45 to 55 years'
55 - 64='55 to 64 years';
VALUE agefmt 1='25 to 34 years'
2='35 to 44 years'
3='45 to 54 years'
4='55 to 64 years';
RUN;

STEP 2


In the example shown below, a Microsoft Excel® spreadsheet with selected variables from PROC
CONTENTS output is generated using PROC EXPORT in program
2gen_metadata_PHARMASUG_2017_QT07.sas. I am using a modified copy of SASHELP.HEART as the
sample data set for several reasons, one of which is that not all variables are labelled, requiring some
changes. Another reason is that this data set is available to all SAS users.


Code snippet from gen_metadata_PHARMASUG_2017_QT07.sas:

 

DATA dd.heart (LABEL="Copy of SASHELP.HEART for PHARMASUG 2017 QT07-
created by %SYSFUNC(GETOPTION(SYSIN))
- &SYSDATE - &SYSTIME - run by &SYSUSERID in &SYSPROCESSMODE");
LENGTH dslabel $ 200 source $ 32;
SET sashelp.heart;
/* put in some missing labels */
dslabel="Copy of SASHELP.HEART for PHARMASUG 2017 QT07- created by
%SYSFUNC(GETOPTION(SYSIN))
- &SYSDATE - &SYSTIME - RUN by &SYSUSERID in &SYSPROCESSMODE";
source="&dsname";
IF 25 LE ageatstart LE 34 THEN age=1;
IF 35 LE ageatstart LE 44 THEN age=2;
IF 45 LE ageatstart LE 54 THEN age=3;
IF 55 LE ageatstart LE 64 THEN age=4;
IF ageatstart ge 85 THEN age=7;
FORMAT age agefmt.;
LABEL cholesterol='Cholesterol level'
diastolic='Diastolic blood pressure'
height='Height'
sex='Gender'
smoking='Cigarettes per day'
status='Wanted, dead or alive'
systolic='Systolic blood pressure'
weight='Weight'
source='Data set name'
dslabel='Data set information'
age='Age at Start Category'
;
RUN;

. . . PROC EXPORT DATA = dd.heart_cb DBMS = excel
OUTFILE = ".\heart_db.xlsx" REPLACE;
RUN;

Of course, you want to review the results of your spreadsheet creation in Excel and maybe modify a label or format assignment. Note that I have created a variable / column indicating a specialized variable type (VARTYPE), as I want to treat formatted variables differently from unformatted variables in the codebook.

 

pic1.jpg

You can then reimport the modified spreadsheet for use in the next step to: (a) write code to be included to generate a codebook with output varying by variable type; (b) write code to generate a label statement; and (c) write code to generate a format assignment statement, among other normally onerous tasks.

 

STEP 3

 

The codebook generation program, 3_gen_codebook_PHARMASUG_2017_QT07.sas, starts with reimporting the edited version of the metadata spreadsheet, shown above. A number of macros are then constructed: to report on “header information” (i.e. variable name, label, etc.), missing values, and then details on non-missing values, differential by variable type (character, continuous, categorical). Additionally, the program accesses the metadata and outputs text files with macro calls to the macros created above conditional upon the variable type in the metadata and reporting macros, that are then reused in the program as include files.

 

Code snippet from 3gen_codebook_PHARMASUG_2017_QT07.sas:

 

 

DATA _null_;
FILE out1 LRECL=80 PAD;
LENGTH include_string $ 80;
SET dd.heart_cb (KEEP=varnum name vartype);
include_string=CATS('%header(',name,",",varnum,");"); 
PUT include_string; RUN; . . . DATA _null_; FILE out4 LRECL=80 PAD; LENGTH include_string $ 80; SET dd.heart_cb (KEEP=varnum name vartype); IF vartype=1 THEN include_string=CATS('%printtable(',varnum,");");
IF vartype=2 THEN include_string=CATS('%printtablec(',varnum,");");
IF vartype=3 THEN include_string=CATS('%printblurb(',varnum,");"); PUT include_string;
RUN;

 

Macros are written to report on each variable, creating an RTF codebook. These printing macros are utilized in the %include files written by the program inside a TAGSETS.RTF sandwich.

 

Code snippet from 3gen_codebook_PHARMASUG_2017_QT07.sas:

 

 

%MACRO printblurb(order);
ODS TAGSETS.RTF STYLE=styles.noborder; 
ODS STARTPAGE=no; PROC REPORT NOWD DATA=print&order STYLE(report)=[cellpadding=3pt vjust=b] STYLE(header)=[just=center font_face=Helvetica font_weight=bold font_size=10pt] STYLE(lines)=[just=left font_face=Helvetica] ; COLUMNS blurb ; DEFINE blurb / style(COLUMN)={just=l font_face=Helvetica font_size=10pt cellwidth=988 } style(HEADER)={just=l font_face=Helvetica font_size=10pt }; RUN; ODS STARTPAGE=no;
%MEND;

 

pic2.jpg

 

The codebook construction can take some time. Arrange to send yourself a text message with the condition code of your job when it finishes, and get a cup of coffee.

 

Code snippet from 3gen_codebook_PHARMASUG_2017_QT07.sas:

 

FILENAME msg EMAIL TO="0000000000@txt.att.net" 
FROM = "Big Nerd <louise_hadden@abtassoc.com>"
SUBJECT="All Systems Go (or not)?"; DATA _null_; FILE msg; PUT "Program Path and Name: %SYSFUNC(GETOPTION(SYSIN))"; PUT "RUN &SYSDATE - &SYSTIME - by &SYSUSERID in &SYSPROCESSMODE"; PUT "Condition Code is &SYSCC."; RUN;

STEP 4

 

Similarly, metadata can be accessed to create label, format, and length, etc. statements.

 

Code snippet from 4gen_label_fmt_stmnt_PHARMASUG_2017_QT07.sas:

 

 

DATA temp1;
LENGTH include_string $ 180; SET dd.heart_cb;
label=COMPRESS(label,'"');
qlabel=CATS('"',label,'"');
include_string=CATX(' ',name,'=',qlabel); RUN;
DATA templabel (KEEP=include_string);
FILE out1 LRECL=180 PAD;
LENGTH include_string $ 180;
SET runlabel temp1 runrun;
PUT include_string;
RUN;
DATA temp2; LENGTH include_string $ 180; SET dd.heart_cb (WHERE=(format NE ''));
qformat=CATS(format,'.'); include_string=CATX(' ',name,qformat); RUN; DATA tempfmt (KEEP=include_string); FILE out2 LRECL=180 PAD; LENGTH include_string $ 180; SET runformat temp2 runrun; PUT include_string; RUN;

The resulting statement, example shown below, can be included in other programs seamlessly.

 

 

 

pic3.jpg

***************************************************************************************************************************************************

The author gratefully acknowledges the helpful work of Kathy Fraeman, Michael Raithel, Patrick Thornton, Troy Martin Hughes, Richann Watson, Roberta Glass and Kirk Paul Lafler, among others.

Version history
Last update:
‎08-02-2019 08:37 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags