The Hash Object, Recursion and Documentation

3 Likes

In our book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study, Paul Dorfman (@hashman) and I (@DonH) talked about how hash tables can do a lot more than just table lookup.

As the title indicates, we focused on its use in Business Intelligence tasks. Its use as a tool in generating documentation is our point here. We all know that programmers and developers just love to spend their time writing documentation (and gee do I wish there was a sarcasm font for that statement). One thing I have tried to do in my projects is to come up with ways to produce documentation using programs. For example, writing programs to pull off the header comment block for each program or macro.

Recently I was engaged on a project to help upgrade several applications from 9.1.3 to 94. To say they were not well designed or implemented would be an understatement. The code was scattered all over the place and as I started my review it was difficult to remember which of 10-20+ directories certain programs or macros were located in.

So the first thing I did was write a SAS program using wildcards to read in all the code lines from all the directories so I could search and generate some simple listings like an alphabetical list of each macro and where it was stored on the file system as well as a calling hierarchy for each application. Something like this what you see in the image at right.

Note that the data for this listing was obtained by subsetting the data for one of my client's real applications. The generated documents for each of those applications were upwards of 20 pages long - and that is even after I did some things to shorten the listings. For example, some utilities were called dozens and dozens of time from a single macro - thus I consolidated those into a single row called lineNumbers. In addition to subsetting, I also changed the names of all the programs to make them more generic.

The data that was used to generate this listing is simple. For each program or macro, there is a row for each program or macro that it calls. For example, SampleDriver.sas calls:

loadSystemPropertiesMacroVars
sysParmRead
SetCoreOSOptions
processData

and processData calls:

getEtlCounts
ExtractDataSourceA
ExtractDataSourceB
ExtractDataSourceC
ExtractDataSourceD

and so on, and so on.

Typically this would require recursion, as you go down the hierarchy and then back up, you need to keep track of where you are. For example, backing up from GetColumnNames to upddateMetrics to (getMetricDefinitions-GetColumnsNames-calculateMetrics-getDataSourcInfo), you have to keep track of what lines have already been written to the hierarchy. This snippet of data is not that hard. Remember, however, that this is a simplified subset and the hierarchy is somewhat consistent (i.e., you don't go level 2-3-3-4-3-4-5-5-6-7-2 ......)

To say that recursive programming can be complex would be an understatement. That is where the hash object comes into play - it is an ideal tool for that since your table is in memory and there are methods to navigate through its rows.

My initial thought was to use a stack or a queue (covered in Chapter 10: The Hash Object as a Dynamic Data Structure) to do that. Before getting too far into that approach it occurred to me that if I had a table for each macro with what it called, it might be simpler to deal with the recursion. And that idea is covered in Chapter 9: Hash of Hashes - Looping Through SAS Hash Objects. Just create a hash table for each macro. Once I adopted that approach, the recursive nature of the problem remained. But I did not have to write recursive code. I simply needed to query the right hash table for each macro in the right order. So my recursion problem got replaced by iterating through my Hash of Hash (aka HoH) tables.

The code to do that follows.

Macro variables are used to specify the root directory for the data and the output as well as which program is the top level programs for our calling hierarchy. The if 0 and length statements define the PDV host variables for the data portions of our hash tables.

%let root= C:\HCS\Projects\callingHierarchy;
libname code "&root\Data";
%let Driver = SampleDriver;

data _null_;
 if 0 then set code.callingHierarchy;
 length Calling Called $32;

Next we need to create our hash tables. They only need to be defined once; thus, they are contained inside a DO-END block whose statements are only executed on the first execution of the DATA step.

Our HoH hash table will contain the pointers to the hash object instances (or hash tables) for each program that calls other programs. Note that the variable Key must be scalar (i.e., numeric or character), and so we use the Program name. That is also needed, so that the FIND method could be used to retrieve the non-scalar values H and HITER for each individual hash table and the associated hash iterator from the table HoH.

Next, we create the input and output hash tables to contain our calling hierarchy. We also declare two corresponding hash iterator objects to cycle through the tables serially as many times as needed until our hierarchy is fully expanded (thus using an iterative process to circumvent recursion). As seen in the code below, the contents of the input (callingTreeIn) table is copied and expanded and added to the output (callingTreeOut) table. Note that we use a constant value for our Key variable (defined below) since we need both (a) a key value and (b) want the data-values stored in each hash table in the order they are added. This feature of hash table keys is part of the reason we can solve this problem by using iterative logic instead of recursive.

 if _n_ = 1 then
 do;  /* create a Hash of Hashes Table that contains a hash table for
         each program or macro that calls another program or macro */
    dcl hash HoH();  
    HoH.defineKey ("Program");
    HoH.defineData("h","hiter","Program");
    HoH.defineDone();
    dcl hash h;
    dcl hiter hiter;

    dcl hash callingTreeIn(multidata:"y");
    callingTreeIn.defineKey("Key");
    callingTreeIn.defineData("Key","Level","Calling","lineNumbers","Called","Expanded");
    callingTreeIn.defineDone();
    dcl hiter callingTreeInIter("callingTreeIn");

    dcl hash callingTreeOut(multidata:"y");
    callingTreeOut.defineKey("Key");
    callingTreeOut.defineData("Key","Level","Calling","lineNumbers","Called","Expanded");
    callingTreeOut.defineDone();
    dcl hiter callingTreeOutIter("callingTreeOut");
 end; /* create a Hash of Hashes Table that contains a hash table for
         each program or macro that calls another program or macro */

Our next action is to create a hash table for each calling program, and add it to our Hash of Hash table HoH. A few notes about this section of code:

A DoW loop is used to make each DATA step execution process the input rows with its own distinct value of Program.
Since the _NEW_ operator and the subsequent methods are called only on the first row of each Program BY group, only one pair of hash instance of H and (linked to it) iterator instance HITER is created per one distinct value of Program. Their corresponding distinct non-scalar pointer values H and HITER are then stored in the hash table HOH.
The variables in this set of tables have, by design, names different from those in the calling hierarchy hash tables. This is done to prevents collisions in the PDV as we navigate through the various tables.
Since we need to create the hash tables for all the programs/macros before we can start creating our calling hierarchy, we process the rest of the code only once all the data from our calling hierarchy data set (code.callingHierarchy) has been read.

 do until(last.Program);
    /* create the individual hash tables for each macro that calls other macros */
    set code.callingHierarchy end = lr;
    by Program notsorted;
    if first.Program then
    do;  /* create the table of what it calls */
       h = _new_ hash(multidata:"y");
       hiter = _new_ hiter("h");
       h.defineKey("_n_");
       h.defineData("Program","Calls","lineNumbers");
       h.defineDone();
       HoH.add();
    end; /* create the table of what it calls */
    h.add(); /* add the completed hash table to the Hash of Hashes table */
 end;
 if not lr then return;

Now that once we have created a hash table for each calling program, we can begin to create our calling hierarchy. We do that by calling the ADD method to load a single row into the callingTreeIn table. ADD is called explicitly, i.e. using its argument tags KEY and DATA. Note that there is no calling Program and thus no line numbers for our driver, hence the corresponding argument tags are set to blank. By loading the starting point this way, we can start our calling hierarchy at any node in the complete hierarchy. For example, you might want to create a calling hierarchy for a specific entry point (e.g., Extract_DataSourceA in the above listing).

 /* create the complete calling hierarchy once all the macro call hash tables
    have been built
 */
 
 /* first load the initial driver program */
 Key = 1;
 Level = 1;
 callingTreeIn.add(Key:Key
                  ,Data:Key
                  ,Data:Level
                  ,Data:" "
                  ,Data:" "
                  ,Data:"&Driver..sas"
                  ,Data:0
                  );

So, we are now ready to begin iterating to create our calling hierarchy by looping thru the rows in our input hash table and adding them to our output hash table, while, at the same time, adding rows (expanding) for programs called by our current calling Program.

We set a variable, foundHoH, to 0, so we can determine if the current iteration had any programs that were found in our HoH table and need to be expanded. Once all the calling programs have been expanded, we can end iterating (i.e., why not foundHoH is part of our DO UNTIL clause).
As a safety measure, we add a condition to our DO UNTIL loop (maxLevel gt 100) to prevent an infinite loop, should the data have some form of recursion in it. The value of 100 is arbitrary.
Our DO WHILE loop uses the iterator HITER to loop thru our input table and load the values from the hash table callingTreeIn to the callingTreeOut hash table.
Note that the last item (the Expanded field) is set to 1 to indicate that this row will be expanded in the current iteration. We need to copy each input row to the output hash table on each iteration, but we only want to expand it once.
Next, we check to see if the currently called Program is found is the HoH table:
- If it is found, the programs that it calls, along with the line numbers, are copied to their PDV host variables and each row retrieved is added to our output hash table.
- Note that if the table has been expanded in a prior iteration, we do not enter the loop that adds rows to the output table.
- The Expanded field is set to 1 to indicate that this program might need to be expanded in the next iteration. We could have simply coded a 0 on the add method call. But if we did that, SAS would generate an "uninitialized" message for Expanded.
We also increment our foundHoH field to indicate that another iteration is needed.
Once we have read all the rows in our input table, we need to prepare for the next iteration. That is what the prepNextLevel LINK routine (see below) does.
Once we are done iterating, we create an output data set and use the name of the calling program as our SAS table name. If the calling program name cannot be used as the output data set name (easily determined via the NVALID function), a constant like callingHierarchy could be used.
We no longer need the variable Calling so we drop it from the output table. We needed Calling and Called to facilitate the looping. Since our output data set has separate rows with an incremented value of the variable Level it is not needed for our report. Likewise the variables Key and Expanded are not needed.

 /* loop until the hierarchy is fully expanded */
 do until(not foundHoH or maxLevel gt 100); /* limit to 100 to prevent an infinite loop */
    foundHoH = 0; /* intialize a counter so we know we have gotten to the bottom */
    do while(callingTreeInIter.next()=0);
       /* Unconditionally add the record and mark it as expanded in this loop */
       callingTreeOut.add(Key:Key
                         ,Data:Key
                         ,Data:Level
                         ,Data:Calling
                         ,Data:lineNumbers
                         ,Data:Called
                         ,Data:1
                         );
       if HOH.find(Key:Called) = 0 and not Expanded then
       do;  /* the called macro calls other macros - expand it */
          do while(hiter.next()=0);
             Expanded = 0; /* don't hard-code in call, suppress uninitialized message */
             callingTreeOut.add(Key:Key
                               ,Data:Key
                               ,Data:Level+1
                               ,Data:Program
                               ,Data:lineNumbers
                               ,Data:Calls
                               ,Data:Expanded
                               );
             foundHoH + 1;  /* we have calling macros so we need to loop again */ 
          end;
       end; /* the called macro calls other macros - expand it */
    end;
    link prepNextLevel;
 end;
 callingTreeIn.output(dataset:"&Driver(drop=Key Calling Expanded");
 stop;

As we prepare to iterate again, the content of callingTreeOut table needs to be used as the input to the next iteration. Thus, the LINK routine prepNextLevel code simply:

clears the input table
copies the items from the output table to the input table
clears the output table

We are now ready to begin our next iteration.

 prepNextLevel: 
   /* load the data from the previous level to the input hash table
      and initialize the output hash table to empty */
   callingTreeIn.clear();
   do while(callingTreeOutIter.next()=0);
      maxLevel = max(maxLevel,Level);
      callingTreeIn.add();
   end;
   callingTreeOut.clear();
 return;
run;

Our final step is to produce a listing like what is included above. Note that we want programs indented based on their level in the calling hierarchy. So we create a variable called stub whose value is the contcatenation of the Level and Called variables That value is concatenated to a string of non-breakable spaces (the "A0"X values). Those allow for the value to be indented regarding of whether we create rtf, pdf or html output.

A variable (Ord) is also created to force PROC REPORT to preserve the input order in the output report. The PROC REPORT step produces the output seen above

options nodate orientation=portrait;

ods rtf file="&root\Reports\&Driver..rtf";

data report;
 set &Driver;
 ord+1;
 stub = cats(repeat("A0"x,(level-1)*5),catx(' - ',level,Called));
run;

proc report data=report
     style(report) = [rules=none frame=void cellpadding=2 cellspacing=2];
 title "&Driver Calling Hierarchy";
 %let dt = %sysfunc(date(),worddate.);
 %let tm = %sysfunc(time(),timeampm8.);
 footnote "Generated at &tm on &dt";
 columns ord Stub lineNumbers;
 define ord / order noprint;
 define Stub / "Level and Program" display;
 define lineNumbers / "At Line Number(s)";
run;

ods rtf close;

The above code, along with the sample input data and output will be made available as a GitHub Project.

The Hash Object, Recursion and Documentation

Registration is open

SAS AI and Machine Learning Courses