in our book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study @hashman and @DonH presented the following example. It is described in the book at Program 6.12, file name of Chapter 6 Splitting Sorted Data via Hash.sas.
/* "Chapter 6 Splitting Sorted Data via Hash.sas" from the SAS Press book Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study */ data _null_ ; if _n_ = 1 then do ; dcl hash h (multidata:"Y") ; h.defineKey ("_n_") ; h.defineData ("League", "Team_SK", "Team_Name") ; h.defineDone () ; end ; do until (last.League) ; set bizarro.Teams ; by League ; h.add() ; end ; h.output (dataset: catx ("_", "work.League", League)) ; h.clear() ; run ;
It assumed the data set was sorted and so a BY group could be used to split the data set.
The follow-on question is how to do this if the data set is not sorted by the variable used to split the data set. And a related question is how to handle the case where the values perhaps need to be grouped or cannot be easily mapped to a valid SAS name.
In this article, we will present two alternative approaches (both using hash tables) that will create a data set for each value (or group of values) from the input DATA step. For these examples, we will be using the DW.Runs data set. It has a record for each run scored in each game, including the inning in which the run scored. We want to create a separate data set for each inning. Our first step is to create a lookup table that maps the inning number to the name of the data set that inning should be output to. For this program, we are using a PICTURE format; but we could have easily used a data set that we could be loaded into a hash table.
proc format; /* could also do this with a hash table */ picture inninglbl low-high = "09" (prefix="Inning_") ; run;
This format will create a separate data set for each inning. If we wanted to group them, we could use a value statement like:
value inningGrp 1-3 = "Innings_1_3" 4-6 = "Innings_4_6" 7-9 = "Innings_7_9" other = "Extras" ;
The first thing we need to do in our DATA step program is to create a list of all the values of the Inning variable. Note that for purposes of simplicity we are using Inning; we could make the name of the variable if we want more flexibility.
The following statements create a hash table and implicitly load the DW.Runs data set. Since the MULTIDATA argument tag is not specified, the resulting hash table has one item for each inning. And since there is no DEFINEDATA method call, the variable Inning is the default data portion hash variable for the hash table.
A hash iterator (Iter) is also defined to allow us to loop thru the hash table items (i.e. to enumerate them).
data _null_; dcl hash Innings(dataset:"DW.Runs"); Innings.defineKey("Inning"); Innings.defineDone(); Innings.output(dataset:"List"); dcl hiter Iter("Innings");
The next step is to loop thru the items in that hash table to create two variables:
length datasets outputStmts $32755 elseIf $8; elseIf = "if"; do while(Iter.next() = 0); datasets = catx(" ",datasets,put(Inning,inninglbl.)); outputStmts = strip(outputStmts) || strip(elseIf) || " put(Inning,inninglbl.)='" || put(Inning,inninglbl.) || "' then output " || put(Inning,inninglbl.) || ";" ; elseIf = "else if"; end;
Once this loop has completed execution, the values of the two variables are:
The next step is to execute the above code in a DATA step to create the output data sets. We could use CALL EXECUTE, which would cause the code to execute in a following DATA step. Instead, we decided to use the DOSUBL function which launches a separate SAS executive to execute the code immediately.
rc = dosubl("data " || strip(datasets) || "; set dw.runs;" || strip(outputStmts) || "run;"); stop; set dw.runs;
We use a STOP statement since all the data is read in a single execution of the DATA Step. As discussed in the article The SET Statement's Compile Time Functions, the SET statement after the STOP statement is used to define the PDV host variables. for each defined hash variable. The following show the SAS Log notes from running this program:
NOTE: There were 78346 observations read from the data set DW.RUNS. NOTE: There were 78346 observations read from the data set DW.RUNS. NOTE: The data set WORK.INNING_2 has 8696 observations and 5 variables. NOTE: The data set WORK.INNING_5 has 8635 observations and 5 variables. NOTE: The data set WORK.INNING_1 has 8895 observations and 5 variables. NOTE: The data set WORK.INNING_9 has 8533 observations and 5 variables. NOTE: The data set WORK.INNING_3 has 8545 observations and 5 variables. NOTE: The data set WORK.INNING_7 has 8741 observations and 5 variables. NOTE: The data set WORK.INNING_4 has 8848 observations and 5 variables. NOTE: The data set WORK.INNING_6 has 8590 observations and 5 variables. NOTE: The data set WORK.INNING_8 has 8863 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.21 seconds cpu time 0.20 seconds
An alternative approach is to create a separate hash table for each Inning or group of Innings. This uses something we refer to in the book, and discuss in Chapter 9, as a Hash of Hashes (HoH) - a hash table with a data portion variable containing non-scalar values of type hash pointing to other hash tables. The program uses this approach.
This program demonstrates creating an output data set for groups of innings. We do that with a VALUE statement in our PROC FORMAT step.
proc format; /* could also do this with a hash table */ value inningGrp 1-3 = "Innings_1_3" 4-6 = "Innings_4_6" 7-9 = "Innings_7_9" other = "Extras" ; run; /* this list code also be created using data driven approaches */ %let columns = "Game_SK", "Inning", "Top_Bot", "AB_Number", "Runner_ID";
It also creates a macro variable that contains the list of variables to be used as the argument for the DEFINEDATA method call used for each Inning_Group hash table. This macro variable could also be created by querying the Dictionary.Columns table in a separate SQL procedure step or the Sashelp.Vcolumn view within the same DATA step.
Our DATA step now needs to define a HoH table with a data portion hash variable whose values will point to the individual hash tables for each group of Innings. The following code does that:
data _null_; dcl hash Innings_HoH(); Innings_HoH.defineKey("Inning_Group"); Innings_HoH.defineData("Inning_Group","Hash_Pointer"); Innings_HoH.defineDone(); dcl hash Hash_Pointer();
The variable Inning_Group is created in the code below and is the formatted value of the Inning variable. It is used as the key for our HoH table, so we can have a hash table for each distinct value of the Inning_Group variable. The variable Hash_Pointer is defined as a non-scalar variable (of type hash object). it will be assigned a value as each new Inning_Group is encountered and its hash table created in the following code.
do until (lr); set dw.Runs end=lr; Inning_Group = put(Inning,InningGrp.); if Innings_HoH.find() ne 0 then do; /* a new inning group */ Hash_Pointer = _new_ hash(multidata:"Y"); Hash_Pointer.defineKey("_n_"); Hash_Pointer.defineData(&columns); Hash_Pointer.defineDone(); Innings_HoH.add(); end; /* a new inning group */ Hash_Pointer.add(); end;
As each observation is read:
And the final step is to use the hash iterator to loop (e.g., enumerate) thru the hash tables and create an output data set for each:
dcl hiter HoH_Iter("Innings_HoH"); do while(HoH_Iter.next() = 0); Hash_Pointer.output(dataset:Inning_Group); end; run;
The following show the SAS Log notes from running this program:
NOTE: The data set WORK.INNINGS_1_3 has 26136 observations and 5 variables. NOTE: The data set WORK.INNINGS_4_6 has 26073 observations and 5 variables. NOTE: The data set WORK.INNINGS_7_9 has 26137 observations and 5 variables. NOTE: There were 78346 observations read from the data set DW.RUNS. NOTE: DATA statement used (Total process time): real time 0.09 seconds cpu time 0.07 seconds
We plan to write more articles that leverage the Hash of Hash facility as well as the DOSUBL function.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.