Splitting a SAS data set based on the value of a variable

4 Likes

in our book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study @hashman and @DonH presented the following example. It is described in the book at Program 6.12, file name of Chapter 6 Splitting Sorted Data via Hash.sas.

/* "Chapter 6 Splitting Sorted Data via Hash.sas" from the SAS Press book
      Data Management Solutions Using SAS Hash Table Operations:
      A Business Intelligence Case Study
*/

data _null_ ;
  if _n_ = 1 then do ;
    dcl hash h (multidata:"Y") ;
    h.defineKey ("_n_") ;
    h.defineData ("League", "Team_SK", "Team_Name") ;
    h.defineDone () ;
  end ;
  do until (last.League) ;
    set bizarro.Teams ;
    by League ;
    h.add() ;
  end ;
  h.output (dataset: catx ("_", "work.League", League)) ;
  h.clear() ;
run ;

It assumed the data set was sorted and so a BY group could be used to split the data set.

The follow-on question is how to do this if the data set is not sorted by the variable used to split the data set. And a related question is how to handle the case where the values perhaps need to be grouped or cannot be easily mapped to a valid SAS name.

In this article, we will present two alternative approaches (both using hash tables) that will create a data set for each value (or group of values) from the input DATA step. For these examples, we will be using the DW.Runs data set. It has a record for each run scored in each game, including the inning in which the run scored. We want to create a separate data set for each inning. Our first step is to create a lookup table that maps the inning number to the name of the data set that inning should be output to. For this program, we are using a PICTURE format; but we could have easily used a data set that we could be loaded into a hash table.

proc format;
 /* could also do this with a hash table */
 picture inninglbl low-high = "09" (prefix="Inning_") ;
run;

This format will create a separate data set for each inning. If we wanted to group them, we could use a value statement like:

value inningGrp 1-3 = "Innings_1_3"
                4-6 = "Innings_4_6"
                7-9 = "Innings_7_9"
              other = "Extras"
;

The first thing we need to do in our DATA step program is to create a list of all the values of the Inning variable. Note that for purposes of simplicity we are using Inning; we could make the name of the variable if we want more flexibility.

The following statements create a hash table and implicitly load the DW.Runs data set. Since the MULTIDATA argument tag is not specified, the resulting hash table has one item for each inning. And since there is no DEFINEDATA method call, the variable Inning is the default data portion hash variable for the hash table.

A hash iterator (Iter) is also defined to allow us to loop thru the hash table items (i.e. to enumerate them).

data _null_;
 dcl hash Innings(dataset:"DW.Runs");
 Innings.defineKey("Inning");
 Innings.defineDone();
 Innings.output(dataset:"List");
 dcl hiter Iter("Innings");

The next step is to loop thru the items in that hash table to create two variables:

the list of data set names.
the statements to output an observation to the appropriate data set.

 length datasets outputStmts $32755 elseIf $8;
 elseIf = "if";
 do while(Iter.next() = 0);
    datasets  = catx(" ",datasets,put(Inning,inninglbl.));
    outputStmts = strip(outputStmts)
               || strip(elseIf)
               || " put(Inning,inninglbl.)='"
               || put(Inning,inninglbl.)
               || "' then output "
               || put(Inning,inninglbl.)
               || ";"
    ;
    elseIf = "else if";
 end;

Once this loop has completed execution, the values of the two variables are:

Datasets
Inning_2 Inning_5 Inning_1 Inning_9 Inning_3 Inning_7 Inning_4 Inning_6 Inning_8
Statements (linefeeds added here for readability)
if put(Inning,inninglbl.)=' Inning_2' then output Inning_2;
else if put(Inning,inninglbl.)=' Inning_5' then output Inning_5;
else if put(Inning,inninglbl.)=' Inning_1' then output Inning_1;
else if put(Inning,inninglbl.)=' Inning_9' then output Inning_9;
else if put(Inning,inninglbl.)=' Inning_3' then output Inning_3;
else if put(Inning,inninglbl.)=' Inning_7' then output Inning_7;
else if put(Inning,inninglbl.)=' Inning_4' then output Inning_4;
else if put(Inning,inninglbl.)=' Inning_6' then output Inning_6;
else if put(Inning,inninglbl.)=' Inning_8' then output Inning_8;

The next step is to execute the above code in a DATA step to create the output data sets. We could use CALL EXECUTE, which would cause the code to execute in a following DATA step. Instead, we decided to use the DOSUBL function which launches a separate SAS executive to execute the code immediately.

 rc = dosubl("data "
          || strip(datasets)
          || "; set dw.runs;"
          || strip(outputStmts)
          || "run;");
 stop;
 set dw.runs;

We use a STOP statement since all the data is read in a single execution of the DATA Step. As discussed in the article The SET Statement's Compile Time Functions, the SET statement after the STOP statement is used to define the PDV host variables. for each defined hash variable. The following show the SAS Log notes from running this program:

NOTE: There were 78346 observations read from the data set DW.RUNS.
NOTE: There were 78346 observations read from the data set DW.RUNS.
NOTE: The data set WORK.INNING_2 has 8696 observations and 5 variables.
NOTE: The data set WORK.INNING_5 has 8635 observations and 5 variables.
NOTE: The data set WORK.INNING_1 has 8895 observations and 5 variables.
NOTE: The data set WORK.INNING_9 has 8533 observations and 5 variables.
NOTE: The data set WORK.INNING_3 has 8545 observations and 5 variables.
NOTE: The data set WORK.INNING_7 has 8741 observations and 5 variables.
NOTE: The data set WORK.INNING_4 has 8848 observations and 5 variables.
NOTE: The data set WORK.INNING_6 has 8590 observations and 5 variables.
NOTE: The data set WORK.INNING_8 has 8863 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.21 seconds
      cpu time            0.20 seconds

An alternative approach is to create a separate hash table for each Inning or group of Innings. This uses something we refer to in the book, and discuss in Chapter 9, as a Hash of Hashes (HoH) - a hash table with a data portion variable containing non-scalar values of type hash pointing to other hash tables. The program uses this approach.

This program demonstrates creating an output data set for groups of innings. We do that with a VALUE statement in our PROC FORMAT step.

proc format;
 /* could also do this with a hash table */
 value inningGrp 1-3 = "Innings_1_3"
                 4-6 = "Innings_4_6"
                 7-9 = "Innings_7_9"
               other = "Extras"
 ;
run;

/* this list code also be created using data driven approaches */
%let columns = "Game_SK", "Inning", "Top_Bot", "AB_Number", "Runner_ID";

It also creates a macro variable that contains the list of variables to be used as the argument for the DEFINEDATA method call used for each Inning_Group hash table. This macro variable could also be created by querying the Dictionary.Columns table in a separate SQL procedure step or the Sashelp.Vcolumn view within the same DATA step.

Our DATA step now needs to define a HoH table with a data portion hash variable whose values will point to the individual hash tables for each group of Innings. The following code does that:

data _null_;
 dcl hash Innings_HoH();
 Innings_HoH.defineKey("Inning_Group");
 Innings_HoH.defineData("Inning_Group","Hash_Pointer");
 Innings_HoH.defineDone();
 dcl hash Hash_Pointer();

The variable Inning_Group is created in the code below and is the formatted value of the Inning variable. It is used as the key for our HoH table, so we can have a hash table for each distinct value of the Inning_Group variable. The variable Hash_Pointer is defined as a non-scalar variable (of type hash object). it will be assigned a value as each new Inning_Group is encountered and its hash table created in the following code.

 do until (lr);
    set dw.Runs end=lr;
    Inning_Group = put(Inning,InningGrp.);
    if Innings_HoH.find() ne 0 then
    do;  /* a new inning group */
       Hash_Pointer = _new_ hash(multidata:"Y");
       Hash_Pointer.defineKey("_n_");
       Hash_Pointer.defineData(&columns);
       Hash_Pointer.defineDone();
       Innings_HoH.add();
    end; /* a new inning group */
    Hash_Pointer.add();
  end;

As each observation is read:

If the value of Inning Group is a new value:
- a hash table is created.
- its non-scalar identifier is added to the Hash of Hashes table as the value of Hash_Pointer
The current data row is added to the appropriate hash table as another hash item.

And the final step is to use the hash iterator to loop (e.g., enumerate) thru the hash tables and create an output data set for each:

  dcl hiter HoH_Iter("Innings_HoH");
  do while(HoH_Iter.next() = 0);
     Hash_Pointer.output(dataset:Inning_Group);
  end;
run;

The following show the SAS Log notes from running this program:

NOTE: The data set WORK.INNINGS_1_3 has 26136 observations and 5 variables.
NOTE: The data set WORK.INNINGS_4_6 has 26073 observations and 5 variables.
NOTE: The data set WORK.INNINGS_7_9 has 26137 observations and 5 variables.
NOTE: There were 78346 observations read from the data set DW.RUNS.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.07 seconds

We plan to write more articles that leverage the Hash of Hash facility as well as the DOSUBL function.

@hashman and @DonH would like to acknowledge @RichardDeVen who previously published a program that uses a Hash of Hashes approach to split a dataset based on the value of a variable.

Ksharp · ‎08-25-2018

You could use " h.definedata(all:'yes'); " to include all the variables, no need to list them all.

data class;
 set sashelp.class;
run;
proc sort data=class;
by sex;
run;

data _null_;
 if _n_=1 then do;
   if 0 then set class;
   declare hash h(dataset:'class(obs=0)',multidata:'yes');
   h.definekey('sex');
   h.definedata(all:'yes');
   h.definedone();
 end;

do until(last.sex);
 set class;
 by sex;
 h.add();
end;
h.output(dataset:sex);
h.clear();
run;

DonH · ‎08-25-2018

Great point when you want them all @Ksharp

hashman · ‎08-25-2018

@Ksharp, @DonH:

Agree that ALL:"Y" is a good shortcut when you need all variables and exactly in the sequence they are in the data set being split.

If you want them to be output in a different sequence, giving their names to the DEFINEDATA call (or successive calls) is a good way to achieve that since they are output in the sequence they are defined in the data portion of the table (rather than in the PDV).

If you need just some of them, it is of course a good idea to also KEEP= only those while reading the input file. To avoid typing the same list more than once and also to avoid typing many pesky quotes (especially if the list is rather long), I find it practical to define the list once up front and then let SAS do the rest of the job, for example:

%let vv = League Team_SK Team_Name ;

data _null_ ;
  if _n_ = 1 then do ;
    dcl hash h () ; /* Note: no MULTIDATA */
    h.defineKey ("_n_") ;
    do _n_ = 1 to countW ("&vv") ;
      h.defineData (scan ("&vv", _n_)) ;
    end ;
    h.defineDone () ;
  end ;
  do _n_ = 1 by 1 until (last.League) ; /* Note: with _n_ = 1 by 1, no MULTIDATA needed */
    set bizarro.Teams (keep = &vv) ;
    by League ;
    h.add() ;
  end ;
  h.output (dataset: catx ("_", "work.League", League)) ;
  h.clear() ;
run ;

Note that I had to type the list only once up top and didn't have to use any quotes. (By the way, this technique is described in the book.)

Also note that if you re-number the records in the DoW-loop's BY group using _N_, as shown above, it automatically renders all key-values of _N_ in each BY group unique, so MULTIDATA doesn't have to be coded.

p.s. @Ksharp: In your code, IF 0 THEN SET statement is redundant since another SET pointing to the same data set is already in the step.

DonH · ‎08-25-2018

And perhaps a more important tip/technique than

h.definedata(all:'yes');

is the use of the dataset argument tag with OBS=0 in the DECLARE statement

dataset:'class(obs=0)'

since the all option requires a value for the DATASET argument tag. The use of OBS=0 allows the data set to be specified in the DATASET argument tag while loading the data a row/observation at a time.

ChrisNZ · ‎08-26-2018

Just a couple of petty comments:

1- I think it is a good habit to left-justify formatted numeric values when using them for string comparison

2- I would have used a select statement to derive the dispatch logic.

 length datasets outputStmts $32767 ;
 outputStmts = "select(put(Inning,inninglbl. -l));";
 do while(Iter.next() = 0);
    datasets  = catx(" ",datasets,put(Inning,inninglbl.));
    outputStmts = catx( ' '
                      , outputStmts
                      , "when("       
                      , quote(put(Inning,inninglbl. -l)) 
                      , ") output" 
                      , put(Inning,inninglbl. -l)
                      , ";" )     
    ;        
 end; 
  rc = dosubl("data "
            || strip(datasets)
            || "; set runs;"
            || strip(outputStmts)
            || "otherwise; end; run;");

This does not change the value or usefulness of this article of course.

Just a (slightly) different viewpoint to maybe (slightly) enrich the article .

Splitting a SAS data set based on the value of a variable

Free course: Data Literacy Essentials

Get Started