Assessing performance is always difficult. So many variables can impact performance that, for generic questions about which approach is better, the answer is almost always IT DEPENDS. An answer that virtually no one likes!
That is why when Paul (aka @hashman) and I (aka @DonH) worked on our SAS Press Book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study, we did not try to compare performance of the hash object with alternative techniques.
Tackling this question here for a specific case study is different for the simple reason that we can engage in a Q&A thanks to the facilities of the communities site. In this article we provide an example of comparing SQL, the MERGE statement, and the hash object to merge/join two data files. We specifically chose what is referred to in SQL as a left join (more on that later). It is worth noting here that a left join becomes a right join merely by reversing the order of the tables listed. Specifically, the following two joins in SQL produce the same output data sets (acknowledging that the order of the variables might be different):
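The code below is a sketch of that equivalence, using the PITCHES and GAMES data sets (and the GAME_SK join key) that appear throughout this article:

proc sql;
   /* these two queries return the same rows and columns */
   create table leftJoin as
   select pitches.*, games.*
   from dw.pitches
        left join
        dw.games
   on games.game_sk = pitches.game_sk;

   create table rightJoin as
   select pitches.*, games.*
   from dw.games
        right join
        dw.pitches
   on games.game_sk = pitches.game_sk;
quit;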
Likewise, if a MERGE statement is used, the following two merges produce the same output results (acknowledging that the order of the variables might be different):
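Again a sketch, assuming both data sets have already been sorted by GAME_SK (the data set names mirror the work copies used later in this article):

data mergeOne;
   merge pitches(in=p) games;
   by game_sk;
   if p;
run;

data mergeTwo;
   merge games pitches(in=p);
   by game_sk;
   if p;
run;

Note that this equivalence relies on GAME_SK being the only variable common to the two data sets; if they shared other variables, the data set listed last would overwrite the values contributed by the one listed first.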
For the purposes of this comparison, we ran the following examples using SAS University Edition on Windows 10. The program is attached to this article so readers can download it and run the comparisons in their environment. The datasets used are available in the downloadable zip file for our book.
The SAS log follows.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 options fullstimer;
74
75 proc sql noprint;
76 create table sqlMerge as
77 select pitches.*
78 ,games.*
79 from dw.pitches
80 left join
81 dw.games
82 on games.game_sk = pitches.game_sk
83 ;
WARNING: Variable Game_SK already exists on file WORK.SQLMERGE.
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
84 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 4.73 seconds
user cpu time 0.61 seconds
system cpu time 1.06 seconds
memory 160496.07k
OS Memory 189152.00k
Timestamp 01/01/2019 03:31:53 PM
Step Count 35 Switch Count 4
Page Faults 7
Page Reclaims 38727
Page Swaps 0
Voluntary Context Switches 8251
Involuntary Context Switches 1304
Block Input Operations 1024
Block Output Operations 618032
@DonH is obsessed with eliminating WARNING messages - even ones that are expected and don't indicate a problem. If a log contains expected WARNING or ERROR messages, it is easy to miss the unexpected ones. So the above SQL code has been revised to not use games.*; instead we create a macro variable containing an SQL-compliant list of the variables (excluding the join key). This adds a bit of time, but having a WARNING-free log is worth it.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sql noprint;
74 /* add a dictionary.columns table to create the list of variables, excluding the duplicate key
75 in order to eliminate the duplicate column name message
76 */
77 select catx('.',memname,name) into:gamesColumns separated by ", "
78 from dictionary.columns
79 where libname = "DW"
80 and memname = "GAMES"
81 and upcase(name) ne "GAME_SK"
82 ;
83 create table sqlMerge as
84 select pitches.*
85 ,&gamesColumns
86 from dw.pitches
87 left join
88 dw.games
89 on games.game_sk = pitches.game_sk
90 ;
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
91 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.83 seconds
user cpu time 0.52 seconds
system cpu time 1.24 seconds
memory 160445.87k
OS Memory 189920.00k
Timestamp 01/01/2019 03:33:12 PM
Step Count 41 Switch Count 5
Page Faults 2
Page Reclaims 39037
Page Swaps 0
Voluntary Context Switches 11408
Involuntary Context Switches 3249
Block Input Operations 456
Block Output Operations 618104
As expected, SQL uses quite a bit of memory - but not an excessive amount.
In order to use a MERGE statement, both data sets must be either sorted or indexed on the join key (i.e., the BY variable). So this code snippet first sorts the data.
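For reference, here is a sketch of the index alternative (this assumes indexes may be created on the DW data sets; the timed comparison below uses the sorts):

proc datasets lib=dw nolist;
   /* create a simple index on the join key in each data set */
   modify games;
   index create game_sk;
   modify pitches;
   index create game_sk;
quit;

With the indexes in place, the MERGE ... BY step can read both data sets in GAME_SK order without a prior PROC SORT.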
Again, the SAS Log follows.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sort data=dw.games out=games;
74 by game_sk;
75 run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: The data set WORK.GAMES has 2880 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.23 seconds
user cpu time 0.01 seconds
system cpu time 0.02 seconds
memory 1499.56k
OS Memory 31404.00k
Timestamp 01/01/2019 03:35:21 PM
Step Count 47 Switch Count 2
Page Faults 1
Page Reclaims 155
Page Swaps 0
Voluntary Context Switches 41
Involuntary Context Switches 11
Block Input Operations 688
Block Output Operations 528
76
77 proc sort data=dw.pitches out=pitches;
78 by game_sk;
79 run;
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.PITCHES has 875764 observations and 16 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 1.56 seconds
user cpu time 0.32 seconds
system cpu time 0.75 seconds
memory 167312.82k
OS Memory 197144.00k
Timestamp 01/01/2019 03:35:23 PM
Step Count 48 Switch Count 7
Page Faults 4
Page Reclaims 41002
Page Swaps 0
Voluntary Context Switches 12076
Involuntary Context Switches 6594
Block Input Operations 824
Block Output Operations 247192
80
81 data dsMerge;
82 merge Pitches(in=pitches) Games;
83 by game_sk;
84 if pitches;
85 run;
NOTE: There were 875764 observations read from the data set WORK.PITCHES.
NOTE: There were 2880 observations read from the data set WORK.GAMES.
NOTE: The data set WORK.DSMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
real time 1.52 seconds
user cpu time 0.19 seconds
system cpu time 0.21 seconds
memory 1064.93k
OS Memory 31904.00k
Timestamp 01/01/2019 03:35:24 PM
Step Count 49 Switch Count 5
Page Faults 0
Page Reclaims 78
Page Swaps 0
Voluntary Context Switches 105
Involuntary Context Switches 28
Block Input Operations 0
Block Output Operations 357272
It is worth noting that the MERGE statement in the DATA step uses less time and far less memory than the SQL step. Again, that is not surprising. And the time difference is even more dramatic if the sort steps are not needed.
The use of the hash object to perform table lookup operations is a widely accepted technique. In fact, a left join is simply a table lookup that is not restricted to one result field (as a lookup is, for example, when using a format with the PUT function).
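For contrast, a minimal sketch of such a single-result lookup with a format (the format and values here are hypothetical):

proc format;
   /* map a team code to a single value - the park name */
   value $parkfmt 'BOS'='Fenway Park'
                  'CHC'='Wrigley Field';
run;

data _null_;
   team = 'BOS';
   park = put(team, $parkfmt.);
   put team= park=;
run;

A format can return only one value per key; a hash table FIND call can populate any number of PDV host variables at once.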
The following code implements a left join. The SQL step creates a macro variable whose value is the list of fields to be obtained (i.e., looked up) from the GAMES data set, so the list can be used in a CALL MISSING statement to handle the contingency of no match. We don't want the PDV host variables from the last successful FIND operation to be reused when the current key has no match.
A DoW loop is used to read the data. Since the DATA step executes only once, there is no need to embed the code defining the hash table in an IF _N_ = 1 condition.
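For comparison, a sketch of what the step would look like without the DoW loop (hashMergeAlt is a hypothetical name; &gamesColumns is the macro variable created in the SQL step below). Because the step now executes once per observation, the hash definition must be guarded:

data hashMergeAlt;
   if _n_ = 1 then do;
      if 0 then set dw.games;   /* define the PDV host variables */
      dcl hash games(dataset:"dw.games");
      games.defineKey("Game_SK");
      games.defineData(all:"yes");
      games.defineDone();
   end;
   set dw.pitches;
   call missing(&gamesColumns);
   rc = games.find();
   drop rc;
run;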
Note that the hash merge uses less time than a DATA step MERGE (assuming the sorts are needed), but it does use more memory - again, as expected. Note, however, that it uses only a small fraction of the memory reported for the SQL step. In addition, as discussed in section 11.5: Data Portion (Hash Index) and 11.5.1: Joining Data of the book, the amount of memory needed by a hash merge can be substantially reduced by storing a record key or pointer in the hash table instead of the full set of data columns, as sketched below.
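A minimal sketch of that idea (our sketch here, not the book's exact code): load only the observation number of each GAMES record into the hash table and use the POINT= option to fetch the matching record. For simplicity this version outputs only matching rows:

data hashIndexMerge;
   dcl hash games();
   games.defineKey("Game_SK");
   games.defineData("_rid");
   games.defineDone();
   /* store just the key and the observation number */
   do _rid = 1 to n;
      set dw.games(keep=game_sk) point=_rid nobs=n;
      games.add();
   end;
   do until(lr);
      set dw.pitches end=lr;
      if games.find() = 0 then do;
         set dw.games point=_rid;   /* fetch the full GAMES record */
         output;
      end;
   end;
   stop;
run;

The memory saving comes at the cost of extra random reads of DW.GAMES.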
The hash merge has another possible advantage over both the SQL and MERGE approaches: no sorting is needed (while not obvious, SQL often has to sort the data - we just don't see that explicitly). This can be particularly valuable if the original order of the main data set (in this case PITCHES) is to be preserved.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sql noprint;
74 select name into:gamesColumns separated by ", "
75 from dictionary.columns
76 where libname = "DW"
77 and memname = "GAMES"
78 and upcase(name) ne "GAME_SK"
79 ;
80 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 5317.18k
OS Memory 35236.00k
Timestamp 01/01/2019 03:47:58 PM
Step Count 68 Switch Count 0
Page Faults 0
Page Reclaims 47
Page Swaps 0
Voluntary Context Switches 13
Involuntary Context Switches 1
Block Input Operations 0
Block Output Operations 0
81
82 data hashMerge;
83 dcl hash games(dataset:"dw.games");
84 games.defineKey("Game_SK");
85 games.defineData(all:"yes");
86 games.defineDone();
87 do until(lr);
88 set dw.pitches end=lr;
89 call missing(&gamesColumns);
90 games.find();
91 output;
92 end;
93 stop;
94 if 0 then set dw.games;
95 run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.HASHMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
real time 2.63 seconds
user cpu time 0.12 seconds
system cpu time 0.75 seconds
memory 1674.57k
OS Memory 31148.00k
Timestamp 01/01/2019 03:48:01 PM
Step Count 69 Switch Count 5
Page Faults 3
Page Reclaims 433
Page Swaps 0
Voluntary Context Switches 6435
Involuntary Context Switches 818
Block Input Operations 512
Block Output Operations 357872
It goes without saying that it is necessary to confirm that these three techniques all produce the same result sets - allowing for differences in the column and row order. The following code, included in the attached sample program, was used to verify that all three produced the same result set.
Since both the SQL and MERGE results were the same as the hash results, transitivity means we need not compare the SQL and MERGE results to each other.
/* The order of the rows and columns may be different, so sort on unique keys
   so that PROC COMPARE can be used to validate that the data are the same.
*/
proc sort data=sqlmerge out=sqlmerge;
by game_sk inning top_bot ab_number pitch_number;
run;
proc sort data=dsmerge out=dsmerge;
by game_sk inning top_bot ab_number pitch_number;
run;
proc sort data=hashmerge out=hashmerge;
by game_sk inning top_bot ab_number pitch_number;
run;
proc compare data=sqlMerge compare=hashMerge;
run;
proc compare data=dsMerge compare=hashMerge;
run;
We chose to present this example initially as a left join in order to facilitate the analogy to table lookup (as mentioned in the hash merge discussion). In point of fact, the two data sets we joined/merged have exactly the same set of matching keys, so we could have done an inner join. The following log shows the results of an inner join using SQL, with an additional option (_METHOD) to tell us how SQL implemented the join.
Of particular interest is that for this join the SQL optimizer has decided to use a hash method. As a result, it uses less time and memory - with values comparable to the hash merge approach. The sqxjhsh method only applies to equi/inner joins, and whether it is chosen can depend on the values of other options (e.g., buffersize as well as the magic parameter). That is a topic for another day.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sql noprint _method;
74 /* Illustrate that an inner/equijoin actually uses a hash method */
75 select catx('.',memname,name) into:gamesColumns separated by ", "
76 from dictionary.columns
77 where libname = "DW"
78 and memname = "GAMES"
79 and upcase(name) ne "GAME_SK"
80 ;
NOTE: SQL execution methods chosen are:
sqxslct
sqxfil
sqxsob( DICTIONARY.COLUMNS )
81 create table sqlMerge as
82 select pitches.*
83 ,&gamesColumns
84 from dw.pitches
85 ,dw.games
86 where games.game_sk = pitches.game_sk
87 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxjhsh
sqxsrc( DW.PITCHES )
sqxsrc( DW.GAMES )
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
88 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 2.56 seconds
user cpu time 0.08 seconds
system cpu time 0.76 seconds
memory 5732.96k
OS Memory 35492.00k
Timestamp 01/01/2019 03:41:08 PM
Step Count 62 Switch Count 4
Page Faults 4
Page Reclaims 278
Page Swaps 0
Voluntary Context Switches 6794
Involuntary Context Switches 2572
Block Input Operations 552
Block Output Operations 357384
Before concluding, we wanted to share the hash merge code for an equi/inner join. It is very similar to the code presented above. The differences are that the SQL step building the macro variable (and the CALL MISSING statement that used it) is no longer needed, and that the return code of the FIND method is now checked so that only matching rows are output.
The performance is similar (as expected) to both the hash merge left join and the SQL equi/inner join. And, as with the left join, the hash merge uses roughly 1/3rd the memory of the sqxjhsh method.
The SAS log follows.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 data hashMerge;
74 /* hash merge inner join */
75 dcl hash games(dataset:"dw.games");
76 games.defineKey("Game_SK");
77 games.defineData(all:"yes");
78 games.defineDone();
79 do until(lr);
80 set dw.pitches end=lr;
81 if games.find() = 0 then output;
82 end;
83 stop;
84 if 0 then set dw.games;
85 run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.HASHMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
real time 2.79 seconds
user cpu time 0.10 seconds
system cpu time 0.80 seconds
memory 1700.07k
OS Memory 30380.00k
Timestamp 01/01/2019 08:21:35 PM
Step Count 58 Switch Count 5
Page Faults 3
Page Reclaims 535
Page Swaps 0
Voluntary Context Switches 6548
Involuntary Context Switches 25
Block Input Operations 504
Block Output Operations 359032
Performance comparisons are always tricky. As stated above, the results depend on many factors, and the above is just one comparison for one set of data. So take it for what it is worth.
Feel free to download the sample program and run your own comparisons in your environment. Please post any results you find interesting in the comments. Likewise, feel free to modify the program to run it against your own data.
Suffice it to say that there are lots of other options and techniques that could be leveraged to address joining data efficiently. As mentioned above, please let us know your thoughts and perhaps that will lead to a more detailed article about those techniques.
Nice article!
I would have expected a hash table in a SQL step to be faster than in a data step as it is not user-managed, but rather coded in a less flexible but (hopefully) more optimised manner. Compiled should be faster than interpreted.
This article [study on the best method to join two tables] might also interest you.
And maybe this one too [speeding up a 2-table join].