Assessing performance is always difficult. So many variables can impact performance that, for generic questions about which approach is better, the answer is almost always IT DEPENDS. An answer that virtually no one likes!
That is why when Paul (aka @hashman) and I (aka @DonH) worked on our SAS Press Book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study, we did not try to compare performance of the hash object with alternative techniques.
Tackling this question here for a specific case study is different for the simple reason that we can engage in a Q&A thanks to the facilities of the communities site. In this article we provide an example of comparing SQL, the MERGE statement, and the hash object to merge/join two data files. We specifically chose what is referred to in SQL as a left join (more on that later). It is worth noting here that a left join becomes a right join merely by reversing the order of the tables listed. Specifically, the following two joins in SQL produce the same output data sets (acknowledging that the order of the variables might be different):
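The code below is a sketch of that equivalence, using the PITCHES and GAMES data sets (and the GAME_SK join key) that appear throughout this article:

proc sql;
   /* these two queries return the same rows and columns */
   create table leftJoin as
   select pitches.*, games.*
   from dw.pitches
        left join
        dw.games
   on games.game_sk = pitches.game_sk;

   create table rightJoin as
   select pitches.*, games.*
   from dw.games
        right join
        dw.pitches
   on games.game_sk = pitches.game_sk;
quit;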
Likewise, if a MERGE statement is used, the following two merges produce the same output results (acknowledging that the order of the variables might be different):
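Again a sketch, assuming both data sets have already been sorted by GAME_SK (the data set names mirror the work copies used later in this article):

data mergeOne;
   merge pitches(in=p) games;
   by game_sk;
   if p;
run;

data mergeTwo;
   merge games pitches(in=p);
   by game_sk;
   if p;
run;

Note that this equivalence relies on GAME_SK being the only variable common to the two data sets; if they shared other variables, the data set listed last would overwrite the values contributed by the one listed first.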
For the purposes of this comparison, we ran the following examples using SAS University Edition on Windows 10. The program is attached to this article so readers can download it and run the comparisons in their environment. The datasets used are available in the downloadable zip file for our book.
The SAS log follows.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 options fullstimer;
74
75 proc sql noprint;
76 create table sqlMerge as
77 select pitches.*
78 ,games.*
79 from dw.pitches
80 left join
81 dw.games
82 on games.game_sk = pitches.game_sk
83 ;
WARNING: Variable Game_SK already exists on file WORK.SQLMERGE.
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
84 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 4.73 seconds
user cpu time 0.61 seconds
system cpu time 1.06 seconds
memory 160496.07k
OS Memory 189152.00k
Timestamp 01/01/2019 03:31:53 PM
Step Count 35 Switch Count 4
Page Faults 7
Page Reclaims 38727
Page Swaps 0
Voluntary Context Switches 8251
Involuntary Context Switches 1304
Block Input Operations 1024
Block Output Operations 618032
@DonH is obsessed with eliminating WARNING messages - even ones that are expected and don't indicate a problem. If a log contains expected WARNING or ERROR messages, it is easy to miss the unexpected ones. So the above SQL code has been revised to not use games.*; instead we create a macro variable containing an SQL-compliant list of the variables (excluding the join key). This adds a bit of time, but having a WARNING-free log is worth it.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sql noprint;
74 /* add a dictionary.columns table to create the list of variables, excluding the duplicate key
75 in order to eliminate the duplicate column name message
76 */
77 select catx('.',memname,name) into:gamesColumns separated by ", "
78 from dictionary.columns
79 where libname = "DW"
80 and memname = "GAMES"
81 and upcase(name) ne "GAME_SK"
82 ;
83 create table sqlMerge as
84 select pitches.*
85 ,&gamesColumns
86 from dw.pitches
87 left join
88 dw.games
89 on games.game_sk = pitches.game_sk
90 ;
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
91 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.83 seconds
user cpu time 0.52 seconds
system cpu time 1.24 seconds
memory 160445.87k
OS Memory 189920.00k
Timestamp 01/01/2019 03:33:12 PM
Step Count 41 Switch Count 5
Page Faults 2
Page Reclaims 39037
Page Swaps 0
Voluntary Context Switches 11408
Involuntary Context Switches 3249
Block Input Operations 456
Block Output Operations 618104
As expected, SQL uses quite a bit of memory - but not an excessive amount.
In order to use a MERGE statement, both data sets must be either sorted or indexed on the join key (i.e., the BY variable). So this code snippet first sorts the data.
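For reference, here is a sketch of the index alternative (this assumes indexes may be created on the DW data sets; the timed comparison below uses the sorts):

proc datasets lib=dw nolist;
   /* create a simple index on the join key in each data set */
   modify games;
   index create game_sk;
   modify pitches;
   index create game_sk;
quit;

With the indexes in place, the MERGE ... BY step can read both data sets in GAME_SK order without a prior PROC SORT.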
Again, the SAS Log follows.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sort data=dw.games out=games;
74 by game_sk;
75 run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: The data set WORK.GAMES has 2880 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.23 seconds
user cpu time 0.01 seconds
system cpu time 0.02 seconds
memory 1499.56k
OS Memory 31404.00k
Timestamp 01/01/2019 03:35:21 PM
Step Count 47 Switch Count 2
Page Faults 1
Page Reclaims 155
Page Swaps 0
Voluntary Context Switches 41
Involuntary Context Switches 11
Block Input Operations 688
Block Output Operations 528
76
77 proc sort data=dw.pitches out=pitches;
78 by game_sk;
79 run;
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.PITCHES has 875764 observations and 16 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 1.56 seconds
user cpu time 0.32 seconds
system cpu time 0.75 seconds
memory 167312.82k
OS Memory 197144.00k
Timestamp 01/01/2019 03:35:23 PM
Step Count 48 Switch Count 7
Page Faults 4
Page Reclaims 41002
Page Swaps 0
Voluntary Context Switches 12076
Involuntary Context Switches 6594
Block Input Operations 824
Block Output Operations 247192
80
81 data dsMerge;
82 merge Pitches(in=pitches) Games;
83 by game_sk;
84 if pitches;
85 run;
NOTE: There were 875764 observations read from the data set WORK.PITCHES.
NOTE: There were 2880 observations read from the data set WORK.GAMES.
NOTE: The data set WORK.DSMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
real time 1.52 seconds
user cpu time 0.19 seconds
system cpu time 0.21 seconds
memory 1064.93k
OS Memory 31904.00k
Timestamp 01/01/2019 03:35:24 PM
Step Count 49 Switch Count 5
Page Faults 0
Page Reclaims 78
Page Swaps 0
Voluntary Context Switches 105
Involuntary Context Switches 28
Block Input Operations 0
Block Output Operations 357272
It is worth noting that the MERGE statement in the DATA step uses less time and far less memory than the SQL step. Again, that is not surprising. And the time difference is even more dramatic if the sort steps are not needed.
The use of the hash object to perform table lookup operations is a widely accepted technique. In fact, a left join is simply a table lookup that is not restricted to one result field (as a lookup is, for example, when using a format with the PUT function).
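For contrast, a minimal sketch of such a single-result lookup with a format (the format and values here are hypothetical):

proc format;
   /* map a team code to a single value - the park name */
   value $parkfmt 'BOS'='Fenway Park'
                  'CHC'='Wrigley Field';
run;

data _null_;
   team = 'BOS';
   park = put(team, $parkfmt.);
   put team= park=;
run;

A format can return only one value per key; a hash table FIND call can populate any number of PDV host variables at once.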
The following code implements a left join. The SQL step creates a macro variable whose value is the list of fields to be obtained (i.e., looked up) from the GAMES data set, so the list can be used in a CALL MISSING statement to handle the contingency of no match. We don't want the PDV host variables from the last successful FIND operation to be reused when the current key has no match.
A DoW loop is used to read the data. Since the DATA step executes only once, there is no need to embed the code defining the hash table in an IF _N_ = 1 condition.
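For comparison, a sketch of what the step would look like without the DoW loop (hashMergeAlt is a hypothetical name; &gamesColumns is the macro variable created in the SQL step below). Because the step now executes once per observation, the hash definition must be guarded:

data hashMergeAlt;
   if _n_ = 1 then do;
      if 0 then set dw.games;   /* define the PDV host variables */
      dcl hash games(dataset:"dw.games");
      games.defineKey("Game_SK");
      games.defineData(all:"yes");
      games.defineDone();
   end;
   set dw.pitches;
   call missing(&gamesColumns);
   rc = games.find();
   drop rc;
run;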
Note that the hash merge uses less time than a DATA step MERGE (assuming the sorts are needed), but it does use more memory - again, as expected. Note, however, that it uses only a small fraction of the memory reported for the SQL step. In addition, as discussed in section 11.5: Data Portion (Hash Index) and 11.5.1: Joining Data of the book, the amount of memory needed by a hash merge can be substantially reduced by storing a record key or pointer in the hash table instead of the full set of data columns, as sketched below.
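A minimal sketch of that idea (our sketch here, not the book's exact code): load only the observation number of each GAMES record into the hash table and use the POINT= option to fetch the matching record. For simplicity this version outputs only matching rows:

data hashIndexMerge;
   dcl hash games();
   games.defineKey("Game_SK");
   games.defineData("_rid");
   games.defineDone();
   /* store just the key and the observation number */
   do _rid = 1 to n;
      set dw.games(keep=game_sk) point=_rid nobs=n;
      games.add();
   end;
   do until(lr);
      set dw.pitches end=lr;
      if games.find() = 0 then do;
         set dw.games point=_rid;   /* fetch the full GAMES record */
         output;
      end;
   end;
   stop;
run;

The memory saving comes at the cost of extra random reads of DW.GAMES.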
The hash merge has another possible advantage over both the SQL and MERGE approaches: no sorting is needed (while not obvious, SQL often has to sort the data - we just don't see that explicitly). This can be particularly valuable if the original order of the main data set (in this case PITCHES) is to be preserved.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sql noprint;
74 select name into:gamesColumns separated by ", "
75 from dictionary.columns
76 where libname = "DW"
77 and memname = "GAMES"
78 and upcase(name) ne "GAME_SK"
79 ;
80 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 5317.18k
OS Memory 35236.00k
Timestamp 01/01/2019 03:47:58 PM
Step Count 68 Switch Count 0
Page Faults 0
Page Reclaims 47
Page Swaps 0
Voluntary Context Switches 13
Involuntary Context Switches 1
Block Input Operations 0
Block Output Operations 0
81
82 data hashMerge;
83 dcl hash games(dataset:"dw.games");
84 games.defineKey("Game_SK");
85 games.defineData(all:"yes");
86 games.defineDone();
87 do until(lr);
88 set dw.pitches end=lr;
89 call missing(&gamesColumns);
90 games.find();
91 output;
92 end;
93 stop;
94 if 0 then set dw.games;
95 run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.HASHMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
real time 2.63 seconds
user cpu time 0.12 seconds
system cpu time 0.75 seconds
memory 1674.57k
OS Memory 31148.00k
Timestamp 01/01/2019 03:48:01 PM
Step Count 69 Switch Count 5
Page Faults 3
Page Reclaims 433
Page Swaps 0
Voluntary Context Switches 6435
Involuntary Context Switches 818
Block Input Operations 512
Block Output Operations 357872
It goes without saying that it is necessary to confirm that these three techniques all produce the same result sets - allowing for differences in the column and row order. The following code, included in the attached sample program, was used to verify that all three produced the same result set.
Since both the SQL and MERGE results were the same as the hash results, transitivity means we need not compare the SQL and MERGE results to each other.
/* The order of the rows and columns may be different, so sort on unique keys
   so that PROC COMPARE can be used to validate that the data are the same.
*/
proc sort data=sqlmerge out=sqlmerge;
by game_sk inning top_bot ab_number pitch_number;
run;
proc sort data=dsmerge out=dsmerge;
by game_sk inning top_bot ab_number pitch_number;
run;
proc sort data=hashmerge out=hashmerge;
by game_sk inning top_bot ab_number pitch_number;
run;
proc compare data=sqlMerge compare=hashMerge;
run;
proc compare data=dsMerge compare=hashMerge;
run;
We chose to present this example initially as a left join in order to facilitate the analogy to table lookup (as mentioned in the hash merge discussion). In point of fact, the two data sets we joined/merged have exactly the same set of matching keys, so we could have done an inner join. The following log shows the results of an inner join using SQL, with an additional option (_METHOD) to tell us how SQL implemented the join.
Of particular interest is that for this join the SQL optimizer has decided to use a hash method. As a result, it uses less time and memory - with values comparable to the hash merge approach. The sqxjhsh method only applies to equi/inner joins, and whether it is chosen can depend on the values of other options (e.g., buffersize as well as the magic parameter). That is a topic for another day.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc sql noprint _method;
74 /* Illustrate that an inner/equijoin actually uses a hash method */
75 select catx('.',memname,name) into:gamesColumns separated by ", "
76 from dictionary.columns
77 where libname = "DW"
78 and memname = "GAMES"
79 and upcase(name) ne "GAME_SK"
80 ;
NOTE: SQL execution methods chosen are:
sqxslct
sqxfil
sqxsob( DICTIONARY.COLUMNS )
81 create table sqlMerge as
82 select pitches.*
83 ,&gamesColumns
84 from dw.pitches
85 ,dw.games
86 where games.game_sk = pitches.game_sk
87 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxjhsh
sqxsrc( DW.PITCHES )
sqxsrc( DW.GAMES )
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
88 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 2.56 seconds
user cpu time 0.08 seconds
system cpu time 0.76 seconds
memory 5732.96k
OS Memory 35492.00k
Timestamp 01/01/2019 03:41:08 PM
Step Count 62 Switch Count 4
Page Faults 4
Page Reclaims 278
Page Swaps 0
Voluntary Context Switches 6794
Involuntary Context Switches 2572
Block Input Operations 552
Block Output Operations 357384
Before concluding, we wanted to share the hash merge code for an equi/inner join. It is very similar to the code presented above. The differences are that the SQL step building the macro variable (and the CALL MISSING statement that used it) is no longer needed, and that the return code of the FIND method is now checked so that only matching rows are output.
The performance is similar (as expected) to both the hash merge left join and the SQL equi/inner join. And, as with the left join, the hash merge uses roughly 1/3rd the memory of the sqxjhsh method.
The SAS log follows.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 data hashMerge;
74 /* hash merge inner join */
75 dcl hash games(dataset:"dw.games");
76 games.defineKey("Game_SK");
77 games.defineData(all:"yes");
78 games.defineDone();
79 do until(lr);
80 set dw.pitches end=lr;
81 if games.find() = 0 then output;
82 end;
83 stop;
84 if 0 then set dw.games;
85 run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.HASHMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
real time 2.79 seconds
user cpu time 0.10 seconds
system cpu time 0.80 seconds
memory 1700.07k
OS Memory 30380.00k
Timestamp 01/01/2019 08:21:35 PM
Step Count 58 Switch Count 5
Page Faults 3
Page Reclaims 535
Page Swaps 0
Voluntary Context Switches 6548
Involuntary Context Switches 25
Block Input Operations 504
Block Output Operations 359032
Performance comparisons are always tricky. As stated above, the results depend on many factors, and the above is just one comparison for one set of data. So take it for what it is worth.
Feel free to download the sample program and run your own comparisons in your environment. Please post any results you find interesting in the comments. Likewise, feel free to modify the program to run it against your own data.
Suffice it to say that there are lots of other options and techniques that could be leveraged to address joining data efficiently. As mentioned above, please let us know your thoughts and perhaps that will lead to a more detailed article about those techniques.
Nice article!
I would have expected a hash table in a SQL step to be faster than in a data step as it is not user-managed, but rather coded in a less flexible but (hopefully) more optimised manner. Compiled should be faster than interpreted.
This article [study on the best method to join two tables] might also interest you.
And maybe this one too [speeding up a 2-table join].