Assessing performance is always difficult. So many variables can impact performance that when asked the generic question of which approach is better, the answer is almost always IT DEPENDS. An answer that virtually no one likes!
That is why when Paul (aka @hashman) and I (aka @DonH) worked on our SAS Press Book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study, we did not try to compare performance of the hash object with alternative techniques.
Tackling this question here for a specific case study is different for the simple reason that we can engage in a Q&A thanks to the facilities of the communities site. In this article we provide an example of comparing SQL, the MERGE statement, and the hash object to merge/join two data files. We specifically chose what is referred to in SQL as a left join (more on that later). It is worth noting here that a left join becomes a right join merely by reversing the order of the tables listed. Specifically, the following two joins in SQL produce the same output data sets (acknowledging that the order of the variables might be different):
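For example, a sketch using the PITCHES and GAMES data sets from our book (the SELECT clause is simplified to select *):

```sas
/* left join: every row of PITCHES is kept */
proc sql;
   create table left_result as
   select *
   from dw.pitches left join dw.games
   on games.game_sk = pitches.game_sk;
quit;

/* reversing the order of the tables turns it into a right join
   that produces the same output data set */
proc sql;
   create table right_result as
   select *
   from dw.games right join dw.pitches
   on games.game_sk = pitches.game_sk;
quit;
```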
Likewise, if a MERGE statement is used, these produce the same output results (acknowledging that the order of the variables might be different):
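For example, a sketch assuming both data sets have already been sorted by GAME_SK:

```sas
/* keep all rows from PITCHES, regardless of where it appears
   in the MERGE statement */
data merge1;
   merge pitches(in=inPitches) games;
   by game_sk;
   if inPitches;
run;

data merge2;
   merge games pitches(in=inPitches);
   by game_sk;
   if inPitches;
run;
```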
For the purposes of this comparison, we ran the following examples using SAS University Edition on Windows 10. The program is attached to this article so readers can download it and run the comparisons in their environment. The datasets used are available in the downloadable zip file for our book.
The SAS log follows.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72   
73   options fullstimer;
74   
75   proc sql noprint;
76      create table sqlMerge as
77      select pitches.*
78            ,games.*
79      from dw.pitches
80           left join
81           dw.games
82      on games.game_sk = pitches.game_sk
83      ;
WARNING: Variable Game_SK already exists on file WORK.SQLMERGE.
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
84   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           4.73 seconds
      user cpu time       0.61 seconds
      system cpu time     1.06 seconds
      memory              160496.07k
      OS Memory           189152.00k
      Timestamp           01/01/2019 03:31:53 PM
      Step Count                        35  Switch Count  4
      Page Faults                       7
      Page Reclaims                     38727
      Page Swaps                        0
      Voluntary Context Switches        8251
      Involuntary Context Switches      1304
      Block Input Operations            1024
      Block Output Operations           618032
@DonH is obsessed with eliminating WARNING messages - even ones that are expected and don't indicate a problem. When a log contains expected WARNING or ERROR messages, it is easy to miss the unexpected ones. So the above SQL code has been revised to not use games.*; instead we create a macro variable containing an SQL-compliant list of the variables (excluding the join key). This adds a bit of time, but having a WARNING-free log is worth it.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72   
73   proc sql noprint;
74      /* add a dictionary.columns query to create the list of variables, excluding the duplicate key
75         in order to eliminate the duplicate column name message
76      */
77      select catx('.',memname,name) into:gamesColumns separated by ", "
78      from dictionary.columns
79      where libname = "DW"
80        and memname = "GAMES"
81        and upcase(name) ne "GAME_SK"
82      ;
83      create table sqlMerge as
84      select pitches.*
85            ,&gamesColumns
86      from dw.pitches
87           left join
88           dw.games
89      on games.game_sk = pitches.game_sk
90      ;
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
91   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.83 seconds
      user cpu time       0.52 seconds
      system cpu time     1.24 seconds
      memory              160445.87k
      OS Memory           189920.00k
      Timestamp           01/01/2019 03:33:12 PM
      Step Count                        41  Switch Count  5
      Page Faults                       2
      Page Reclaims                     39037
      Page Swaps                        0
      Voluntary Context Switches        11408
      Involuntary Context Switches      3249
      Block Input Operations            456
      Block Output Operations           618104
As expected, SQL uses quite a bit of memory - but not an excessive amount.
In order to use a MERGE statement, both data sets must be either sorted or indexed on the join key (i.e., the BY variable). So this code snippet first sorts the data.
Again, the SAS Log follows.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72   
73   proc sort data=dw.games out=games;
74      by game_sk;
75   run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: The data set WORK.GAMES has 2880 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.23 seconds
      user cpu time       0.01 seconds
      system cpu time     0.02 seconds
      memory              1499.56k
      OS Memory           31404.00k
      Timestamp           01/01/2019 03:35:21 PM
      Step Count                        47  Switch Count  2
      Page Faults                       1
      Page Reclaims                     155
      Page Swaps                        0
      Voluntary Context Switches        41
      Involuntary Context Switches      11
      Block Input Operations            688
      Block Output Operations           528
76   
77   proc sort data=dw.pitches out=pitches;
78      by game_sk;
79   run;
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.PITCHES has 875764 observations and 16 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           1.56 seconds
      user cpu time       0.32 seconds
      system cpu time     0.75 seconds
      memory              167312.82k
      OS Memory           197144.00k
      Timestamp           01/01/2019 03:35:23 PM
      Step Count                        48  Switch Count  7
      Page Faults                       4
      Page Reclaims                     41002
      Page Swaps                        0
      Voluntary Context Switches        12076
      Involuntary Context Switches      6594
      Block Input Operations            824
      Block Output Operations           247192
80   
81   data dsMerge;
82      merge Pitches(in=pitches) Games;
83      by game_sk;
84      if pitches;
85   run;
NOTE: There were 875764 observations read from the data set WORK.PITCHES.
NOTE: There were 2880 observations read from the data set WORK.GAMES.
NOTE: The data set WORK.DSMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
      real time           1.52 seconds
      user cpu time       0.19 seconds
      system cpu time     0.21 seconds
      memory              1064.93k
      OS Memory           31904.00k
      Timestamp           01/01/2019 03:35:24 PM
      Step Count                        49  Switch Count  5
      Page Faults                       0
      Page Reclaims                     78
      Page Swaps                        0
      Voluntary Context Switches        105
      Involuntary Context Switches      28
      Block Input Operations            0
      Block Output Operations           357272
It is worth noting that using the MERGE statement in the DATA step uses less time and far less memory than the SQL step. Again, that is not surprising. And the time difference is even more dramatic if the sort steps are not needed.
The use of the hash object to perform table lookup operations is a widely accepted technique. In fact, a left join is simply a table lookup that is not restricted to one result field (i.e., as in using a format with the put function).
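To illustrate that analogy, a format can look up only a single result field per key. The sketch below (which assumes a hypothetical character column Home_Team in the GAMES data set) builds a format from GAMES and uses the PUT function to retrieve that one field; the hash object generalizes this to any number of data portion fields:

```sas
/* build a numeric format mapping Game_SK to the (hypothetical) Home_Team */
data cntlin;
   set dw.games(keep=game_sk home_team
                rename=(game_sk=start home_team=label));
   retain fmtname "gamehome" type "n";
run;

proc format cntlin=cntlin;
run;

/* single-field "left join" via table lookup with the PUT function */
data lookup;
   set dw.pitches;
   length home_team $40;
   home_team = put(game_sk, gamehome.);
run;
```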
The following code implements a left join. The SQL step creates a macro variable whose value is the list of fields to be obtained (i.e., looked up) from the GAMES data set, so that list can be used in a CALL MISSING statement to handle the contingency of no match. We don't want the PDV host variables from the last successful FIND operation to be carried forward.
A DoW loop is used to read the data. Since the DATA step only executes once, there is no need to embed the code defining the hash table in an IF _N_=1 condition.
Note that the hash merge uses less time than a DATA step MERGE (assuming the sorts are needed), but it does use more memory - again, as expected. Note however that it uses much less memory than the SQL step - roughly 1/3rd the memory. In addition, as discussed in section 11.5: Data Portion (Hash Index) and 11.5.1: Joining Data of the book, the amount of memory needed by a hash merge can be substantially reduced by storing a record key or pointer in the hash table instead of the full set of data columns.
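The record-pointer idea can be sketched as follows (this is our sketch, not code from the book or the logs here): only GAME_SK and an observation number are stored in the hash table, and the matching GAMES row is then read directly with the POINT= option. Since the two data sets in this example have exactly matching keys, the sketch outputs only matched rows:

```sas
data hashMergeRID(drop=rid);
   /* load only the key and the observation number into the hash table,
      instead of all the GAMES columns */
   dcl hash g();
   g.defineKey("Game_SK");
   g.defineData("rid");
   g.defineDone();
   do rid = 1 to nGames;
      set dw.games(keep=game_sk) point=rid nobs=nGames;
      g.add();
   end;
   do until(lr);
      set dw.pitches end=lr;
      if g.find() = 0 then set dw.games point=rid;  /* direct-access read */
      output;
   end;
   stop;
run;
```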
The hash merge has another possible advantage over both the SQL and MERGE approaches - no sorting is needed (while not obvious, SQL often has to sort the data - we just don't see that explicitly). This can be particularly valuable if the original order of the main data sets (in this case PITCHES) is to be preserved.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72   
73   proc sql noprint;
74      select name into:gamesColumns separated by ", "
75      from dictionary.columns
76      where libname = "DW"
77        and memname = "GAMES"
78        and upcase(name) ne "GAME_SK"
79      ;
80   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              5317.18k
      OS Memory           35236.00k
      Timestamp           01/01/2019 03:47:58 PM
      Step Count                        68  Switch Count  0
      Page Faults                       0
      Page Reclaims                     47
      Page Swaps                        0
      Voluntary Context Switches        13
      Involuntary Context Switches      1
      Block Input Operations            0
      Block Output Operations           0
81   
82   data hashMerge;
83      dcl hash games(dataset:"dw.games");
84      games.defineKey("Game_SK");
85      games.defineData(all:"yes");
86      games.defineDone();
87      do until(lr);
88         set dw.pitches end=lr;
89         call missing(&gamesColumns);
90         games.find();
91         output;
92      end;
93      stop;
94      if 0 then set dw.games;
95   run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.HASHMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
      real time           2.63 seconds
      user cpu time       0.12 seconds
      system cpu time     0.75 seconds
      memory              1674.57k
      OS Memory           31148.00k
      Timestamp           01/01/2019 03:48:01 PM
      Step Count                        69  Switch Count  5
      Page Faults                       3
      Page Reclaims                     433
      Page Swaps                        0
      Voluntary Context Switches        6435
      Involuntary Context Switches      818
      Block Input Operations            512
      Block Output Operations           357872
It goes without saying that it is necessary to confirm that these three techniques all produce the same result sets - allowing for differences in the column and row order. The following code was used to verify that all three of these produced the same result set and is included in the attached sample program.
Since both the SQL and MERGE results were the same as the hash results, the transitive property of comparing sets means we need not compare the SQL and MERGE results directly.
/* the order of the rows and columns may be different, sort on unique keys
   so PROC COMPARE can be used to validate the data are the same.
*/
proc sort data=sqlmerge out=sqlmerge;
   by game_sk inning top_bot ab_number pitch_number;
run;
proc sort data=dsmerge out=dsmerge;
   by game_sk inning top_bot ab_number pitch_number;
run;
proc sort data=hashmerge out=hashmerge;
   by game_sk inning top_bot ab_number pitch_number;
run;
proc compare data=sqlMerge compare=hashMerge;
run;
proc compare data=dsMerge compare=hashMerge;
run;
We chose to initially present this example as a left join in order to facilitate the analogy to table lookup (as mentioned in the hash merge discussion). In point of fact, the two data sets we joined/merged have exactly the same set of matching keys. So we could have done an inner join. The following log shows the results of an inner join using SQL, with an additional option (_METHOD) that tells us how SQL implemented the join.
Of particular interest is that for this join the SQL optimizer has decided to use a hash method. As a result, it uses less time and memory - with values comparable to the hash merge approach. The sqxjhsh method only applies to equi/inner joins, and its use can depend on the value of other options (e.g., BUFFERSIZE as well as the magic parameter). That is a topic for another day.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72   
73   proc sql noprint _method;
74      /* Illustrate that an inner/equijoin actually uses a hash method */
75      select catx('.',memname,name) into:gamesColumns separated by ", "
76      from dictionary.columns
77      where libname = "DW"
78        and memname = "GAMES"
79        and upcase(name) ne "GAME_SK"
80      ;
NOTE: SQL execution methods chosen are:
      sqxslct
          sqxfil
              sqxsob( DICTIONARY.COLUMNS )
81      create table sqlMerge as
82      select pitches.*
83            ,&gamesColumns
84      from dw.pitches
85          ,dw.games
86      where games.game_sk = pitches.game_sk
87      ;
NOTE: SQL execution methods chosen are:
      sqxcrta
          sqxjhsh
              sqxsrc( DW.PITCHES )
              sqxsrc( DW.GAMES )
NOTE: Table WORK.SQLMERGE created, with 875764 rows and 24 columns.
88   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.56 seconds
      user cpu time       0.08 seconds
      system cpu time     0.76 seconds
      memory              5732.96k
      OS Memory           35492.00k
      Timestamp           01/01/2019 03:41:08 PM
      Step Count                        62  Switch Count  4
      Page Faults                       4
      Page Reclaims                     278
      Page Swaps                        0
      Voluntary Context Switches        6794
      Involuntary Context Switches      2572
      Block Input Operations            552
      Block Output Operations           357384
Before concluding, we wanted to share the hash merge code for an equi/inner join. It is very similar to the code presented above. The differences are: the return code of the FIND method is checked so a row is output only when a match is found; and since non-matching rows are never output, the CALL MISSING statement (and thus the SQL step that builds the macro variable listing the GAMES columns) is no longer needed.
The performance is similar (as expected) to the hash merge left join as well as the SQL equi/inner join. And as in the above examples, the hash merge uses roughly 1/3rd the memory of the sqxjhsh method.
The SAS log follows.
1    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72   
73   data hashMerge;
74      /* hash merge inner join */
75      dcl hash games(dataset:"dw.games");
76      games.defineKey("Game_SK");
77      games.defineData(all:"yes");
78      games.defineDone();
79      do until(lr);
80         set dw.pitches end=lr;
81         if games.find() = 0 then output;
82      end;
83      stop;
84      if 0 then set dw.games;
85   run;
NOTE: There were 2880 observations read from the data set DW.GAMES.
NOTE: There were 875764 observations read from the data set DW.PITCHES.
NOTE: The data set WORK.HASHMERGE has 875764 observations and 24 variables.
NOTE: DATA statement used (Total process time):
      real time           2.79 seconds
      user cpu time       0.10 seconds
      system cpu time     0.80 seconds
      memory              1700.07k
      OS Memory           30380.00k
      Timestamp           01/01/2019 08:21:35 PM
      Step Count                        58  Switch Count  5
      Page Faults                       3
      Page Reclaims                     535
      Page Swaps                        0
      Voluntary Context Switches        6548
      Involuntary Context Switches      25
      Block Input Operations            504
      Block Output Operations           359032
Performance comparisons are always tricky. As stated above, the results depend on many factors, and the above is just one comparison for one set of data. So take it for what it is worth.
Feel free to download the sample program and run your own comparisons in your environment. Please post any results you find to be interesting in the comments. Likewise, feel free to modify the program on your own data.
Suffice it to say that there are lots of other options and techniques that could be leveraged to address joining data efficiently. As mentioned above, please let us know your thoughts and perhaps that will lead to a more detailed article about those techniques.