rowlinglu
Fluorite | Level 6

@hashman Paul,

 

    Thank you so much! This algorithm is a lot faster when I test it on a sample dataset, and it makes more sense to me, too. I will accept this as the answer.

 

Best,

Ruolin

   

hashman
Ammonite | Level 13

@rowlinglu Ruolin:

 

I appreciate it, but I have to say this: What I offered you wasn't intended as a solution but as an illustration of how MERGE would work in your case, since you had asked about it. If it happens to work as a solution to your problem, that's fine, too. However, the solutions offered by @PGStats and @ChrisNZ (and intended as solutions) are in fact better and faster than MERGE. This is because MERGE needs sorting, which is a lot of work and resource usage for SAS in the case of large files.

 

At the same time, @ChrisNZ's program doesn't have to sort at all due to the nature of his algorithm. @PGStats' program most likely doesn't sort, either, by internal default, for if the lookup file's keys and data can fit in memory, the SQL optimizer will opt for the SQXJHSH access method. If it happens to sort behind-the-scenes, you can see it by adding the _METHOD option to the SQL statement and turning the system option MSGLEVEL=I on; if you see the SQXJSRT access method reported in the log, then sorting is done on the file indicated. But SQL can be forced to avoid the sorting and use an internal hash table instead by coding the option MAGIC=103 with the SQL statement. 
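As a minimal sketch of what that looks like in practice (the table and variable names here are hypothetical, purely for illustration):

```sas
/* Illustration only: BIG, LOOKUP, and key K are hypothetical names. */
options msglevel = i ;          /* extra informational notes in the log */
proc sql _method                /* _METHOD prints the access methods    */
         magic = 103 ;          /* MAGIC=103 requests the hash join     */
  create table joined as
  select a.*, b.d
  from   big a, lookup b
  where  a.k = b.k ;
quit ;
```

If the log shows SQXJHSH, the optimizer used an in-memory hash join; if SQXJSRT appears against a file, that file was sorted behind the scenes.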

 

Bottom line, if I were in your shoes, I'd rather accept the solutions other than MERGE. I don't know if the policy of this site allows only one offer to be chosen as a solution. If that's the case, it's flat wrong, because there's almost never such a thing in SAS as only one best solution to a given problem, except in cases when the OP specifically wants the code to be the shortest, the simplest, the fastest, or anything else labeled as "est". The people learning how to do things in SAS here benefit much more from a variety of solutions than from the (misguided) perception that if a solution is accepted, then it is the only correct and acceptable one.

 

Just my $.02.

Kind regards, Paul D

rowlinglu
Fluorite | Level 6

@hashman Thank you very much for pointing that out. I don't think there is a way to accept multiple solutions. I chose MERGE because it turned out my dataset was already sorted beforehand, and MERGE seems to work quickly. However, you are right: other people seeing this question in the future will probably need to sort beforehand, and it's possible that they will only look at the "solution". I will compare the algorithms and select a solution later. Thanks!

DonH
Lapis Lazuli | Level 10

I would like to reinforce @hashman's point about evaluating alternative techniques. Check out the communities article Performance - Comparing SQL, MERGE and the Hash Object to Join/Merge SAS Tables which provides a comparison of Merge, SQL and the Hash object for a typical merge.

If you are processing large files and the code is to run repeatedly, it is probably worth the time and effort to evaluate all three of these approaches.

ChrisNZ
Tourmaline | Level 20

@rowlinglu 

Here is another Communities article, Study on the best method to join two tables, on the same topic, comparing four join methods:
    - Sort and Merge
    - Index + SET with option KEY=
    - Index + load in memory with SASFILE + SET with option KEY=
    - Hash table

if you are interested.
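For readers unfamiliar with the second method above, here is a rough sketch of an indexed lookup with SET KEY= (datasets A, B and key K are hypothetical, standing in for whatever tables are being joined):

```sas
/* Sketch of the "Index + SET with option KEY=" method.              */
/* A is the left table; B is the lookup table, indexed on key K.     */
proc datasets lib = work nolist ;
  modify b ;
  index create k ;             /* simple index on the join key       */
quit ;

data joined ;
  set a ;
  set b key = k ;              /* direct indexed read of B by K      */
  if _iorc_ ne 0 then do ;     /* key not found in B:                */
    _error_ = 0 ;              /*   suppress the error message       */
    delete ;                   /*   and drop the non-matching row    */
  end ;
run ;
```

Unlike sort-merge, this avoids sorting A altogether, at the cost of one indexed read of B per observation of A.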

hashman
Ammonite | Level 13

@ChrisNZ :

I've read your article ... and the comments. Thanks 1E6 for writing it - lots of work, well thought out and well presented. Methinks its greatest benefit is presenting a number of techniques, side by side, to expand the horizons of those who have used only one since they know how to use it. 

 

As far as the performance comparisons are concerned per se, unfortunately, they're always circumstantial, as they can't account for a variety of different situations involving many confounding factors. The number/lengths of key components, the number/lengths of satellite variables, absolute and relative sizes of the files, number of matching keys, distribution of available system resources (e.g. disk vs memory), execution time limit vs shared resources limits - all come into play. The responsibility of a high-performance SAS programmer is to take all of those into consideration in order to find an optimal solution by using one's experience, the knowledge of different algorithms/techniques, a dose of heuristics, circumstantial business knowledge of the data, etc. .. well, you know it all too well.    

 

As a little bit of example (purely for illustration rather than for any far-drawn conclusions), in your test case, you're joining on a single integer key. That, alone, represents a good deal of test skewness, as it can make the relative cost of table lookup very small compared to the rest of the task components - provided that an appropriate search algorithm is selected to take advantage of the integer nature of the key. None of the methods you've presented would make use of this integer key property; and yet it's possible since it makes the simplest hash function mod(key,tablesize) the fastest. For instance, try this (the outputs are _NULL_ified to make the comparo purer):

data a (keep = k) b ;                                                 
  do d = 1 to 1e7 ;                                                   
    k = floor (ranuni (1) * 1e15) ;                                   
    output a ;                                                        
    if d <= 1e7 / 2 then output b ;                                   
  end ;                                                               
run ;                                                                 
                                                                      
%let h =12100003 ; * first prime > 1.2E7 ;                                                     
data _null_ /*array_hash*/ ;                                          
  array hk [&h] _temporary_ ;                                         
  array hd [&h] _temporary_ ;                                         
  if _n_ = 1 then do until (z) ;                                      
    set b end = z ;                                                   
    link hash ;                                                       
    hk[_n_] = k ;                                                     
    hd[_n_] = d ;                                                     
  end ;                                                               
  set a ;                                                             
  link hash ;                                                         
  if hk[_n_] ne . ;                                                   
  d = hd[_n_] ;                                                       
  return ;                                                            
  hash: do _n_ = 1 + mod (k, &h) by 1 until (hk[_n_] = k or hk[_n_] = .) ;
          if _n_ > &h then _n_ = 1 ;                                  
        end ;                                                         
  return ;                                                            
run ;                                                                 
                                                                      
data _null_ /*canned_hash*/ ;                                         
  if _n_ = 1 then do ;                                                
    if 0 then set b ;                                                 
    dcl hash h (dataset:"b") ;                                        
    h.definekey ("k") ;                                               
    h.definedata ("d") ;                                              
    h.definedone () ;                                                 
  end ;                                                               
  set a ;                                                             
  if h.find() = 0 ;                                                   
run ;                                                                 
                                                                      
proc sort data = a out = sa ;                                         
  by k ;                                                              
run ;                                                                 
proc sort data = b out = sb;                                          
  by k ;                                                              
run ;                                                                 
data _null_ /*sort_merge*/ ;                                          
  merge sa (in = a) sb ;                                              
  by k ;                                                              
  if a ;                                                              
run ;                                                                 

Note that the input data arrangement much favors the sort-merge (the files are long and narrow), and even then, in combined run time, the array-based hash (with the most primitive collision resolution by linear probing) is 33% faster (not to mention the canned hash being blown out of the water). Also note that the array-based hash uses 0.4 and 0.5 times the memory used by the sort and the canned hash, respectively.

 

Kind regards

Paul D.

ChrisNZ
Tourmaline | Level 20

@hashman The tests shown are mainly for illustration purposes as you rightly mention. Each circumstance is different.

 

There are more examples for this same join in the book, and the SC article only shows the best methods *for the data*, *for the result*, *for the criteria* (lowest elapsed time), and *for the hardware*. The variations are endless. Is memory scarce? Is CPU expensive? Should the output be sorted? Is the data in SPDE tables, where we could use SPDE's fantastic sort-on-the-fly feature? And on and on...

 

> the files are long and narrow

I've tried to cover various scenarios by changing the number of columns retrieved (up to 50, so not that narrow), but once again (chorus!) each case is different.

 

> Thanks 1E6 for writing it

😄

