rowlinglu
Fluorite | Level 6

@hashman Paul,

 

    Thank you so much! This algorithm is a lot faster when I test it on a sample dataset, and it makes more sense to me, too. I will accept this as the answer.

 

Best,

Ruolin

   

hashman
Ammonite | Level 13

@rowlinglu Ruolin:

 

I appreciate it, but I have to say this: What I offered you wasn't intended as a solution but as an illustration of how MERGE would work in your case, since you had asked about it. If it happens to work as a solution to your problem, that's fine, too. However, the solutions offered by @PGStats and @ChrisNZ (and intended as solutions) are in fact better and faster than MERGE. This is because MERGE needs sorting, which is a lot of work and resource usage for SAS in the case of large files.

 

At the same time, @ChrisNZ's program doesn't have to sort at all due to the nature of his algorithm. @PGStats' program most likely doesn't sort, either, by internal default, for if the lookup file's keys and data can fit in memory, the SQL optimizer will opt for the SQXJHSH access method. If it happens to sort behind-the-scenes, you can see it by adding the _METHOD option to the SQL statement and turning the system option MSGLEVEL=I on; if you see the SQXJSRT access method reported in the log, then sorting is done on the file indicated. But SQL can be forced to avoid the sorting and use an internal hash table instead by coding the option MAGIC=103 with the SQL statement. 
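As a minimal sketch of what that looks like in practice (the table and variable names here are hypothetical, purely for illustration):

```sas
/* Illustration only: BIG, LOOKUP, and key K are hypothetical names. */
options msglevel = i ;          /* extra informational notes in the log */
proc sql _method                /* _METHOD prints the access methods    */
         magic = 103 ;          /* MAGIC=103 requests the hash join     */
  create table joined as
  select a.*, b.d
  from   big a, lookup b
  where  a.k = b.k ;
quit ;
```

If the log shows SQXJHSH, the optimizer used an in-memory hash join; if SQXJSRT appears against a file, that file was sorted behind the scenes.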

 

Bottom line, if I were in your shoes, I'd rather accept the solutions other than MERGE. I don't know if the policy of this site allows only one offer to be chosen as a solution. If that's the case, it's flat wrong, because there's almost never such a thing in SAS as only one best solution to a given problem, except in cases when the OP specifically wants the code to be the shortest, the simplest, the fastest, or anything else labeled as "est". The people learning how to do things in SAS here benefit much more from a variety of solutions than from the (misguided) perception that if a solution is accepted, then it is the only correct and acceptable one.

 

Just my $.02.

Kind regards, Paul D

rowlinglu
Fluorite | Level 6

@hashman Thank you very much for pointing that out. I don't think there is a way to accept multiple solutions. I chose MERGE because it turned out my dataset was already sorted beforehand, and MERGE seems to work quickly. However, you are right: other people seeing this question in the future will probably need to sort beforehand, and it's possible that they will only look at the "solution". I will compare the algorithms and select a solution later. Thanks!

DonH
Lapis Lazuli | Level 10

I would like to reinforce @hashman's point about evaluating alternative techniques. Check out the communities article Performance - Comparing SQL, MERGE and the Hash Object to Join/Merge SAS Tables which provides a comparison of Merge, SQL and the Hash object for a typical merge.

If you are processing large files and the code is to run repeatedly, it is probably worth the time and effort to evaluate all three of these approaches.

ChrisNZ
Tourmaline | Level 20

@rowlinglu 

Here is another Communities article, Study on the best method to join two tables, on the same topic, comparing four join methods:
    - Sort and Merge
    - Index + SET with option KEY=
    - Index + load in memory with SASFILE + SET with option KEY=
    - Hash table

if you are interested.
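For readers unfamiliar with the second method above, here is a rough sketch of an indexed lookup with SET KEY= (datasets A, B and key K are hypothetical, standing in for whatever tables are being joined):

```sas
/* Sketch of the "Index + SET with option KEY=" method.              */
/* A is the left table; B is the lookup table, indexed on key K.     */
proc datasets lib = work nolist ;
  modify b ;
  index create k ;             /* simple index on the join key       */
quit ;

data joined ;
  set a ;
  set b key = k ;              /* direct indexed read of B by K      */
  if _iorc_ ne 0 then do ;     /* key not found in B:                */
    _error_ = 0 ;              /*   suppress the error message       */
    delete ;                   /*   and drop the non-matching row    */
  end ;
run ;
```

Unlike sort-merge, this avoids sorting A altogether, at the cost of one indexed read of B per observation of A.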

hashman
Ammonite | Level 13

@ChrisNZ :

I've read your article ... and the comments. Thanks 1E6 for writing it - lots of work, well thought out and well presented. Methinks its greatest benefit is presenting a number of techniques, side by side, to expand the horizons of those who have used only one since they know how to use it. 

 

As far as the performance comparisons are concerned per se, unfortunately, they're always circumstantial, as they can't account for a variety of different situations involving many confounding factors. The number/lengths of key components, the number/lengths of satellite variables, absolute and relative sizes of the files, number of matching keys, distribution of available system resources (e.g. disk vs memory), execution time limit vs shared resources limits - all come into play. The responsibility of a high-performance SAS programmer is to take all of those into consideration in order to find an optimal solution by using one's experience, the knowledge of different algorithms/techniques, a dose of heuristics, circumstantial business knowledge of the data, etc. .. well, you know it all too well.    

 

As a little bit of example (purely for illustration rather than for any far-drawn conclusions), in your test case, you're joining on a single integer key. That, alone, represents a good deal of test skewness, as it can make the relative cost of table lookup very small compared to the rest of the task components - provided that an appropriate search algorithm is selected to take advantage of the integer nature of the key. None of the methods you've presented would make use of this integer key property; and yet it's possible since it makes the simplest hash function mod(key,tablesize) the fastest. For instance, try this (the outputs are _NULL_ified to make the comparo purer):

data a (keep = k) b ;                                                 
  do d = 1 to 1e7 ;                                                   
    k = floor (ranuni (1) * 1e15) ;                                   
    output a ;                                                        
    if d <= 1e7 / 2 then output b ;                                   
  end ;                                                               
run ;                                                                 
                                                                      
%let h =12100003 ; * first prime > 1.2E7 ;                                                     
data _null_ /*array_hash*/ ;                                          
  array hk [&h] _temporary_ ;                                         
  array hd [&h] _temporary_ ;                                         
  if _n_ = 1 then do until (z) ;                                      
    set b end = z ;                                                   
    link hash ;                                                       
    hk[_n_] = k ;                                                     
    hd[_n_] = d ;                                                     
  end ;                                                               
  set a ;                                                             
  link hash ;                                                         
  if hk[_n_] ne . ;                                                   
  d = hd[_n_] ;                                                       
  return ;                                                            
  hash: do _n_ = 1 + mod (k, &h) by 1 until (hk[_n_] = k or hk[_n_] = .) ;
          if _n_ > &h then _n_ = 1 ;                                  
        end ;                                                         
  return ;                                                            
run ;                                                                 
                                                                      
data _null_ /*canned_hash*/ ;                                         
  if _n_ = 1 then do ;                                                
    if 0 then set b ;                                                 
    dcl hash h (dataset:"b") ;                                        
    h.definekey ("k") ;                                               
    h.definedata ("d") ;                                              
    h.definedone () ;                                                 
  end ;                                                               
  set a ;                                                             
  if h.find() = 0 ;                                                   
run ;                                                                 
                                                                      
proc sort data = a out = sa ;                                         
  by k ;                                                              
run ;                                                                 
proc sort data = b out = sb;                                          
  by k ;                                                              
run ;                                                                 
data _null_ /*sort_merge*/ ;                                          
  merge sa (in = a) sb ;                                              
  by k ;                                                              
  if a ;                                                              
run ;                                                                 

Note that the input data arrangement much favors the sort-merge (the files are long and narrow), and even then, in combined run time, the array-based hash (with the most primitive collision resolution by linear probing) is 33% faster (not to mention the canned hash being blown out of the water). Also note that the array-based hash uses 0.4 and 0.5 times the memory used by the sort and the canned hash, respectively.

 

Kind regards

Paul D.

ChrisNZ
Tourmaline | Level 20

@hashman The tests shown are mainly for illustration purposes as you rightly mention. Each circumstance is different.

 

There are more examples for this same join in the book, and the SC article only shows the best methods *for the data*, *for the result*, *for the criteria* (lowest elapsed time), and *for the hardware*. The variations are endless. Is memory scarce? Is CPU expensive? Should the output be sorted? Is the data in SPDE tables, where we could use SPDE's fantastic sort-on-the-fly feature? And on and on...

 

> the files are long and narrow

I've tried to cover various scenarios by changing the number of columns retrieved (up to 50, so not that narrow), but once again (chorus!) each case is different.

 

> Thanks 1E6 for writing it

😄

