rowlinglu
Fluorite | Level 6

Hi @ChrisNZ 

 

From my understanding of hashman's response, I thought that increasing the variable length L would also increase the size of a hash item once it reaches a certain limit. Can you please explain why decreasing the variable length can make the hash item size jump to the next increment?

 

Thanks,

Ruolin

hashman
Ammonite | Level 13

@rowlinglu

" why decreasing in variable length can make hash item jump to the next increment?"

I don't think @ChrisNZ has said anything of the kind.

 

Judging from your last few posts, you must've developed certain misconceptions with regard to what I've named L and S, so:

  • L is the summary system length, in bytes, of the PDV host variables corresponding to the hash variables defined for the hash table
  • S is the number of memory bytes SAS actually uses for the hash table entry composed of the hash variables

Suppose we have the hash entry defined as:

 

  h.defineKey ("Kc") ;

  h.defineData ("Dc") ;

 

Let's assume that the host variables for the hash variables Kc and Dc are both defined (at compile time) as $1. Then L=1+1=2 bytes. In this case, on a 64-bit system, the hash object will grab S=48 bytes for the hash entry. If you keep increasing the length of either variable (or both), S will stay at 48 as long as length(Kc)+length(Dc)<=16, but at L=17 it will jump to 64. However, L is not the only factor affecting S since, in addition to L, it also depends on:

  • The number of hash variables N: In general, the more variables (at the same L), the greater S can be. This kind of stands to reason, as the more hash variables you have, the more memory overhead is needed to keep track of them in the underlying AVL trees. 
  • Variable data type T: In some cases (but not always), a numeric variable (8 bytes) increases S more than a character variable $8 does.

In sum, the behavior of the function S=F(L, N, T) is very hard to predict - if not impossible, since as yet I haven't seen anyone formulate its exact rules. What can be said about F definitively is that it's always a non-decreasing step-wise function of L and N; and, as a rule of thumb, the shorter L and the smaller N, the smaller the resulting S.
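If you want to see this on your own system, you can query the hash object's ITEM_SIZE attribute directly. A minimal sketch (the variable names are just placeholders, and the exact values of S may differ by SAS release and platform):

data _null_ ;
  length Kc Dc $ 1 ;        /* L = 1 + 1 = 2   */
  length K8 D8 $ 8 ;        /* L = 8 + 8 = 16  */
  length K9 $ 9  D9 $ 8 ;   /* L = 9 + 8 = 17  */
  dcl hash hA () ; hA.defineKey ("Kc") ; hA.defineData ("Dc") ; hA.defineDone () ;
  dcl hash hB () ; hB.defineKey ("K8") ; hB.defineData ("D8") ; hB.defineDone () ;
  dcl hash hC () ; hC.defineKey ("K9") ; hC.defineData ("D9") ; hC.defineDone () ;
  sA = hA.item_size ;       /* expect 48 on a 64-bit system        */
  sB = hB.item_size ;       /* still 48, since L has not passed 16 */
  sC = hC.item_size ;       /* jumps to 64 at L = 17               */
  putlog sA= sB= sC= ;
  call missing (of _character_) ;   /* quiet the uninitialized-variable notes */
run ;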

 

Also, the dependency on N, with the rest of the factors fixed, is quite strong. For example, if you define the key portion with 16 $1 variables K1-K16 and the data portion with a single $16 variable, i.e. with L=32, SAS will set S=176 (sic), with 176-32=144 bytes of memory overhead per hash item. But if you concatenate K1-K16 into a single key K $16 (leaving the data portion intact), it will set S=64, with only 64-32=32 bytes of memory overhead per item. That's 144-32=112 fewer memory bytes wasted per item, which for a hash table with 10 million items saves over 1 GB of otherwise completely wasted memory.
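If you'd like to verify the effect of N yourself, here is a minimal sketch along the lines of the example above (again, the exact values of S may vary by release and bitness):

data _null_ ;
  length K1-K16 $ 1  K D $ 16 ;
  dcl hash many () ;
  many.defineKey ("K1","K2","K3","K4","K5","K6","K7","K8",
                  "K9","K10","K11","K12","K13","K14","K15","K16") ;
  many.defineData ("D") ;
  many.defineDone () ;
  dcl hash one () ;
  one.defineKey ("K") ;              /* the same 16 key bytes as a single variable */
  one.defineData ("D") ;
  one.defineDone () ;
  s_many = many.item_size ;          /* 176 per the discussion above */
  s_one  = one.item_size ;           /*  64 per the discussion above */
  putlog s_many= s_one= ;
  call missing (of _character_) ;    /* quiet the uninitialized-variable notes */
run ;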

 

This nicely dovetails with the idea of saving hash memory by using MD5 to replace the entire key portion containing many hash variables, i.e. a multi-component compound key, with a single 16-byte key. Per the above, its benefits include:

  • Reducing the key portion summary length to 16 bytes regardless of the number and lengths of the original key components (as @DonH has already noted). Thus, the resulting L will always be only 16 plus the total length of the data portion variables.
  • Reducing the number of key hash variables to 1, no matter how many key components there were originally.
  • Doing away with any effect numeric key components may have on S, as all of them are thus eliminated.

 

If all of these factors are present, they have a cumulative effect on the key portion's hash memory footprint; and the only expense is the need to concatenate the key components using CATX or CATQ and compute MD5 (albeit twice: at the time when the table is loaded and then when it is searched).
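To make the mechanics concrete, here is a minimal sketch of the pattern (the table names SMALL and BIG, the key components K1-K3, and the satellite variable D are placeholders, not anything from your data; pick a CATX delimiter that cannot occur inside the key values, or use CATQ, and compute the digest identically on both sides):

data joined ;
  length _md5 $ 16 ;                          /* an MD5 digest is always 16 bytes        */
  if _n_ = 1 then do ;
    dcl hash h () ;                           /* add multidata:"y" if SMALL can have     */
                                              /* duplicate compound keys                 */
    h.defineKey  ("_md5") ;                   /* one $16 key instead of K1, K2, K3       */
    h.defineData ("D") ;                      /* whatever satellite variable(s) you need */
    h.defineDone () ;
    do until (z) ;
      set small end = z ;
      _md5 = md5 (catx (":", K1, K2, K3)) ;   /* digest computed once at load time ...   */
      h.add () ;
    end ;
  end ;
  set big ;
  _md5 = md5 (catx (":", K1, K2, K3)) ;       /* ... and again on the search side        */
  if h.find () = 0 ;                          /* keep only the matching records          */
  drop _md5 ;
run ;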

 

Now as far as the data portion is concerned, the MD5 method cannot be used to reduce the data portion memory footprint for the obvious reason that MD5 is a one-way hash function, and its argument cannot be reconstructed from its MD5 digest. However, you're dealing with a simple join, in which case the entire data portion can be reduced to a single numeric variable. To wit, instead of storing all the satellite variables from the file being joined in the table, you can store only their RID (record identification) number. Then on the search side, if a key is found, you can just use the POINT=RID option to retrieve the values of all the satellite variables you need into the PDV. This method is described in detail, for example, here:

 

https://www.lexjansen.com/nesug/nesug11/ld/ld01.pdf

 

In your specific case, it would mean (I'm using @ChrisNZ's code as a template):

data master_inpatient ;                                             
  input iID (iadmsn_dt ithru_dt) (:yymmdd10.) iprvdr_num :$1. ;
  format iadmsn_dt ithru_dt yymmdd10. ;                             
  cards ;                                                           
341   2013-04-01  2013-04-02    G                                   
230   2013-06-01  2013-06-03    I                                   
232   2013-07-31  2013-07-31    F                                   
124   2013-04-29  2013-04-29    C                                   
232   2013-07-31  2013-08-20    Q                                   
run ;                                                               
                                                                    
data filtered_outpatient ;                                          
  input ID thru_dt :yymmdd10. prvdr_num :$1. ;
  format thru_dt yymmdd10. ;                                        
  cards ;                                                           
124   2013-04-29   A                                                
331   2013-06-14   G                                                
439   2013-02-01   B                                                
331   2013-06-14   D                                                
run ;                                                               
                                                                    
data want ;                                                         
  if _n_ = 1 then do ;                                              
    dcl hash h (multidata:"y");                                     
    h.definekey ("id") ;                                            
    h.definedata ("rid") ;                                          
    h.definedone () ;                                               
    do rid = 1 by 1 until (z) ;                                     
      set filtered_outpatient end = z ;                             
      h.add() ;                                                     
    end ;                                                           
  end ;                                                             
  set master_inpatient (rename = (iID = ID)) ;
  if h.find() = 0 then do until (h.find_next() ne 0) ;              
    set filtered_outpatient (keep = thru_dt prvdr_num) point = rid ;
    dif = abs (iadmsn_dt - thru_dt) ;                               
    if dif <= 1 then output ;                                       
  end ;                                                             
run ;                                                               

Note that since you have only 2 satellite variables here (thru_dt and prvdr_num), using this trick won't win you any prizes. But if, as you say, you have about 20 satellite variables to retrieve, it will make the code much more memory-savvy without any practical diminution in speed, since (a) POINT= is actually very fast and (b) by the nature of the algorithm here, it accesses the records by their obs numbers in order.

 

For rendering joins using the hash object, if you have a compound key with numerous components and many data variables to retrieve, the MD5 method can be used on the key portion side and the RID method on the data portion side. You can find examples of using both, separately and combined, in gory detail in the hash book @DonH and I published a year ago with SAS Press, an excerpt from which you can preview at:

 

http://www.sas.com/storefront/aux/en/sphashtables/69153_excerpt.pdf

 

Kind regards

Paul D.

ChrisNZ
Tourmaline | Level 20

@hashman Thank you for these always very thorough explanations! 

 

@rowlinglu To give you an idea of how complex (unpredictable?) the relationship of data size to memory space is, consider these examples:

 

data T; length I 5; do I=1 to 1e7; output; end; run;
   
data TT;
  length I 5;
  dcl hash H(dataset:'T');
  H.definekey('I');
  H.definedata('I');
  H.definedone();
  I=H.item_size; putlog 'SIZE=' I ;
run;
/*SIZE=48*/
/*      memory              590805.06k*/
/*      OS Memory           622256.00k*/

data TT;            
  length I 5;
  dcl hash H(dataset:'T');
  H.definekey('I','I');
  H.definedone();
  I=H.item_size; putlog 'SIZE=' I ;
run;
/*SIZE=64*/
/*      memory              656369.10k*/
/*      OS Memory           683708.00k*/

So here we have 10 bytes of data (L) in both hash tables, but the item size (S) differs between the tables and increases from 48 to 64.

Even more unexpectedly, while S increases by 33%, the memory used increases by only about 10%.

 

Using a variable length of 8 gives the same results, so reducing the length brings no gains in this case.

 

Trial and error still seems to be the best method when trying to fine-tune the optimisation of a hash table.

 

Just a pebble added to the cairn that @hashman built with his detailed replies.

rowlinglu
Fluorite | Level 6

Paul,

 

    Thank you so, so much for such a detailed reply. I think I finally have an idea of how it works.

 

Best

rowlinglu
Fluorite | Level 6

Hi @hashman 

   

    One quick question. In your response you said: 

  

"on 64-bit systems, the minimum hash item size is S=48 bytes regardless of L (summary length L of the key and data variables), as long as L <= 16. Adding just one byte makes it S=64, which remains such all the way to L=32..."

 

     So it is better to limit the summary length L if I have a large number of key/data variables in a hash table, right? By L<=16, do you mean the total size (in bytes) of the key and data variables?

    Is there a quick way to check the summary length L? In my case, for example, a value of the id variable is "34182374", prvdr_num is "440027", and thru_dt is "2013-04-01"; does that mean my length is 8+6+10=24?

    @DonH suggested concatenating the key variables into a composite key; I don't understand how that limits L. Assuming I have multiple keys here, does that turn the combination into a number?

 

 

Thanks,

Ruolin

DonH
Lapis Lazuli | Level 10

Hi @rowlinglu,

 

To clarify, what I suggested was concatenating the keys and then taking that long character value and converting it to a 16-byte binary string using the MD5 hash function. Note that the hash functions in SAS are independent of the hash object/table. The hash object got its name, in large part, because its internal algorithms use hashing techniques.
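For instance, just to illustrate the fixed 16-byte output, using the sample values you mentioned above (the "|" delimiter is an arbitrary choice):

data _null_ ;
  length digest $ 16 ;
  digest = md5 (catx ("|", "34182374", "440027", "2013-04-01")) ;
  putlog "MD5 digest (hex): " digest $hex32. ;   /* always 16 bytes, however long the input */
run ;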

hashman
Ammonite | Level 13

@rowlinglu:

If the filtered one is okay, you shouldn't have any problems whatsoever with the hash memory footprint, even without increasing MEMSIZE to the max, because it is the data from the filtered one that is loaded into the hash table. With memory sizes nowadays routinely observed even on basic laptops, you'd most likely have no problem with loading the entire master file into a hash table, either. 

 

In this case, the problem lies elsewhere - @ChrisNZ forgot (I'm sure, purely accidentally) to add a couple of lines to his hash code. It must have gone unnoticed because your test data set is so small that it didn't matter. I'll reply to @ChrisNZ separately right after finishing this; if you are curious to know why the code crashed on memory, read that response.

 

Kind regards

Paul D.   

rowlinglu
Fluorite | Level 6

@hashman 

 

First of all, the MERGE statement works because the data was already sorted, so I was able to MERGE my datasets. For an unsorted dataset, I don't think MERGE would be the best solution.

I am pretty new to SAS, and I find the hash way of doing things fascinating.

I thought it was a MEMSIZE problem when I tried @ChrisNZ's solution, but it didn't work when I changed MEMSIZE. Thank you guys so much for spending time figuring this out. I don't have access to the dataset right now, but I will definitely try this on Monday and let you know if it works. I think this will be helpful for people who run into this kind of problem in the future, too! 🙂

Also, @DonH Thank you for pointing me to the article, it is very helpful!!

rowlinglu
Fluorite | Level 6
data FNL.MATCHDATE_INPINP;
  set ETL.MASTER_INPATIENT;
  if _n_ = 1 then do ;
    if 0 then set ETL.FILTERED_OUTPATIENT;


I thought "if 0 then set" forces SAS to define all the variables that are part of FILTERED_OUTPATIENT. Does this syntax work even when master_inpatient and filtered_outpatient have different variables?
I don't think I understand the difference between

data xx;
  if 0 then set filtered_outpatient;

and

data xx;
  set etl.master_inpatient;
  if 0 then set etl.filtered_outpatient;

Does that mean the new dataset created follows the variables in etl.filtered_outpatient? What if I want the new dataset to contain all the variables in etl.master_inpatient and etl.filtered_outpatient?
DonH
Lapis Lazuli | Level 10

The article The SET Statement's Compile Time Functions should answer your question. It was written from the perspective of the use of the hash object. But it does explain the dual role of the SET statement (compile time vs. execution time).
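In a nutshell, each SET statement contributes its data set's variables to the PDV at compile time whether or not it ever executes, so a step that combines a real SET with an "if 0 then set" ends up with the variables of both inputs. A minimal sketch using the two sample data sets from earlier in this thread:

data both ;
  if 0 then set filtered_outpatient ;   /* never executes, but ID, thru_dt and prvdr_num    */
                                        /* are added to the PDV (and the output) at compile */
                                        /* time; their values stay missing unless something */
                                        /* like a hash FIND fills them in                   */
  set master_inpatient ;                /* executes and reads iID, iadmsn_dt, ithru_dt,     */
                                        /* iprvdr_num                                       */
run ;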

rowlinglu
Fluorite | Level 6

For everyone now viewing this question: please look through all the replies posted by @ChrisNZ, @hashman, and @DonH.

hashman
Ammonite | Level 13

@rowlinglu :

Do heed what @PGStats has said. 

 

The reason your code runs like molasses is that you're doing your lookup by reading all the records of filtered_outpatient from disk for every record of master_inpatient. This is not a reasonable way to organize a table lookup.

 

First, you have the files sorted, and yet your algorithm takes no advantage of their order, while you could simply use the MERGE statement, since this is what SAS offers for sequential matching of ordered files. However, the solution @PGStats has offered does not even require the files to be sorted beforehand - SQL will do it behind the scenes, if need be, or choose a more efficient tactic (on the inner join, it will most likely store the smaller file in a hash table and look it up using the direct-addressing hash algorithm in memory for every record from the other file). The same can also be done in the DATA step explicitly by using the hash object or, for example, a key-indexed array.
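For reference, an SQL inner join along those lines might look like this (not necessarily @PGStats's exact code; it just mirrors the matching condition used elsewhere in this thread, applied to the sample data sets):

proc sql ;
  create table want_sql as
  select i.*, o.thru_dt, o.prvdr_num
  from master_inpatient as i
       inner join
       filtered_outpatient as o
       on  i.iID = o.ID
       and abs (i.iadmsn_dt - o.thru_dt) <= 1 ;
quit ;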

 

Kind regards

Paul D.

rowlinglu
Fluorite | Level 6

Thank you @hashman ! If I use the MERGE statement, I still need to prefilter the file, right? Which way do you think would be the quickest?

 

Also, how does the MERGE statement work with dates? (I am matching records whose dates differ by <= 1 day.)

 

hashman
Ammonite | Level 13

@rowlinglu :

MERGE does the matching, and you just need to indicate to SAS the matching condition using the IN= data set option. Also, the merge key variable should be the same on both files (and it's the only variable that should be the same, otherwise it's a job for UPDATE rather than MERGE). In your case:

data master_inpatient ;                                                            
  input iID (iadmsn_dt ithru_dt) (:yymmdd10.) iprvdr_num :$1. ;
  format iadmsn_dt ithru_dt yymmdd10. ;                                            
  cards ;                                                                          
341   2013-04-01  2013-04-02    G                                                  
230   2013-06-01  2013-06-03    I                                                  
232   2013-07-31  2013-07-31    F                                                  
124   2013-04-29  2013-04-29    C                                                  
232   2013-07-31  2013-08-20    Q                                                  
run ;                                                                              
                                                                                   
data filtered_outpatient ;                                                         
  input ID thru_dt :yymmdd10. prvdr_num :$1. ;
  format thru_dt yymmdd10. ;                                                       
  cards ;                                                                          
124   2013-04-29   A                                                               
331   2013-06-14   G                                                               
439   2013-02-01   B                                                               
331   2013-06-14   D                                                               
run ;                                                                              
                                                                                   
proc sort data= master_inpatient;                                                  
    by iID iadmsn_dt;                                                              
run;                                                                               
proc sort data= filtered_outpatient;                                               
    by ID thru_dt;                                                                 
run;                                                                               
                                                                                   
data want ;                                                                        
  merge master_inpatient (in = mip) filtered_outpatient (rename=(ID=iID) in=fop) ; 
  by iID ;                                                                         
  if mip and fop ;                                                                 
  d = abs (iadmsn_dt - thru_dt) ;                                                  
  if d <= 1 ;                                                                      
run ;                                                                              

This will produce output similar to @ChrisNZ's SQL if your ID-to-iID relationship is one-to-one or one-to-many.

 

Kind regards

Paul D. 

 
