<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files) in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580409#M164872</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One quick question. In your response you said:&lt;/P&gt;&lt;P&gt;"on 64-bit systems, the minimum hash item size is S=48 bytes regardless of L (summary length L of the key and data variables), as long as &lt;SPAN style="font-weight: bold;"&gt;L &amp;lt;= 16.&lt;/SPAN&gt; &lt;SPAN&gt;Adding just one byte makes it S=64, which remains such all the way to L=32..."&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;So it is better to limit the summary length L if I have a large number of key/data variables in a hash table, right? By L &amp;lt;= 16, do you mean the size (in bytes) of the key and data variables?&lt;/P&gt;&lt;P&gt;Is there a quick way to check the summary length L? In my case, for example, the id variable is "34182374", prvdr_num is "440027", and thru_dt is "2013-04-01"; does that mean my length is 8 + 6 + 10 = 24?&lt;/P&gt;&lt;P&gt;DonH&amp;nbsp;suggested concatenating the key variables into a composite key; I don't understand how that limits L. Assuming I have multiple keys, does concatenation turn the combination into a single value?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Ruolin&lt;/P&gt;</description>
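One quick way to answer the "how do I check L" part of the question above: the hash object exposes an ITEM_SIZE attribute (SAS 9.4) that reports the per-item size S in bytes, overhead included, so there is no need to tally variable lengths by hand. A minimal sketch, with hypothetical variable lengths modeled on the question (note that a numeric SAS date occupies 8 bytes regardless of its displayed width, so "2013-04-01" contributes 10 bytes only if it is stored as a character string):

```sas
data _null_ ;
  /* hypothetical variables sized as in the question above */
  length id 8 thru_dt 8 prvdr_num $ 6 ;
  call missing (id, thru_dt, prvdr_num) ;
  dcl hash h() ;
  h.definekey ('id') ;
  h.definedata ('thru_dt', 'prvdr_num') ;
  h.definedone () ;
  s = h.item_size ;  /* per-item size S in bytes, overhead included */
  put s= ;
run ;
```

With these lengths (L = 8 + 8 + 6 = 22), a 64-bit system should report S=64 per the figures quoted above.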
    <pubDate>Sun, 11 Aug 2019 16:15:51 GMT</pubDate>
    <dc:creator>rowlinglu</dc:creator>
    <dc:date>2019-08-11T16:15:51Z</dc:date>
    <item>
      <title>SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579784#M164634</link>
      <description>&lt;DIV class="votecell post-layout--left"&gt;&lt;DIV class="js-voting-container grid fd-column ai-stretch gs4 fc-black-200"&gt;&lt;DIV class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class="postcell post-layout--right"&gt;&lt;DIV class="post-text"&gt;&lt;P&gt;Here are two sample datasets (they are all fake datasets) outpatient&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://drive.google.com/open?id=179_L_qnZdKY5-EZnwy4BEzyEZP-VT-in" target="_blank" rel="nofollow noopener noreferrer"&gt;https://drive.google.com/open?id=179_L_qnZdKY5-EZnwy4BEzyEZP-VT-in&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;inpatient&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://drive.google.com/open?id=1vhUa_yTflLEXSR6xdZG_hOYxCPRnXVkw" target="_blank" rel="nofollow noopener noreferrer"&gt;https://drive.google.com/open?id=1vhUa_yTflLEXSR6xdZG_hOYxCPRnXVkw&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;I did not distinguish between filtered/master here, but it's easy to create filters (any random filter that restrict the dataset to a smaller one will serve as an example, it will work even if you don't distinguish, just duplicate and rename one as master and another as filtered)&lt;/P&gt;&lt;P&gt;I have two datasets that look like this, study inter-hospital transfer My dataset is enormous. I am able to do the whole thing in SAS, but is very very very slow :((( I will show my code here, but I am seeking ways to improve running time.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;  master_inpatient
   ID   admsn_dt    thru_dt      prvdr_num       
    341   2013-04-01  2013-04-02    G
    230   2013-06-01  2013-06-03    I
    232   2013-07-31  2013-07-31    F
    124   2013-04-29  2013-04-29    C
    232   2013-07-31  2013-08-20    Q

  filtered_inpatient
   ID   admsn_dt    thru_dt      prvdr_num       
    341   2013-04-01  2013-04-02    G
    232   2013-07-31  2013-07-31    F
    232   2013-07-31  2013-08-20    Q

   master_outpatient
   ID     thru_dt     prvdr_num
    348   2013-09-23   Z
    124   2013-04-29   A
    331   2013-06-14   G
    439   2013-02-01   B
    331   2013-06-14   D

   filtered_outpatient
   ID     thru_dt     prvdr_num
    124   2013-04-29   A
    331   2013-06-14   G
    439   2013-02-01   B
331   2013-06-14   D&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I have two master datasets (an inpatient dataset and an outpatient dataset) and two filtered datasets: some filter on diagnosis (e.g., including only patients with a diagnosis of TB) is applied to each master dataset, making the filtered dataset shorter than its master. ID is the patient ID, admsn_dt is the day a patient is admitted to a hospital, and thru_dt is the day the patient is discharged/transferred. Outpatient only has a thru_dt because in an outpatient setting you don't need to be admitted to the hospital to be treated. Imagine that you can be transferred from an outpatient setting (ER) to an inpatient setting, from an inpatient setting to an outpatient setting (ER), from an outpatient setting (ER) to an outpatient setting (ER), or from an inpatient setting to an inpatient setting. As a result, there are four types of transfer across the two datasets.&lt;/P&gt;&lt;P&gt;I want the filtered datasets (filtered_inpatient or filtered_outpatient) to be the origin and the master datasets (master_inpatient and master_outpatient) to be the destination, because a patient needs to satisfy the diagnosis filter first, and then what we care about is where he/she transferred (the patient doesn't need to have that diagnosis at the destination). In sum, the four transfer types are: if outpatient --&amp;gt; inpatient: filtered_outpatient(ID, thru_dt) --&amp;gt; master_inpatient(ID, admsn_dt); if outpatient --&amp;gt; outpatient: filtered_outpatient(ID, thru_dt) --&amp;gt; master_outpatient(ID, thru_dt); if inpatient --&amp;gt; inpatient: filtered_inpatient(ID, thru_dt) --&amp;gt; master_inpatient(ID, admsn_dt); if inpatient --&amp;gt; outpatient: filtered_inpatient(ID, thru_dt) --&amp;gt; master_outpatient(ID, thru_dt).&lt;/P&gt;&lt;P&gt;What I'd like is to obtain a third dataset that keeps a pair only if the prvdr_num (provider number) values are different and the difference in dates is at most 1 day (0 or 1). transtype indicates the type of transfer: from inpatient to outpatient is inpout, for example.&lt;/P&gt;&lt;P&gt;The final dataset should look something like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;   df3
   ID   fromdate     todate     from_prvdr  to_prvdr    d     transtype
    124   2013-04-29   2013-04-29  C           A          0      inpout
    232   2013-07-31   2013-07-31  F           Q          0      inpinp
    331   2013-06-14   2013-06-14  G           D          0      outout&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Another thing is that, when matching within file, it's highly likely that you get something like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;ID   fromdate     todate       from_prvdr    to_prvdr
1    3/30/2011    3/31/2011    43291         48329
1    3/31/2011    3/30/2011    48329         43291

OR 

ID   fromdate     todate       from_prvdr    to_prvdr
1    3/31/2011    3/31/2011    43291         48329
1    3/31/2011    3/31/2011    48329         43291

(In this latter case I can just exclude duplicates by date later in R, but I need to get rid of the first case.)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here is what I tried (and it succeeded):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;/* this is an example of outpatient --&amp;gt; inpatient */
/* all variables in master datasets have an i prefix */

proc sort data= etl.master_inpatient;
    by iID iadmsn_dt; 
run;

proc sort data= etl.filtered_outpatient;
    by ID thru_dt; 
run;

data fnl.matchdate_inpinp;
   set etl.master_inpatient end = eof;
      do p = 1 to num;
         set etl.filtered_outpatient nobs = num point = p;
         if iID = ID then do;
            d = abs(iadmsn_dt - thru_dt);
            put iID= ID= iadmsn_dt= thru_dt= d=;
            if d &amp;lt;= 1 then output;
         end;
      end;
      put '===========================';
   if eof then stop;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;There is no error in the code, but I have to do this separately for the four types of transfer and merge the results together in R later. It took me more than two days to finish running one year's data; I really want something more efficient, as I have 8 years of data.&lt;/P&gt;&lt;P&gt;Also, as I said, when matching within a file it is likely that we get some repetitive results (as described above); I really hope this can be solved.&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 08 Aug 2019 03:39:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579784#M164634</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-08T03:39:40Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579787#M164635</link>
      <description>&lt;P&gt;Something along the lines of&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table fnl.matchdate_inpinp as
select 
    a.ID,
    thru_dt as fromdate,
    iadmsn_dt as todate,
    a.prvdr_num as from_prvdr,    
    b.prvdr_num as to_prvdr,
    abs(iadmsn_dt-thru_dt) as d
from 
    etl.filtered_outpatient as a inner join
    etl.master_inpatient as b on a.ID=b.iID and abs(iadmsn_dt-thru_dt) &amp;lt;= 1;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;would be way more efficient.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 04:24:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579787#M164635</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2019-08-08T04:24:12Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579790#M164638</link>
      <description>&lt;P&gt;If you want to keep the same merge logic, but use a hash table instead of the dreadfully slow POINT= method, this could work:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data FNL.MATCHDATE_INPINP;
  set ETL.MASTER_INPATIENT;
  if 0 then set ETL.FILTERED_OUTPATIENT;
  dcl hash F_OUT(dataset:'ETL.FILTERED_OUTPATIENT',multidata:'y');
  F_OUT.defineKey('ID');
  F_OUT.defineData('THRU_DT','PRVDR_NUM');
  F_OUT.defineDone();
  F_OUT.reset_dup();
  do while(F_OUT.do_over(key:iID) eq 0);
    DIF = abs(IADMSN_DT-THRU_DT);
    if DIF &amp;lt;= 1 then output;
    putlog iID= ID= IADMSN_DT= THRU_DT= DIF =;
  end;
  putlog '===========================';
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 05:09:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579790#M164638</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-08-08T05:09:57Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579793#M164639</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/278677"&gt;@rowlinglu&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;
&lt;P&gt;Do heed what&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt;&amp;nbsp;has said.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The reason your code runs like molasses is that you're doing your lookup by reading all the records from filtered_outpatient &lt;EM&gt;from disk&lt;/EM&gt;&amp;nbsp;&lt;EM&gt;for every record&lt;/EM&gt; from master_inpatient. This is not a reasonable way to organize a table lookup.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First, you have the files sorted, and yet your algorithm takes no advantage of their order, while you could simply use the MERGE statement since this is what SAS offers for sequential matching of ordered files. However, the solution &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt;&amp;nbsp;has offered does not even have to have the files sorted beforehand - SQL will do it behind-the-scenes, if need be, or choose a more efficient tactic (on the inner join, it will most likely store the smaller file in a hash table and look it up using the direct-addressing hash algorithm in memory for every record from the other file). The same can be also done in the DATA step explicitly by using the hash object or, for example, a key-indexed array.&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 05:19:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579793#M164639</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-08T05:19:46Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579887#M164661</link>
      <description>&lt;P&gt;Thank you&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;! If I use the MERGE statement, I still need to prefilter the file, right? Which way do you think would be the quickest?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, how does the MERGE statement work with dates? (I am matching dates with a difference of at most 1.)&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 15:06:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579887#M164661</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-08T15:06:29Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579889#M164662</link>
      <description>&lt;P&gt;Thank you very much&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;, I will look into your code and let you know if it works!&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 14:24:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579889#M164662</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-08T14:24:12Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579890#M164663</link>
      <description>&lt;P&gt;Thank you!&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt;&amp;nbsp;Do I need to pre-filter the dataset?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 14:24:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579890#M164663</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-08T14:24:56Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579910#M164670</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/278677"&gt;@rowlinglu&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;
&lt;P&gt;MERGE does the matching, and you just need to indicate to SAS the matching condition using the IN= data set option. Also, the merge key variable should be the same on both files (and it's the only variable that should be the same, otherwise it's a job for UPDATE rather than MERGE). In your case:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data master_inpatient ;                                                            
  input iID (iadmsn_dt ithru_dt) (:yymmdd10.) iprvdr_num $1. ;                     
  format iadmsn_dt ithru_dt yymmdd10. ;                                            
  cards ;                                                                          
341   2013-04-01  2013-04-02    G                                                  
230   2013-06-01  2013-06-03    I                                                  
232   2013-07-31  2013-07-31    F                                                  
124   2013-04-29  2013-04-29    C                                                  
232   2013-07-31  2013-08-20    Q                                                  
run ;                                                                              
                                                                                   
data filtered_outpatient ;                                                         
  input ID thru_dt :yymmdd10. prvdr_num $1. ;                                      
  format thru_dt yymmdd10. ;                                                       
  cards ;                                                                          
124   2013-04-29   A                                                               
331   2013-06-14   G                                                               
439   2013-02-01   B                                                               
331   2013-06-14   D                                                               
run ;                                                                              
                                                                                   
proc sort data= master_inpatient;                                                  
    by iID iadmsn_dt;                                                              
run;                                                                               
proc sort data= filtered_outpatient;                                               
    by ID thru_dt;                                                                 
run;                                                                               
                                                                                   
data want ;                                                                        
  merge master_inpatient (in = mip) filtered_outpatient (rename=(ID=iID) in=fop) ; 
  by iID ;                                                                         
  if mip and fop ;                                                                 
  d = abs (iadmsn_dt - thru_dt) ;                                                  
  if d &amp;lt;= 1 ;                                                                      
run ;                                                                              
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Which will produce output similar to&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt;' SQL if your ID to iID relationship is one-to-one or one-to-many.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 15:27:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579910#M164670</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-08T15:27:26Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579974#M164699</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;&amp;nbsp;Paul,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you so much! This algorithm is a lot faster when I test it on a sample dataset, and it makes more sense to me, too. I will accept this as the answer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Ruolin&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2019 19:28:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/579974#M164699</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-08T19:28:40Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580054#M164739</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/278677"&gt;@rowlinglu&lt;/a&gt;&amp;nbsp;&lt;SPAN&gt;Ruolin:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I appreciate it, but I have to say this: what I offered you wasn't intended as a solution but as an illustration of how MERGE would work in your case, since you had asked about it. If it happens to work as a solution to your problem, that's fine, too. However, the solutions offered by&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;(and intended as solutions) are in fact better and faster than MERGE. This is because MERGE needs sorting, which means a lot of work and resource usage for SAS in the case of large files.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;At the same time,&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;'s program doesn't have to sort at all due to the nature of his algorithm.&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/462"&gt;@PGStats&lt;/a&gt;' program most likely doesn't sort, either, by internal default, for if the lookup file's keys and data can fit in memory, the SQL optimizer will opt for the SQXJHSH access method. If it happens to sort behind-the-scenes, you can see it by adding the _METHOD option to the SQL statement and turning the system option MSGLEVEL=I on; if you see the SQXJSRT access method reported in the log, then sorting is done on the file indicated. But SQL can be forced to avoid the sorting and use an internal hash table instead by coding the option MAGIC=103 with the SQL statement.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
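A sketch of the diagnostics just described, with the dataset names taken from this thread (MAGIC= is the undocumented option mentioned above, so treat it as such):

```sas
options msglevel = i ;

proc sql _method magic = 103 ;  /* _METHOD prints the access methods chosen; MAGIC=103 nudges SQL toward the hash join */
  create table check as
  select a.ID
       , b.iadmsn_dt
  from etl.filtered_outpatient a
       inner join etl.master_inpatient b
       on a.ID = b.iID ;
quit ;
```

With MSGLEVEL=I on, look for SQXJHSH (in-memory hash join) or SQXJSRT (sort) among the access methods reported in the log.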
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Bottom line, if I were in your shoes, I'd rather accept the solutions other than MERGE. I don't know if the policy of this site allows only one offer to be chosen as a solution. If it's the case, it's flat wrong because there's almost never such a thing in SAS as only one best solution to a given problem, except in cases when the OP specifically wants the code to be the short&lt;EM&gt;est&lt;/EM&gt;, the simpl&lt;EM&gt;est&lt;/EM&gt;, the fast&lt;EM&gt;est&lt;/EM&gt;, or anything else labeled as "&lt;EM&gt;est&lt;/EM&gt;". The people learning how to do things in SAS here benefit much more from a variety of solutions than from the (misguided) perception that if a solution is accepted, then this is the only one correct and acceptable.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Just my $.02. &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Kind regards, Paul D&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2019 03:21:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580054#M164739</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-09T03:21:26Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580138#M164770</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;&amp;nbsp;Thank you very much for pointing that out. I don't think there is a way to accept multiple solutions. I chose MERGE because it turned out my dataset was already sorted beforehand, and MERGE seems to work quickly. However, you are right: other people who see this question in the future will probably need to sort beforehand, and it's possible that they will only look at the "solution". I will compare the algorithms and select a solution later. Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2019 14:22:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580138#M164770</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-09T14:22:46Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580291#M164819</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;I got&amp;nbsp;"&lt;SPAN&gt;Insufficient memory to data step program. The SAS system stopped processing this step because of insufficient memory." when using the hash approach. Is there any way to minimize memory usage? My master dataset is too large (the filtered one is okay). I have already increased the SAS memory size to the max.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2019 22:39:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580291#M164819</guid>
      <dc:creator>rowlinglu</dc:creator>
      <dc:date>2019-08-09T22:39:36Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580316#M164834</link>
      <description>&lt;P&gt;When using already-sorted whole data sets, a merge is normally faster than any other method.&lt;/P&gt;
&lt;P&gt;If you want to be able to load the hash table in memory, you need to increase MEMSIZE and/or decrease the data size, for example with 4-byte dates and right-sized strings ($1, it seems, in your case).&amp;nbsp; &lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In some cases hash tables are just not a possible solution when the size of data is too large.&lt;/P&gt;</description>
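The size reductions suggested above can be sketched as follows. One caveat worth hedging: numeric variables re-expand to 8-byte doubles in the PDV, and the hash object stores PDV host variables, so a 4-byte date mainly shrinks the dataset being read; for the hash footprint itself, right-sizing character variables is where the larger savings tend to be.

```sas
/* hypothetical "slimmed" copy; names follow this thread */
data work.filtered_outpatient_small ;
  length prvdr_num $ 1 thru_dt 4 ;  /* $1 provider code; 4-byte date (SAS date values fit exactly) */
  set etl.filtered_outpatient (keep = ID thru_dt prvdr_num) ;
run ;
```

Dropping unneeded variables with KEEP= is usually the first and safest step before touching lengths.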
      <pubDate>Sat, 10 Aug 2019 02:58:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580316#M164834</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-08-10T02:58:40Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580322#M164838</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/278677"&gt;@rowlinglu&lt;/a&gt;:&lt;/P&gt;
&lt;P&gt;If the filtered one is okay, you shouldn't have any problems whatsoever with the hash memory footprint, even without increasing MEMSIZE to the max, because it is the data from the filtered one that is loaded into the hash table. With memory sizes nowadays routinely observed even on basic laptops, you'd most likely have no problem with loading the entire master file into a hash table, either.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In this case, the problem lies elsewhere:&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;forgot (I'm sure, purely accidentally) to add a couple of lines to his hash code. It must have gone unnoticed because your test data set is so small that it didn't matter. I'll reply to&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;separately right after finishing this; if you are curious to know why the code crashed on memory, read that response.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&amp;nbsp; &amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2019 03:24:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580322#M164838</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-10T03:24:45Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580325#M164839</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;:&lt;/P&gt;
&lt;P&gt;In this case, the size of data doesn't matter - it's way too small to crash the hash on memory. What&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/278677"&gt;@rowlinglu&lt;/a&gt;&amp;nbsp;didn't notice is that you had forgotten (beyond a shadow of a doubt, purely absent-mindedly) to include the _N_=1 condition to qualify the hash object declaration and instantiation. Should be:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data FNL.MATCHDATE_INPINP;
  set ETL.MASTER_INPATIENT;&lt;BR /&gt;  if _n_ = 1 then do ;
    if 0 then set ETL.FILTERED_OUTPATIENT;
    dcl hash F_OUT(dataset:'ETL.FILTERED_OUTPATIENT',multidata:'y');
    F_OUT.defineKey('ID');
    F_OUT.defineData('THRU_DT','PRVDR_NUM');
    F_OUT.defineDone();&lt;BR /&gt;  end ;
  F_OUT.reset_dup();
  do while(F_OUT.do_over(key:iID) eq 0);
    DIF = abs(IADMSN_DT-THRU_DT);
    if DIF &amp;lt;= 1 then output;
   * putlog iID= ID= IADMSN_DT= THRU_DT= DIF =;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;Without _N_=1, not only is a new hash object instance created for each record read from&amp;nbsp;&lt;/FONT&gt;&lt;CODE class=" language-sas"&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;MASTER_INPATIENT, but it also gets fully loaded with the data from FILTERED_OUTPATIENT, while the previously created instances never get deleted. 8M instances of F_OUT would crash the memory even if their key and data portions were utterly empty.&lt;/FONT&gt;&lt;/CODE&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Having said that, you're 100% right about the need to keep the memory-resident nature of the hash object in mind and take reasonable measures to control the lengths of the key and data portions. Saying "reasonable" because the hash item size in bytes isn't a smoothly increasing function of the summary length L of the key and data variables but a step-wise one. For example, on 64-bit systems, the minimum hash item size is S=48 bytes regardless of L, as long as L &amp;lt;= 16. Adding just one byte makes it S=64, which remains such all the way to L=32, after which adding a single byte makes it S=80, and so on in S increments of 16. (The minimum S value of 48 has been my hash object pet peeve since&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13569"&gt;@DonH&lt;/a&gt;&amp;nbsp;and I ran into a practically untenable client-side situation given the available RAM resources.)&amp;nbsp;&lt;/P&gt;
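&lt;P&gt;Purely as an illustration, that step function (assuming the 64-bit base of 48 bytes and 16-byte increments described above) can be tabulated in a trivial DATA step:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  /* base 48 bytes, plus one 16-byte step for each 16 bytes of L past 16 */
  do L = 8, 16, 17, 32, 33, 48;
    S = 48 + 16 * max(0, ceil((L - 16) / 16));
    put L= S=;  /* L=16 S=48, L=17 S=64, L=33 S=80, ... */
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;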
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;p.s.&amp;nbsp;As a side note, even if that had been fixed and the OP tried the code without commenting out PUTLOG, the program would have crashed by overfilling the log ;).&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2019 04:14:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580325#M164839</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-10T04:14:08Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580344#M164844</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt; Yes I wrote that code quickly and without data since no usable data was provided. Sorry about the oversight and thank you for fixing it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Very interesting point about the item size increasing in steps. If I recall correctly, it's even worse than what you describe (I looked into this a good while ago, and probably in much less depth than you have).&lt;/P&gt;
&lt;P&gt;On my 32 bit test system, using the wonky macro from &lt;A href="http://support.sas.com/kb/34/193.html" target="_blank"&gt;http://support.sas.com/kb/34/193.html&lt;/A&gt; gives me&lt;/P&gt;
&lt;P&gt;a &lt;STRONG&gt;32-byte&lt;/STRONG&gt; row size for one NUM key of length 8 and one CHAR data of length 8 (so &lt;STRONG&gt;16 bytes&lt;/STRONG&gt; in total)&lt;/P&gt;
&lt;P&gt;but&lt;/P&gt;
&lt;P&gt;a &lt;STRONG&gt;40-byte&lt;/STRONG&gt; row size for one NUM key of length 3 and one CHAR data of length 9 (so &lt;STRONG&gt;12 bytes&lt;/STRONG&gt; in total)&lt;/P&gt;
&lt;P&gt;so even a summary variable length of less than 16 bytes can make the hash item size jump to the next increment.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The increment is 8 bytes on 32-bit systems. A 16-byte increment is very wasteful indeed.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
</description>
      <pubDate>Sat, 10 Aug 2019 11:13:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580344#M164844</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-08-10T11:13:50Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580357#M164850</link>
      <description>&lt;P&gt;On the issue of the &lt;EM&gt;step function&lt;/EM&gt; nature of the key size, that is one of the reasons that&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;&amp;nbsp;and I have gotten into the habit of creating an alternative variable to use as a composite key. Suppose your keys are K1,K2,...Kn, you can create a field as such:&lt;BR /&gt;&lt;BR /&gt;_hKey = md5(catx(":",K1,K2, .... Kn));&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This creates a short key value regardless of the number, type and length of the key variables. And since MD5 computes quickly, creating this variable on any secondary data set to enable the lookup performs quite well.&lt;BR /&gt;&lt;BR /&gt;There are a number of other benefits to this approach as well. Check out &lt;A href="https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1755-2018.pdf" target="_blank" rel="noopener"&gt;Key-Independent Uniform Segmentation of Arbitrary Input Using a Hash Function&lt;/A&gt;&amp;nbsp;presented at SASGF in 2018 for a discussion of the details.&lt;BR /&gt;&lt;BR /&gt;And thanks to&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;&amp;nbsp;for copying me on this thread.&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2019 14:21:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580357#M164850</guid>
      <dc:creator>DonH</dc:creator>
      <dc:date>2019-08-10T14:21:52Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580359#M164852</link>
      <description>&lt;P&gt;I would like to reinforce&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;'s point about evaluating alternative techniques. Check out the communities article &lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Performance-Comparing-SQL-MERGE-and-the-Hash-Object-to-Join/ta-p/523999" target="_blank" rel="noopener"&gt;Performance - Comparing SQL, MERGE and the Hash Object to Join/Merge SAS Tables&lt;/A&gt; which provides a comparison of Merge, SQL and the Hash object for a typical merge.&lt;BR /&gt;&lt;BR /&gt;If you are processing large files and the code is to run repeatedly, it is probably worth the time and effort to evaluate all three of these approaches.&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2019 14:20:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580359#M164852</guid>
      <dc:creator>DonH</dc:creator>
      <dc:date>2019-08-10T14:20:18Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580371#M164857</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;
&lt;P&gt;Exactly: base 32 / increment 8 on 32-bit systems, base 48 / increment 16 on 64-bit ones.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the case I described,&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13569"&gt;@DonH&lt;/a&gt;&amp;nbsp;and I&amp;nbsp;were hit so hard by the 16-byte increment while aggregating with hashes because the hashes for various distinct counts had only one MD5-ed key (16 bytes) and in principle needed no data. But since you can't have a hash without at least one variable in the data portion, the natural idea was to save memory by putting a dummy $1 byte there - which turned out to be futile, since for the resulting L=17 the item size becomes S=64, with 47 bytes per item left over as memory waste. We had a bunch of hashes like that to account for different combos of real compound key variables going into MD5, and since the key was extremely discriminating, each of the hashes could have easily topped 100M items. Simple arithmetic (U=3.8 bytes of RAM per 1 useful byte stored) can explain the resulting frustration.&lt;/P&gt;
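&lt;P&gt;To spell out the arithmetic behind that U figure (using the 64-bit numbers quoted above):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  S = 64;               /* item size for L=17 on a 64-bit system   */
  L = 16 + 1;           /* $16 MD5 key plus the dummy $1 data byte */
  U = round(S / L, .1); /* RAM bytes per useful byte: U=3.8        */
  put U=;
run;&lt;/CODE&gt;&lt;/PRE&gt;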
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;p.s. By comparison, hashing the same "by hand" (i.e. using an array, as in my hash efforts before 9.0) with array items $16 requires only U=1.4 even for a hash table with load factor 0.5 (i.e. half empty, to practically completely avoid primary clustering if linear probing is used to resolve the key collisions). Too bad such a table has to be allocated all at once at compile time and can't grow or shrink at run time.&amp;nbsp; &amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2019 18:16:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580371#M164857</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-10T18:16:19Z</dc:date>
    </item>
    <item>
      <title>Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580372#M164858</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13569"&gt;@DonH&lt;/a&gt;&amp;nbsp;: Thanks for chiming in!&lt;/P&gt;
&lt;P&gt;To give folks an even better idea of the magnitude of RAM saving that can be achieved using this subterfuge, the compound key we had to deal with at the time we ideated this trick could reach 500 bytes in length. Replacing it with its MD5 signature meant a hash key portion only $16 long instead. As they say, feel the difference (especially for a hash table with tens of millions of items).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&amp;nbsp; &amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 10 Aug 2019 18:38:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/SAS-Merge-Data-Based-On-subject-ID-and-Date-Within-and-Across/m-p/580372#M164858</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-10T18:38:28Z</dc:date>
    </item>
  </channel>
</rss>

