Solved: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

rowlinglu · Posted 08-07-2019 11:39 PM

Here are two sample datasets (they are all fake datasets) outpatient https://drive.google.com/open?id=179_L_qnZdKY5-EZnwy4BEzyEZP-VT-in inpatient https://drive.google.com/open?id=1vhUa_yTflLEXSR6xdZG_hOYxCPRnXVkw I did not distinguish between filtered/master here, but it's easy to create filters (any random filter that restrict the dataset to a smaller one will serve as an example, it will work even if you don't distinguish, just duplicate and rename one as master and another as filtered)

I have two datasets that look like this, study inter-hospital transfer My dataset is enormous. I am able to do the whole thing in SAS, but is very very very slow :((( I will show my code here, but I am seeking ways to improve running time.

  master_inpatient
   ID   admsn_dt    thru_dt      prvdr_num       
    341   2013-04-01  2013-04-02    G
    230   2013-06-01  2013-06-03    I
    232   2013-07-31  2013-07-31    F
    124   2013-04-29  2013-04-29    C
    232   2013-07-31  2013-08-20    Q

  filtered_inpatient
   ID   admsn_dt    thru_dt      prvdr_num       
    341   2013-04-01  2013-04-02    G
    232   2013-07-31  2013-07-31    F
    232   2013-07-31  2013-08-20    Q

   master_outpatient
   ID     thru_dt     prvdr_num
    348   2013-09-23   Z
    124   2013-04-29   A
    331   2013-06-14   G
    439   2013-02-01   B
    331   2013-06-14   D

   filtered_outpatient
   ID     thru_dt     prvdr_num
    124   2013-04-29   A
    331   2013-06-14   G
    439   2013-02-01   B
    331   2013-06-14   D

I have two master datasets: an inpatient dataset and an outpatient dataset, and two filtered datasets: some filter on diganosis (e.g., including only patients with diagnosis of TB) is applied to the master dataset, make the dataset shorter than master dataset ID is patient ID, and admsn_dt is the day you are admitted to a hospital, thru_dt is the day you are discharged/transferred. Outpatient only has a thru_dt because in outpatient setting you don't need to be admitted into the hospital to be treated. Imagine that you can be transferred from an outpatient setting (ER) to an inpatient setting, an inpatient setting to an outpatient setting (ER), an outpatient setting (ER) to an outpatient setting (ER), and an inpatient setting to an inpatient setting (ER). As a result, there are four types of transfer happens in the two dataset.

I want the filtered dataset (filtered_inpatient or filtered_outpatient) to be the origin and master datasets (master_inpatient and master_outpatient) to be the destination because a patient need to be satisfied with some diagnosis, and then what we care is where he/she transferred (the patient don't need to have that diagnosis at the destination) In Sum: The four transfer type is if outpatient --> inpatient: filtered_outpatient(ID, thru_dt)--> master_inpatient(ID, admsn_dt) if outpatient --> outpatient: filtered_outpatient(ID, thru_dt)-->master_outpatient(ID,thru_dt) if inpatient --> inpatient: filtered_inpatient(ID, thru_dt)-->master_inpatient(ID,admsn_dt) if inpatient --> outpatient: filtered_inpatient(ID, thru_dt)-->master_inpatient(ID,thru_dt)

What I'd like to do is to obtain this third dataset, if the prvdr_num (provider number) are different, and the difference in date is less than 1 day (0 or 1). transtype indicate type of transfer: from inpatient to outpatient is inpout, for example.

The final dataset should look something like this:

   df3
   ID   fromdate     todate     from_prvdr  to_prvdr    d     transtype
    124   2013-04-29   2013-04-29  C           A          0      inpout
    232   2013-07-31   2013-07-31  F           Q          0      inpinp
    331   2013-06-14   2013-06-14  G           D          0      outout

Another thing is that, when matching within file, it's highly likely that you get something like this:

ID   fromdate     todate       from_prvdr    to_prvdr
1    3/30/2011    3/31/2011    43291         48329
1    3/31/2011    3/30/2011    48329         43291

OR 

ID   fromdate     todate       from_prvdr    to_prvdr
1    3/31/2011    3/31/2011    43291         48329
1    3/31/2011    3/31/2011    48329         43291

(In this latter case I can just exclude duplicate by date later in R, but I need to get rid of the first case)

Here is what I tried (and succeeded).

#this is an example of outpatient--> inpatient
#all variables in master datasets have an i prefix

proc sort data= etl.master_inpatient;
    by iID iadmsn_dt; 
run;

proc sort data= etl.filtered_outpatient;
    by ID thru_dt; 
run;

data fnl.matchdate_inpinp;
   set etl.master_inpatient end = eof;
      do p = 1 to num;
         set etl.filtered_outpatient nobs = num point = p;
         if iID = ID then do;
            d = abs(iadmsn_dt-thru_dt);
            put iID = ID = iadmsn_dt = thru_dt= d =;

         if d <= 1 then output;
         end;
         else continue;
      end;
      put '===========================';
   if eof then stop;
run;

There is no error in the code, but I have to do this seperately for four types of transfer, and merge them together in R later. I took me more than two days to finish running one year's data, I really want something more efficient as I have 8 year data.

Also, as I said, when matching within file, it is likely that we get some repetitive results (like described above), i really hope this can be solved.

ChrisNZ · Posted 08-08-2019 01:09 AM

If you want to keep the same merge logic, but use a hash table instead of the dreadfully slow POINT= method, this could work:

data FNL.MATCHDATE_INPINP;
  set ETL.MASTER_INPATIENT;
  if 0 then set ETL.FILTERED_OUTPATIENT;
  dcl hash F_OUT(dataset:'ETL.FILTERED_OUTPATIENT',multidata:'y');
  F_OUT.defineKey('ID');
  F_OUT.defineData('THRU_DT','PRVDR_NUM');
  F_OUT.defineDone();
  F_OUT.reset_dup();
  do while(F_OUT.do_over(key:iID) eq 0);
    DIF = abs(IADMSN_DT-THRU_DT);
    if DIF <= 1 then output;
    putlog iID= ID= IADMSN_DT= THRU_DT= DIF =;
  end;
  putlog '===========================';
run;

High-Performance SAS Coding - Third Edition

View solution in original post

PGStats · Posted 08-08-2019 12:24 AM

Something along the lines of

proc sql;
create table fnl.matchdate_inpinp as
select 
    a.ID,
    iadmsn_dt as fromdate,
    thru_dt as todate,
    a.prvdr_num as from_prvdr,    
    b.prvdr_num as to_prvdr,
    abs(iadmsn_dt-thru_dt) as d
from 
    etl.filtered_outpatient as a inner join
    etl.master_inpatient as b on a.ID=b.iID and abs(iadmsn_dt-thru_dt) <= 1;
quit;

would be way more efficient.

PG

rowlinglu · Posted 08-08-2019 10:24 AM

Thank you! @PGStats Do i need to pre filter the dataset?

ChrisNZ · Posted 08-08-2019 01:09 AM

If you want to keep the same merge logic, but use a hash table instead of the dreadfully slow POINT= method, this could work:

data FNL.MATCHDATE_INPINP;
  set ETL.MASTER_INPATIENT;
  if 0 then set ETL.FILTERED_OUTPATIENT;
  dcl hash F_OUT(dataset:'ETL.FILTERED_OUTPATIENT',multidata:'y');
  F_OUT.defineKey('ID');
  F_OUT.defineData('THRU_DT','PRVDR_NUM');
  F_OUT.defineDone();
  F_OUT.reset_dup();
  do while(F_OUT.do_over(key:iID) eq 0);
    DIF = abs(IADMSN_DT-THRU_DT);
    if DIF <= 1 then output;
    putlog iID= ID= IADMSN_DT= THRU_DT= DIF =;
  end;
  putlog '===========================';
run;

High-Performance SAS Coding - Third Edition

rowlinglu · Posted 08-08-2019 10:24 AM

Thank you very much @ChrisNZ , I will look into your code and let you know if it works!

rowlinglu · Posted 08-09-2019 06:39 PM

@ChrisNZI got "Insufficient memory to data step program. The SAS system stopped processing this step because of insufficient memory." when using hash, is there any way to minimize memory usage? My master dataset is too large (filtered one is okay). I have increased SAS memory size to max.

ChrisNZ · Posted 08-09-2019 10:54 PM

When using already sorted whole data sets, a merge is normally faster than any other method.

If you wanted to be able to load the hash table in memory, you need to increase MEMSIZE and/or to decrease data size, for example with 4-byte long dates and rightly sized strings ($1 for you it seems).

In some cases hash tables are just not a possible solution when the size of data is too large.

High-Performance SAS Coding - Third Edition

hashman · Posted 08-10-2019 12:14 AM

@ChrisNZ:

In this case, the size of data doesn't matter - it's way too small to crash the hash on memory. What @rowlinglu didn't notice is that you had forgotten (beyond a shadow of a doubt, purely absent-mindedly) to include the _N_=1 condition to qualify the hash object declaration and instantiation. Should be:

data FNL.MATCHDATE_INPINP;
  set ETL.MASTER_INPATIENT;
  if _n_ = 1 then do ;
    if 0 then set ETL.FILTERED_OUTPATIENT;
    dcl hash F_OUT(dataset:'ETL.FILTERED_OUTPATIENT',multidata:'y');
    F_OUT.defineKey('ID');
    F_OUT.defineData('THRU_DT','PRVDR_NUM');
    F_OUT.defineDone();
  end ;
  F_OUT.reset_dup();
  do while(F_OUT.do_over(key:iID) eq 0);
    DIF = abs(IADMSN_DT-THRU_DT);
    if DIF <= 1 then output;
   * putlog iID= ID= IADMSN_DT= THRU_DT= DIF =;
  end;
run;

Without _N_=1, not only a new hash object instance is created for each record read from MASTER_INPATIENT, but also it gets fully loaded with the data from FILTERED_OUTPATIENT, while the previously created instants never get deleted. 8M instances of F_OUT would crash the memory even if their key and data portions were utterly empty.

Having said that, you're 100% right about the need to keep the memory-resident nature of the hash object in mind and take reasonable measures to control the lengths of the key and data portions. Saying "reasonable" because their summary length in bytes isn't an increasing monotonic but an increasing step-wise function of the summary length L of the key and data variables. For example, on 64-bit systems, the minimum hash item size is S=48 bytes regardless of L, as long as L <= 16. Adding just one byte makes it S=64, which remains such all the way to L=32, after which adding a single byte makes it S=80, and so on in S increments by 16. (The minimum S value of 48 has been my hash object pet peeve since @DonH and I ran into a practically untenable client-side situation given the available RAM resources.)

Kind regards

Paul D.

Kind regards

Paul D.

p.s. As a side note, even if that had been fixed and the OP tried the code without commenting out PUTLOG, the program would have crashed by overfilling the log ;).

ChrisNZ · Posted 08-10-2019 07:13 AM

@hashman Yes I wrote that code quickly and without data since no usable data was provided. Sorry about the oversight and thank you for fixing it.

Very interesting point about the item size increasing in steps. It's worse than what you describe if I recall correctly (I looked into this a good while ago, and probably in much less depth than you have).

On my 32 bit test system, using the wonky macro from http://support.sas.com/kb/34/193.html gives me

a 32-byte row size for one NUM key of length 8 and one CHAR data of length 8 (so 16 bytes in total)

but

a 40-byte row size for one NUM key of length 3 and one CHAR data of length 9 (so 12 bytes in total)

so less than 16 bytes of variable lengths can make the size of the hash item jump to the next increment.

The increment is 8 bytes in 32-bit systems. A 16-byte increment is very wasteful indeed.

High-Performance SAS Coding - Third Edition

DonH · Posted 08-10-2019 10:08 AM

On the issue of the step function nature of the key size, that is one of the reasons that @hashman and I have gotten into the habit of creating an alternative variable to use as a composite key. Suppose your keys are K1,K2,...Kn, you can create a field as such:

_hKey = md5(catx(":",K1,K2, .... Kn));

This creates a short key value regardless of the number, type and length of the key variables. And it runs pretty quickly so having to create this variable on any secondary set of data to enable the lookup performs quite well.

There are a number of other benefits to this approach as well. Check out Key-Independent Uniform Segmentation of Arbitrary Input Using a Hash Function presented at SASGF in 2018 for a discussion of the details.

And thanks to @hashman for copying me on this thread.

hashman · Posted 08-10-2019 02:38 PM

@DonH : Thanks for chiming in!

To give folks an even better idea of the magnitude of RAM saving that can be achieved using this subterfuge, the compound key we were having to deal with at the time when we ideated this trick could reach 500 bytes in length. Replacing it with its MD5 signature meant the hash key portion with the length of $16 instead. As they say, feel the difference (especially for a hash table with tens of millions of items).

Kind regards

Paul D.

hashman · Posted 08-10-2019 02:16 PM

@ChrisNZ :

Exactly: base 32 / increment 8 on 32-bit systems, base 48 / increment 16 on 64-bit ones.

In the case I described, @DonH and I were hit so hard by the increment 16 while aggregating using hashes because the hashes for various distinct counts had only one MD5-ed key (16 bytes) and in principle needed no data. But since you can't have a hash without at least one variable in the data portion, the natural idea was to save memory by putting a dummy $1 byte there - which turned out to be futile since for the resulting L=17 item size becomes S=64, with 47 bytes left over as memory waste. We had a bunch of hashes like that to account for different combos of real compound key variables going into MD5, and since the key was extremely discriminating, each of the hashes could have easily topped 100M items. Simple arithmetic (U=3.8 bytes of RAM per 1 useful byte stored) can explain the resulting frustration.

Kind regards

Paul D.

p.s. By comparison, hashing the same "by hand" (i.e. using an array, as in my hash efforts before 9.0) with array items $16 requires only U=1.4 even for a hash table with load factor 0.5 (i.e. half empty to practically completely avoid primary clustering if linear probing is used to resolve the key collisions). Too bad such a table has be allocated all at once at compile and can't grow and shrink at run time.

ChrisNZ · Posted 08-12-2019 12:59 AM

@hashman

> But since you can't have a hash without at least one variable in the data portion,

I don't understand this. The method definedata() does not have to be called does it?

High-Performance SAS Coding - Third Edition

hashman · Posted 08-12-2019 04:11 AM

@ChrisNZ:

You cannot have just the key portion without the data portion - this is how the hash object is designed. Unfortunately; since at times we want nothing more than finding out if a key is in the table. Hence, at least 1 hash variable has to be in the data portion.
This principle is unrelated to the way of coding. You can code the data portion explicitly using DEFINEDATA or you can just code DEFINEKEY. But:
If you code only DEFINEKEY and omit DEFINEDATA, then every hash variable specified to the former will be automatically included in the data portion, as if you've also coded DEFINEDATA with the same variables.
But if you code only DEFINEDATA without DEFINEKEY, the hash will crash.
If all you want from a hash is CHECK(), you're interested in making the data portion as small as possible to save memory. Hence, just code DEFINEDATA with some dummy variable like _DUMMY $1. It's particularly sensible if your key portion is long in bytes. But if it's short - say, one numeric variable or a character variable no longer than $8, you can simply omit DEFINEDATA, as then you will have L<=16 and get the minimal S=48, anyway.
Thus, normally, if I want the same variable(s) in the key portion and the data portion - and want the data portion variable(s) to materialize in the PDV upon retrieval, whether it's done by FIND, FIND_NEXT, DO_OVER, hash iterator, and OUTPUT - I just code DEFINEKEY with the variables I need.

As a side note, although OUTPUT seems like it copies the data portion hash variables directly from the hash table to the specified output file, in reality it first copies their values to their PDV host variables, from where they're written to the output. In fact, the PDV host variables are the only channel through which the hash object communicates its data, whether it receives it or emits it. That's why parameter type matching is so important: The hash object cannot do anything with the hash variables defined to it if the variables with the same names haven't been already defined in the PDV at compile time; and that's why DEFINEDONE checks for it and vomits on your shoes if it's not the case.

Kind regards

Paul D.

ChrisNZ · Posted 08-12-2019 05:30 AM

@hashman Thank you once again for your ever clear and detailed explanations!

High-Performance SAS Coding - Third Edition

SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Re: SAS— Merge Data Based On subject ID and Date (Within and Across Files)

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away