Merging 2 datasets with repeated measures and no unique identifier

Manhort · Posted 03-10-2022 02:48 PM

Hello all,

I am trying to merge 2 datasets based on a couple different variables but I can't manage it without having duplicated rows from the merge. Specifically, I need to end up with my 'want' table where the treatment name and date (from have2) are only added to rows in have1 if the 'drugDate' is the same as 'appointment' or is within 14 days after 'appointment' -> appointment <= drugDate <= appointment + 14

This example data would be for a single person from the full dataset.

have1

id	disease	appointment
4	x	5/9/2016
4	x	5/9/2016
4	x	5/9/2016
4	y	5/17/2016
4	y	5/17/2016
4	x	9/21/2016
4	x	9/21/2016
4	y	9/22/2016
4	x	9/22/2016
4	x	9/22/2016
4	x	9/22/2016
4	x	9/22/2016
4	x	9/22/2016
4	y	10/13/2016
4	y	10/13/2016
4	x	10/26/2016
4	x	10/26/2016
4	x	12/5/2016

have2

id	treatment	drugDate
4	a	9/21/2016
4	b	10/13/2016

want

id	disease	appointment	treatment	drugDate
4	x	5/9/2016
4	x	5/9/2016
4	x	5/9/2016
4	y	5/17/2016
4	y	5/17/2016
4	x	9/21/2016	a	9/21/2016
4	x	9/21/2016	a	9/21/2016
4	y	9/22/2016	a	9/21/2016
4	x	9/22/2016	a	9/21/2016
4	x	9/22/2016	a	9/21/2016
4	x	9/22/2016	a	9/21/2016
4	x	9/22/2016	a	9/21/2016
4	x	9/22/2016	a	9/21/2016
4	y	10/13/2016	b	10/13/2016
4	y	10/13/2016	b	10/13/2016
4	x	10/26/2016	b	10/13/2016
4	x	10/26/2016	b	10/13/2016
4	x	12/5/2016

Any help is appreciated.

Here's the input version.

DATA have1;
input id disease $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 x 5/9/2016
4 x 5/9/2016
4 x 5/9/2016
4 y 5/17/2016
4 y 5/17/2016
4 x 9/21/2016
4 x 9/21/2016
4 y 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 y 10/13/2016
4 y 10/13/2016
4 x 10/26/2016
4 x 10/26/2016
4 x 12/5/2016
;
RUN;

DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016
4 b 10/13/2016

;
RUN;

mkeintz · Posted 03-10-2022 03:30 PM

Edit note: I've struck out and added a couple of comments, once I decided to convert your test data to a SAS dataset.

Assuming that

HAVE1 and HAVE2 are both sorted (by ID/APPOINTMENT and ID/DRUGDATE respectively).
HAVE2 never has more than one record for a given ID/DATE

then

data want (drop=_:);
  if 0 then set have1 have2;
/*  replaced:  if lag(last.id) then call missing(of _all_);  **/

  set have2 (in=in2 keep=id drugdate    rename=(drugdate=_matchdate))
      have1 (in=in1 keep=id appointment rename=(appointment=_matchdate));
  by id _matchdate;
  if first.id then call missing(of _all_);

  if in2 then set have2;
  if in1 then set have1;
  if in1;
  if appointment-14>drugdate then call missing(treatment,drugdate);
run;

This relies on the following very useful and infrequently used behavior of SAS: namely that a variable read in via SET (or MERGE) is retained over multiple observations until another SET or MERGE is applied for that variable.

So the technique is this

Read in just the sort variables (ID and the dates, both renamed to _MATCHDATE), and request the IN= dummry vars
If IN1 then read in all vars from HAVE1, same with IN2 and HAVE2. This means that if a HAVE2 record is followed by multiple HAVE1 records, the vars from HAVE2 are retained until the next HAVE2.
output only records for which there is a HAVE1 observation, so a HAVE2 that falls between two HAVE1's will not generate an output record, although its TREATMENT and DRUGDATE values will be available. Thats the subsetting IF IN1 statement.
To eliminate stale DRUGDATE data, compare to appointment-14.
New comment: at the beginning of each ID, set the vars to missing, so that a DRUGDATE from one ID is not carried into the next ID.

The "if 0 then set" statement is there for one ~~two~~ reasons:

To preserve variable order from HAVE1 followed by vars from HAVE2.
~~To allow the subsequent call missing(of _all_) to set all vars to missing just prior to starting a new ID~~

Note the first executable SET statement has HAVE2 prior to HAVB1. So when there is a tied date, the DRUGDATE and TREATMENT vars are established prior to the next record with same date and IN1=1. The IN1 would generate an output record, and you want to have the HAVE2 data already in hand for the output.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Manhort · Posted 03-10-2022 04:59 PM

This is awesome. And you actually made me realize this issue I do in fact have:

HAVE2 never has more than one record for a given ID/DATE

If my have2 looks like this does this method just fall apart?

DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016

4 b 9/21/2016
4 b 10/13/2016

;
RUN;

Tom · Posted 03-10-2022 05:08 PM

If the treatment dataset only has one observation per ID then just use MERGE with ID as the BY variable.

If the treatment dataset has multiple observations per day then reduce it to one observation per day. Do it in a way that meets your needs. So if your example means that ID 4 took A and B on Sep21 then perhaps you want to set treatment to A+B instead.

data trt2;
  do until(last.drugdate);
     set trt;
     by id drugdate;
     length trt2 $100;
     trt2=catx('+',trt2,treatment);
  end;
run;

mkeintz · Posted 03-10-2022 05:20 PM

@Tom wrote:

If the treatment dataset only has one observation per ID then just use MERGE with ID as the BY variable.

Yes, but (1) there is a secondary by variable (date) that will force MERGE to drop the drugdate the moment the date changes. The OP wants to keep the drug info for 14 days, and not propagate it further throughout the ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 03-10-2022 05:24 PM

@Manhort wrote:

This is awesome. And you actually made me realize this issue I do in fact have:

HAVE2 never has more than one record for a given ID/DATE

If my have2 looks like this does this method just fall apart?

DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016

4 b 9/21/2016
4 b 10/13/2016

;
RUN;

The logic I showed would successfully propagate only the last DRUG data for a given DATE. In the no-more-than-one-drug-per-date scenario, the latest drug just replaces any earlier drug that has not yet become stale (14 days). The earlier drug only appears up to the day before the current drug. In the case you present here (two drugs, same date) the second immediately displaces the first.

But then you haven't sufficiently described how you would want your results to look when there are multiple drugs per date.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Manhort · Posted 03-10-2022 05:51 PM

You're right. Maybe I should rethink the final structure I require now. Something like this maybe. I've added a second subject for context and the original subject now has an overlapping drug date. The 'want' now has dummy variables for whether a person received a treatment that day.

want

id	disease	appointment	drugDate	trt a	trt b
3	x	5/9/2016	5/9/2016	1	0
3	x	5/9/2016	5/9/2016	1	0
3	x	5/12/2016	5/9/2016	1	0
3	x	5/30/2016		0	0
4	x	5/9/2016		0	0
4	x	5/9/2016		0	0
4	x	5/9/2016		0	0
4	y	5/17/2016		0	0
4	y	5/17/2016		0	0
4	x	9/21/2016	9/21/2016	1	1
4	x	9/21/2016	9/21/2016	1	1
4	y	9/22/2016	9/21/2016	1	1
4	x	9/22/2016	9/21/2016	1	1
4	x	9/22/2016	9/21/2016	1	1
4	x	9/22/2016	9/21/2016	1	1
4	x	9/22/2016	9/21/2016	1	1
4	x	9/22/2016	9/21/2016	1	1
4	y	10/13/2016	10/13/2016	0	1
4	y	10/13/2016	10/13/2016	0	1
4	x	10/26/2016	10/13/2016	0	1
4	x	10/26/2016	10/13/2016	0	1
4	x	12/5/2016		0	0

DATA have1;
input id disease $ appointment mmddyy10.;
format appointment mmddyy10.;
CARDS;
3 x 5/9/2016
3 x 5/9/2016
3 x 5/12/2016
3 x 5/30/2016
4 x 5/9/2016
4 x 5/9/2016
4 x 5/9/2016
4 y 5/17/2016
4 y 5/17/2016
4 x 9/21/2016
4 x 9/21/2016
4 y 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 y 10/13/2016
4 y 10/13/2016
4 x 10/26/2016
4 x 10/26/2016
4 x 12/5/2016
;
RUN;

DATA have2;
input id treatment $ drugDate mmddyy10.;
format drugDate mmddyy10.;
CARDS;
3 a 5/9/2016
4 a 9/21/2016
4 b 9/21/2016
4 b 10/13/2016
;
RUN;

mkeintz · Posted 03-10-2022 08:38 PM

In HAVE2, I think either the 2nd or 3rd row should have treatment b.

Beyond that, I don't think your WANT is consistent with HAVE1 and HAVE2. For example in ID4, how can you have DRUGDATE=10/14/2016, when HAVE2 has no such date. And how can appointment on 10/26/2016 show no DRUGDATE or TRT values, when the most recent drugdate is 10/13/2016, less than 14 days prior?

And finally, please provide the sample data in the form of a DATA step, which makes it much easier, and therefore much more likely, that you will get responses that have been tested.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Manhort · Posted 03-10-2022 10:20 PM

Sorry I made errors when pasting values on the tables. I fixed them above. I included a data step for the have1 and have2 data.

Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Re: Merging 2 datasets with repeated measures and no unique identifier

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!