Hello all,
I am trying to merge 2 datasets based on a couple different variables but I can't manage it without having duplicated rows from the merge. Specifically, I need to end up with my 'want' table where the treatment name and date (from have2) are only added to rows in have1 if the 'drugDate' is the same as 'appointment' or is within 14 days after 'appointment' -> appointment <= drugDate <= appointment + 14
This example data would be for a single person from the full dataset.
have1 | ||
id | disease | appointment |
4 | x | 5/9/2016 |
4 | x | 5/9/2016 |
4 | x | 5/9/2016 |
4 | y | 5/17/2016 |
4 | y | 5/17/2016 |
4 | x | 9/21/2016 |
4 | x | 9/21/2016 |
4 | y | 9/22/2016 |
4 | x | 9/22/2016 |
4 | x | 9/22/2016 |
4 | x | 9/22/2016 |
4 | x | 9/22/2016 |
4 | x | 9/22/2016 |
4 | y | 10/13/2016 |
4 | y | 10/13/2016 |
4 | x | 10/26/2016 |
4 | x | 10/26/2016 |
4 | x | 12/5/2016 |
have2 | ||
id | treatment | drugDate |
4 | a | 9/21/2016 |
4 | b | 10/13/2016 |
want | ||||
id | disease | appointment | treatment | drugDate |
4 | x | 5/9/2016 | ||
4 | x | 5/9/2016 | ||
4 | x | 5/9/2016 | ||
4 | y | 5/17/2016 | ||
4 | y | 5/17/2016 | ||
4 | x | 9/21/2016 | a | 9/21/2016 |
4 | x | 9/21/2016 | a | 9/21/2016 |
4 | y | 9/22/2016 | a | 9/21/2016 |
4 | x | 9/22/2016 | a | 9/21/2016 |
4 | x | 9/22/2016 | a | 9/21/2016 |
4 | x | 9/22/2016 | a | 9/21/2016 |
4 | x | 9/22/2016 | a | 9/21/2016 |
4 | x | 9/22/2016 | a | 9/21/2016 |
4 | y | 10/13/2016 | b | 10/13/2016 |
4 | y | 10/13/2016 | b | 10/13/2016 |
4 | x | 10/26/2016 | b | 10/13/2016 |
4 | x | 10/26/2016 | b | 10/13/2016 |
4 | x | 12/5/2016 |
Any help is appreciated.
Here's the input version.
DATA have1;
input id disease $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 x 5/9/2016
4 x 5/9/2016
4 x 5/9/2016
4 y 5/17/2016
4 y 5/17/2016
4 x 9/21/2016
4 x 9/21/2016
4 y 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 y 10/13/2016
4 y 10/13/2016
4 x 10/26/2016
4 x 10/26/2016
4 x 12/5/2016
;
RUN;
DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016
4 b 10/13/2016
;
RUN;
Edit note: I've struck out and added a couple of comments, once I decided to convert your test data to a SAS dataset.
Assuming that
then
data want (drop=_:);
if 0 then set have1 have2;
/* replaced: if lag(last.id) then call missing(of _all_); **/
set have2 (in=in2 keep=id drugdate rename=(drugdate=_matchdate))
have1 (in=in1 keep=id appointment rename=(appointment=_matchdate));
by id _matchdate;
if first.id then call missing(of _all_);
if in2 then set have2;
if in1 then set have1;
if in1;
if appointment-14>drugdate then call missing(treatment,drugdate);
run;
This relies on the following very useful and infrequently used behavior of SAS: namely that a variable read in via SET (or MERGE) is retained over multiple observations until another SET or MERGE is applied for that variable.
So the technique is this
The "if 0 then set" statement is there for one two reasons:
Note the first executable SET statement has HAVE2 prior to HAVB1. So when there is a tied date, the DRUGDATE and TREATMENT vars are established prior to the next record with same date and IN1=1. The IN1 would generate an output record, and you want to have the HAVE2 data already in hand for the output.
This is awesome. And you actually made me realize this issue I do in fact have:
If my have2 looks like this does this method just fall apart?
DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016
4 b 9/21/2016
4 b 10/13/2016
;
RUN;
If the treatment dataset only has one observation per ID then just use MERGE with ID as the BY variable.
If the treatment dataset has multiple observations per day then reduce it to one observation per day. Do it in a way that meets your needs. So if your example means that ID 4 took A and B on Sep21 then perhaps you want to set treatment to A+B instead.
data trt2;
do until(last.drugdate);
set trt;
by id drugdate;
length trt2 $100;
trt2=catx('+',trt2,treatment);
end;
run;
@Tom wrote:
If the treatment dataset only has one observation per ID then just use MERGE with ID as the BY variable.
Yes, but (1) there is a secondary by variable (date) that will force MERGE to drop the drugdate the moment the date changes. The OP wants to keep the drug info for 14 days, and not propagate it further throughout the ID.
@Manhort wrote:
This is awesome. And you actually made me realize this issue I do in fact have:
- HAVE2 never has more than one record for a given ID/DATE
If my have2 looks like this does this method just fall apart?
DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/20164 b 9/21/2016
4 b 10/13/2016;
RUN;
The logic I showed would successfully propagate only the last DRUG data for a given DATE. In the no-more-than-one-drug-per-date scenario, the latest drug just replaces any earlier drug that has not yet become stale (14 days). The earlier drug only appears up to the day before the current drug. In the case you present here (two drugs, same date) the second immediately displaces the first.
But then you haven't sufficiently described how you would want your results to look when there are multiple drugs per date.
You're right. Maybe I should rethink the final structure I require now. Something like this maybe. I've added a second subject for context and the original subject now has an overlapping drug date. The 'want' now has dummy variables for whether a person received a treatment that day.
want
id | disease | appointment | drugDate | trt a | trt b |
3 | x | 5/9/2016 | 5/9/2016 | 1 | 0 |
3 | x | 5/9/2016 | 5/9/2016 | 1 | 0 |
3 | x | 5/12/2016 | 5/9/2016 | 1 | 0 |
3 | x | 5/30/2016 | 0 | 0 | |
4 | x | 5/9/2016 | 0 | 0 | |
4 | x | 5/9/2016 | 0 | 0 | |
4 | x | 5/9/2016 | 0 | 0 | |
4 | y | 5/17/2016 | 0 | 0 | |
4 | y | 5/17/2016 | 0 | 0 | |
4 | x | 9/21/2016 | 9/21/2016 | 1 | 1 |
4 | x | 9/21/2016 | 9/21/2016 | 1 | 1 |
4 | y | 9/22/2016 | 9/21/2016 | 1 | 1 |
4 | x | 9/22/2016 | 9/21/2016 | 1 | 1 |
4 | x | 9/22/2016 | 9/21/2016 | 1 | 1 |
4 | x | 9/22/2016 | 9/21/2016 | 1 | 1 |
4 | x | 9/22/2016 | 9/21/2016 | 1 | 1 |
4 | x | 9/22/2016 | 9/21/2016 | 1 | 1 |
4 | y | 10/13/2016 | 10/13/2016 | 0 | 1 |
4 | y | 10/13/2016 | 10/13/2016 | 0 | 1 |
4 | x | 10/26/2016 | 10/13/2016 | 0 | 1 |
4 | x | 10/26/2016 | 10/13/2016 | 0 | 1 |
4 | x | 12/5/2016 | 0 | 0 |
DATA have1;
input id disease $ appointment mmddyy10.;
format appointment mmddyy10.;
CARDS;
3 x 5/9/2016
3 x 5/9/2016
3 x 5/12/2016
3 x 5/30/2016
4 x 5/9/2016
4 x 5/9/2016
4 x 5/9/2016
4 y 5/17/2016
4 y 5/17/2016
4 x 9/21/2016
4 x 9/21/2016
4 y 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 y 10/13/2016
4 y 10/13/2016
4 x 10/26/2016
4 x 10/26/2016
4 x 12/5/2016
;
RUN;
DATA have2;
input id treatment $ drugDate mmddyy10.;
format drugDate mmddyy10.;
CARDS;
3 a 5/9/2016
4 a 9/21/2016
4 b 9/21/2016
4 b 10/13/2016
;
RUN;
In HAVE2, I think either the 2nd or 3rd row should have treatment b.
Beyond that, I don't think your WANT is consistent with HAVE1 and HAVE2. For example in ID4, how can you have DRUGDATE=10/14/2016, when HAVE2 has no such date. And how can appointment on 10/26/2016 show no DRUGDATE or TRT values, when the most recent drugdate is 10/13/2016, less than 14 days prior?
And finally, please provide the sample data in the form of a DATA step, which makes it much easier, and therefore much more likely, that you will get responses that have been tested.
Sorry I made errors when pasting values on the tables. I fixed them above. I included a data step for the have1 and have2 data.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.