BookmarkSubscribeRSS Feed
Manhort
Obsidian | Level 7

Hello all,

 

I am trying to merge 2 datasets based on a couple different variables but I can't manage it without having duplicated rows from the merge. Specifically, I need to end up with my 'want' table where the treatment name and date (from have2) are only added to rows in have1 if the 'drugDate' is the same as 'appointment' or is within 14 days after 'appointment' ->  appointment <= drugDate <= appointment + 14

 

This example data would be for a single person from the full dataset.

 

have1  
   
iddiseaseappointment
4x5/9/2016
4x5/9/2016
4x5/9/2016
4y5/17/2016
4y5/17/2016
4x9/21/2016
4x9/21/2016
4y9/22/2016
4x9/22/2016
4x9/22/2016
4x9/22/2016
4x9/22/2016
4x9/22/2016
4y10/13/2016
4y10/13/2016
4x10/26/2016
4x10/26/2016
4x12/5/2016

 

have2  
   
idtreatmentdrugDate
4a9/21/2016
4b10/13/2016

 

want    
     
iddiseaseappointmenttreatmentdrugDate
4x5/9/2016  
4x5/9/2016  
4x5/9/2016  
4y5/17/2016  
4y5/17/2016  
4x9/21/2016a9/21/2016
4x9/21/2016a9/21/2016
4y9/22/2016a9/21/2016
4x9/22/2016a9/21/2016
4x9/22/2016a9/21/2016
4x9/22/2016a9/21/2016
4x9/22/2016a9/21/2016
4x9/22/2016a9/21/2016
4y10/13/2016b10/13/2016
4y10/13/2016b10/13/2016
4x10/26/2016b10/13/2016
4x10/26/2016b10/13/2016
4x12/5/2016  

 

Any help is appreciated.

Here's the input version.


DATA have1;
input id disease $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 x 5/9/2016
4 x 5/9/2016
4 x 5/9/2016
4 y 5/17/2016
4 y 5/17/2016
4 x 9/21/2016
4 x 9/21/2016
4 y 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 y 10/13/2016
4 y 10/13/2016
4 x 10/26/2016
4 x 10/26/2016
4 x 12/5/2016
;
RUN;


DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016
4 b 10/13/2016

;
RUN;

 

8 REPLIES 8
mkeintz
PROC Star

Edit note:  I've struck out and added a couple of comments, once I decided to convert your test data to a SAS dataset.

 

 

Assuming that

  1. HAVE1 and HAVE2 are both sorted (by ID/APPOINTMENT  and ID/DRUGDATE respectively).
  2. HAVE2 never has more than one record for a given ID/DATE

then

data want (drop=_:);
  if 0 then set have1 have2;
/*  replaced:  if lag(last.id) then call missing(of _all_);  **/

  set have2 (in=in2 keep=id drugdate    rename=(drugdate=_matchdate))
      have1 (in=in1 keep=id appointment rename=(appointment=_matchdate));
  by id _matchdate;
  if first.id then call missing(of _all_);

  if in2 then set have2;
  if in1 then set have1;
  if in1;
  if appointment-14>drugdate then call missing(treatment,drugdate);
run;

This relies on the following very useful and infrequently used behavior of SAS: namely that a variable read in via SET (or MERGE) is retained over multiple observations until another SET or MERGE is applied for that variable.  

 

So the technique is this

  1. Read in just the sort variables (ID  and the dates, both renamed to _MATCHDATE), and request the IN= dummry vars
  2. If IN1 then read in all vars from HAVE1, same with IN2 and HAVE2.   This means that if a HAVE2 record is followed by multiple HAVE1 records, the vars from HAVE2 are retained until the next HAVE2.
  3. output only records for which there is a HAVE1 observation, so a HAVE2 that falls between two HAVE1's will not generate an output record, although its TREATMENT and DRUGDATE values will be available.   Thats the subsetting IF IN1 statement.
  4. To eliminate stale DRUGDATE data, compare to appointment-14.
  5. New comment: at the beginning of each ID, set the vars to missing, so that a DRUGDATE from one ID is not carried into the next ID.

 

The "if 0 then set" statement is there for one two reasons:

  1. To preserve variable order from HAVE1 followed by vars from HAVE2.  
  2. To allow the subsequent call missing(of _all_) to set all vars to missing just prior to starting a new ID

 

Note the first executable SET statement has HAVE2 prior to HAVB1.  So when there is a tied date, the DRUGDATE and TREATMENT vars are established prior to the next record with same date and IN1=1.   The IN1 would generate an output record, and you want to have the HAVE2 data already in hand for the output.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Manhort
Obsidian | Level 7

This is awesome. And you actually made me realize this issue I do in fact have:

 

  1. HAVE2 never has more than one record for a given ID/DATE

 

If my have2 looks like this does this method just fall apart?

 

DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016

4 b 9/21/2016
4 b 10/13/2016

;
RUN;

Tom
Super User Tom
Super User

If the treatment dataset only has one observation per ID then just use MERGE with ID as the BY variable.

If the treatment dataset has multiple observations per day then reduce it to one observation per day.  Do it in a way that meets your needs.  So if your example means that ID 4 took A and B on Sep21 then perhaps you want to set treatment to A+B instead.

data trt2;
  do until(last.drugdate);
     set trt;
     by id drugdate;
     length trt2 $100;
     trt2=catx('+',trt2,treatment);
  end;
run;
mkeintz
PROC Star

@Tom wrote:

If the treatment dataset only has one observation per ID then just use MERGE with ID as the BY variable.


Yes, but (1) there is a secondary by variable (date) that will force MERGE to drop the drugdate the moment the date changes.  The OP wants to keep the drug info for 14 days, and not propagate it further throughout the ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

@Manhort wrote:

This is awesome. And you actually made me realize this issue I do in fact have:

 

  1. HAVE2 never has more than one record for a given ID/DATE

 

If my have2 looks like this does this method just fall apart?

 

DATA have2;
input id treatment $ date mmddyy10.;
format date mmddyy10.;
CARDS;
4 a 9/21/2016

4 b 9/21/2016
4 b 10/13/2016

;
RUN;


The logic I showed would successfully propagate only the last DRUG data for a given DATE.   In the no-more-than-one-drug-per-date scenario, the latest drug just replaces any earlier drug that has not yet become stale (14 days).  The earlier drug only appears up to the day before the current drug.  In the case you present here (two drugs, same date) the second immediately displaces the first.

 

But then you haven't sufficiently described how you would want your results to look when there are multiple drugs per date. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Manhort
Obsidian | Level 7

You're right. Maybe I should rethink the final structure I require now. Something like this maybe. I've added a second subject for context and the original subject now has an overlapping drug date. The 'want' now has dummy variables for whether a person received a treatment that day.

want

iddiseaseappointmentdrugDatetrt atrt b
3x5/9/20165/9/201610
3x5/9/20165/9/201610
3x5/12/20165/9/201610
3x5/30/2016 00
4x5/9/2016 00
4x5/9/2016 00
4x5/9/2016 00
4y5/17/2016 00
4y5/17/2016 00
4x9/21/20169/21/201611
4x9/21/20169/21/201611
4y9/22/20169/21/201611
4x9/22/20169/21/201611
4x9/22/20169/21/201611
4x9/22/20169/21/201611
4x9/22/20169/21/201611
4x9/22/20169/21/201611
4y10/13/201610/13/201601
4y10/13/201610/13/201601
4x10/26/201610/13/201601
4x10/26/201610/13/201601
4x12/5/2016 00

 

DATA have1;
input id disease $ appointment mmddyy10.;
format appointment mmddyy10.;
CARDS;
3 x 5/9/2016
3 x 5/9/2016
3 x 5/12/2016
3 x 5/30/2016
4 x 5/9/2016
4 x 5/9/2016
4 x 5/9/2016
4 y 5/17/2016
4 y 5/17/2016
4 x 9/21/2016
4 x 9/21/2016
4 y 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 x 9/22/2016
4 y 10/13/2016
4 y 10/13/2016
4 x 10/26/2016
4 x 10/26/2016
4 x 12/5/2016
;
RUN;

 

DATA have2;
input id treatment $ drugDate mmddyy10.;
format drugDate mmddyy10.;
CARDS;
3 a 5/9/2016
4 a 9/21/2016
4 b 9/21/2016
4 b 10/13/2016
;
RUN;

mkeintz
PROC Star

In HAVE2, I think either the 2nd or 3rd row should have treatment b.

 

Beyond that, I don't think your WANT is consistent with HAVE1 and HAVE2.  For example in ID4, how can you have DRUGDATE=10/14/2016, when HAVE2 has no such date.  And how can appointment on 10/26/2016 show no DRUGDATE or TRT values, when the most recent drugdate is 10/13/2016, less than 14 days prior?

 

And finally, please provide the sample data in the form of a DATA step, which makes it much easier, and therefore much more likely, that you will get responses that have been tested.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Manhort
Obsidian | Level 7

Sorry I made errors when pasting values on the tables. I fixed them above. I included a data step for the have1 and have2 data. 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1034 views
  • 3 likes
  • 3 in conversation