Solved: Re: merge two data sets with multiple records

elias_2020 · Posted 12-04-2017 07:43 PM

Hi there,

I would like to merge two data sets when the date of the first data set is equal or the closet date to the second date set date.

example

data diagonsed;
   input  id   diag_date mmddyy10.  disease $;
   format diag_date mmddyy10.;
   datalines;
1 05/23/10  C
1 05/26/10  C
1 01/01/11  B
;
run;



data disharge;
   input  id   dis_date mmddyy10.  disease $ hospital$;
   format dis_date mmddyy10.;
   datalines;
1 05/21/10  C    W
1 05/24/10  C    N
1 05/25/10  C    G
1 05/27/10  A    R
1 05/29/10  C    R
1 05/28/10  B    W
1 09/15/10  D    S
1 12/29/10  C    F
1 12/31/10  A    F
1 01/02/11  B    S
1 01/05/11  B    N
;
run;

/*i want to know the name of the hospital for those who were diagnosed for specific disease as in only diagnosed data set*/
/* The diagnosed date should be less or equal to the discharge date (minimum value=dis_date- diag_date) for the same type of disease in the first data set
*/
/* The correct answer should be as below: 
id   diag_date  dis_date  disease    hospital 
1    05/23/10   05/24/10     C             N
1    05/26/10   05/29/10     C             R
1    01/01/11   01/02/11     B              S

for the same type of disease in both data sets, difference_Date=min( dis_date - diag_date ) and it should equal zero or grater than zero.

Could you please give me any suggestion.

Thanks

Ksharp · Posted 12-05-2017 08:02 AM

How about this one ?

data diagnosed;
   input  id   diag_date mmddyy10.  disease $;
   format diag_date mmddyy10.;
   datalines;
1 05/23/10  C
1 05/26/10  C
1 01/01/11  B
;

data discharge;
   input  id   dis_date mmddyy10.  disease $ hospital$;
   format dis_date mmddyy10.;
   datalines;
1 05/21/10  C    W
1 05/24/10  C    N
1 05/25/10  C    G
1 05/27/10  A    R
1 05/29/10  C    R
1 05/28/10  B    W
1 09/15/10  D    S
1 12/29/10  C    F
1 12/31/10  A    F
1 01/02/11  B    S
1 01/05/11  B    N
;
proc sql;
select a.*,b.dis_date,b.hospital
 from diagnosed as a ,discharge as b
  where a.id=b.id and  b.dis_date-a.diag_date>0
   group by a.id,a.diag_date,a.disease
    having b.dis_date-a.diag_date=min(b.dis_date-a.diag_date);
quit;

View solution in original post

Reeza · Posted 12-04-2017 10:26 PM

Search on here. You’ll find plenty of examples.

You can join on nearest using SQL easier than you can via a data step.

If your data is not large, you can do a cross join and filter on anything with the date > date2. Then use a data step to pick the lowest/best match.

elias_2020 · Posted 12-04-2017 11:48 PM

Thank you Reeza.

My data sets is too large and I would like to merge the two data set if the type of disease for both data sets are the same and the date should be the nearest date.

PGStats · Posted 12-04-2017 11:59 PM

How about like this:

data diagnosed;
   input  id   diag_date mmddyy10.  disease $;
   format diag_date mmddyy10.;
   datalines;
1 05/23/10  C
1 05/26/10  C
1 01/01/11  B
;

data discharge;
   input  id   dis_date mmddyy10.  disease $ hospital$;
   format dis_date mmddyy10.;
   datalines;
1 05/21/10  C    W
1 05/24/10  C    N
1 05/25/10  C    G
1 05/27/10  A    R
1 05/29/10  C    R
1 05/28/10  B    W
1 09/15/10  D    S
1 12/29/10  C    F
1 12/31/10  A    F
1 01/02/11  B    S
1 01/05/11  B    N
;

proc sort data=diagnosed; by id disease diag_date; run;
proc sort data=discharge; by id disease dis_date; run;

data all;
retain diag_date;
merge 
    diagnosed(in=ina rename=diag_date=date) 
    discharge(in=inb rename=dis_date=date);
by id disease date;
if first.disease then call missing(diag_date);
if ina then diag_date = date;
if inb and not missing(diag_date) then do;
    dis_date=date;
    output;
    call missing(diag_date);
    end;
format diag_date dis_date mmddyy10.;
drop date;
run;

proc print noobs; run;

PG

mkeintz · Posted 12-05-2017 08:15 PM

I think this is a case where "set ... by ..." is simpler than "merge ... by ...",

data diagonsed;
  input  id   diag_date mmddyy10.  disease $;
  format diag_date mmddyy10.;
  datalines;
1 05/23/10  C
1 05/26/10  C
1 01/01/11  B
run;
data disharge;
  input  id   dis_date mmddyy10.  disease $ hospital$;
  format dis_date mmddyy10.;
  datalines;
1 05/21/10  C    W
1 05/24/10  C    N
1 05/25/10  C    G
1 05/27/10  A    R
1 05/29/10  C    R
1 05/28/10  B    W
1 09/15/10  D    S
1 12/29/10  C    F
1 12/31/10  A    F
1 01/02/11  B    S
1 01/05/11  B    N
run;

proc sort data=diagonsed; by id disease diag_date;run;
proc sort data=disharge; by id disease dis_date;run;


data want (rename=(date=dis_date));
  set diagonsed (in=indiag rename=(diag_date=date))
      disharge  (in=indis  rename=(dis_date=date));
  by id disease date;

  diag_date=lag(date); 
  format diag_date mmddyy10.;
  if indis=1 and lag(indiag)=1 and first.disease=0;
run;

The subsetting IF at the bottom does most of the work. It only keeps cases in which a discharge record immediately follows a diagnosis record. Adding the "and first.disease=0" condition ensures that the consecutive records are for the same id/disease.

While the "diag_date=lag(date)" sometimes will return a discharge date, all such unwanted cases are filtered out in the subsequent subsetting if.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 12-05-2017 08:02 AM

How about this one ?

data diagnosed;
   input  id   diag_date mmddyy10.  disease $;
   format diag_date mmddyy10.;
   datalines;
1 05/23/10  C
1 05/26/10  C
1 01/01/11  B
;

data discharge;
   input  id   dis_date mmddyy10.  disease $ hospital$;
   format dis_date mmddyy10.;
   datalines;
1 05/21/10  C    W
1 05/24/10  C    N
1 05/25/10  C    G
1 05/27/10  A    R
1 05/29/10  C    R
1 05/28/10  B    W
1 09/15/10  D    S
1 12/29/10  C    F
1 12/31/10  A    F
1 01/02/11  B    S
1 01/05/11  B    N
;
proc sql;
select a.*,b.dis_date,b.hospital
 from diagnosed as a ,discharge as b
  where a.id=b.id and  b.dis_date-a.diag_date>0
   group by a.id,a.diag_date,a.disease
    having b.dis_date-a.diag_date=min(b.dis_date-a.diag_date);
quit;

elias_2020 · Posted 12-05-2017 08:37 PM

Thanks Ksharp.

It works with me and I tried it for some cases and it gave me the correct answer.