DATA Step, Macro, Functions and more

Merge/join date ranges

Reply
Contributor
Posts: 40

Merge/join date ranges

How do I merge/join two datesets with date ranges?

I have two datasets a and b and they have four variables.

a: id, start, end, val_a

b: id, start, end, val_b

Date ranges in a single dataset for at single id is never overlapping, but there are missing periods.

Example:

A:

1, 2001-01-01. 2001-01-31, 100

1, 2001-02-01. 2001-03-31, 200

1, 2002-01-01. 2002-01-31, 300

B:

1, 2000-01-01, 2001-01-14, 111

1, 2001-01-20, 2001-08-31, 222

1, 2002-01-01, 2002-01-14, 333

Result shold be id, start, end, val_a, val_b

1, 2000-01-01,     2000-12-31,     [missing],     111

1, 2001-01-01,     2001-01-14,     100,             111

1, 2001-01-15,     2001-01-19,     100,             [missing]

1, 2001-01-20,     2001-01-31,     100,             222

1, 2001-02-01,     2001-03-31,     200,             222

1, 2001-03-01,     2001-08-31,     [missing],     222

1, 2001-09-01,     2001-12-31,     [missing],     [missing]

1, 2002-01-01,     2002-01-14,     [missing],     333

1, 2002-01-15,     2002-01-31,     300,             333

Thanks in advance Smiley Happy

Super User
Posts: 9,682

Re: Merge/join date ranges

your question is indeed not easy.

data A;
input id start : yymmdd10. end : yymmdd10. val_a;
format start end yymmdd10.;
cards;
1 2001-01-01 2001-01-31 100
1 2001-02-01 2001-03-31 200
1 2002-01-01 2002-01-31 300
;
run;

data B;
input id start : yymmdd10. end : yymmdd10. val_b;
format start end yymmdd10.;
cards;
1 2000-01-01 2001-01-14 111
1 2001-01-20 2001-08-31 222
1 2002-01-01 2002-01-14 333
;
run;

data aa(drop=start end);
 set a;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
run;


data bb(drop=start end);
 set b;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
run;

data temp;
 merge aa bb;
 by id date;
 run;
data temp;
 set temp;
 if (val_a ne lag(val_a)) or (val_b ne lag(val_b)) then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=date;
 if last.group then do;
                      end=date;output;
                     end;
 drop date group;
 format start end yymmdd10.;
run;
data want;
 merge want want(firstobs=2 keep=id start rename=(id=_id start=_start));
 output;
 if id=_id and end+1 ne _start then do;
                                      call missing(val_a,val_b);
                                      start=end+1;end=_start-1;output;
                                     end;
 drop _:;
run;
 

Ksharp

Contributor
Posts: 40

Re: Merge/join date ranges

Thanks.. it works, and with some small ajustments it works on real data.

(in the real data there are multible values val_a1, val_a2..... and often end = [missing]. )

But I was looking for at solution where I did't have to go through step aa and bb. The date ranges is quit big, multible years, and the aa and bb datasets becomes very large.


Super User
Posts: 5,085

Re: Merge/join date ranges

It's a nice solution.  First, to address your concerns, use views instead of data sets.  Three lines have to change:

data aa (drop=start end) / view=aa;

data bb (drop=start end) / view=bb;

data temp / view=temp;

This probably won't impact the programming speed, but it will cut down on storage space requirements.

Also note, I'm not sure you really need the last step.  It looks like an attempt to fill in all time periods, even if they don't all appear in the original data.  You have to decide if that's a requirement or not.

Good luck.

Also note, once you use views, you can't use:

data temp;

set temp;

You would have to switch to:

data temp2 / view=temp2;

set temp;

Then use temp2 instead of temp later.

Super User
Posts: 9,682

Re: Merge/join date ranges

Sorry. I found a problem. If you have multiple ID , you need add one more condition to rectify it .

if (val_a ne lag(val_a)) or (val_b ne lag(val_b)) or id ne lag(id)   then group+1;

Contributor
Posts: 21

Re: Merge/join date ranges

Here is a way that you don't have to expand your date for each range. however, you still have to go through each A and B dataset and extract the unique START and END. then you can build a time table based upon it before merge with dataset A and B to get the ValA and ValB.

one assumption was made is no single date in your dataset can be both START and END. if you do, you need to give a rule to handle that.

here it goes,

data A;

infile cards dsd;

input ID start :yymmdd10. end :yymmdd10. valA;

format start end yymmddd10.;

cards;

1, 2001-01-01, 2001-01-31, 100

1, 2001-02-01, 2001-03-31, 200

1, 2002-01-01, 2002-01-31, 300

;

run;

data B;

infile cards dsd;

input ID start :yymmdd10. end :yymmdd10. valB;

format start end yymmddd10.;

cards;

1, 2000-01-01, 2001-01-14, 111

1, 2001-01-20, 2001-08-31, 222

1, 2002-01-01, 2002-01-14, 333

;

run;

/*Get unique time point from A*/

data DateA;

set A;

date=start; origin='S';

output;

date=end;origin='E';

output;

keep ID date origin ;

format date yymmddd10.;

run;

/*Get unique time point from B*/

data DateB;

set B;

date=start; origin='S';

output;

date=end;origin='E';

output;

keep ID date origin;

format date yymmddd10.;

run;

data alldate;

set dateA dateB;

run;

proc sort data=alldate nodupkey;

by ID date  origin;

run;

/*merge date point from A and B with No by statement.

   date -1, if origin of end date is the START in original dataset

   date+1 if if origin of start date is the END in original dataset*/

data alldate1;

merge alldate alldate(firstobs=2 rename=(ID=nextID date=NewEND origin=nextori));

if nextori='S' and origin='S' then NewEND=NewEND-1;

else if origin='E' then date=date+1;

if ID~=nextID or NewEND=date then delete;

rename date=NewSTART;

drop origin nextori nextID;

run;

proc sql;

create table want (rename=(newstart=Start newEND=END)as

select fst.*,valB from

(select  alldate1.*, valA

      from alldate1  left join A  on alldate1.ID=A.ID and

           (newSTART>=start and NewEND<=END)) As fst left join B

      on fst.ID=B.ID and

           (fst.newstart>=B.start and fst.newEND<=end)

      

               order by newstart        ;

quit;

**********************output**************************************

                     Obs    ID      NewSTART        NewEND    valA    valB

                      1      1    2000-01-01    2000-12-31       .     111

                      2      1    2001-01-01    2001-01-14     100     111

                      3      1    2001-01-15    2001-01-20     100       .

                      4      1    2001-01-20    2001-01-31     100     222

                      5      1    2001-02-01    2001-03-31     200     222

                      6      1    2001-04-01    2001-08-31       .     222

                      7      1    2001-09-01    2002-01-01       .       .

                      8      1    2002-01-01    2002-01-14     300     333

                      9      1    2002-01-15    2002-01-31     300       .

HTH

Yu

Ask a Question
Discussion stats
  • 5 replies
  • 1382 views
  • 6 likes
  • 4 in conversation