Re: Merge/join date ranges

Allan_dk · Posted 09-11-2012 03:25 AM

How do I merge/join two datesets with date ranges?

I have two datasets a and b and they have four variables.

a: id, start, end, val_a

b: id, start, end, val_b

Date ranges in a single dataset for at single id is never overlapping, but there are missing periods.

Example:

A:

1, 2001-01-01. 2001-01-31, 100

1, 2001-02-01. 2001-03-31, 200

1, 2002-01-01. 2002-01-31, 300

B:

1, 2000-01-01, 2001-01-14, 111

1, 2001-01-20, 2001-08-31, 222

1, 2002-01-01, 2002-01-14, 333

Result shold be id, start, end, val_a, val_b

1, 2000-01-01, 2000-12-31, [missing], 111

1, 2001-01-01, 2001-01-14, 100, 111

1, 2001-01-15, 2001-01-19, 100, [missing]

1, 2001-01-20, 2001-01-31, 100, 222

1, 2001-02-01, 2001-03-31, 200, 222

1, 2001-03-01, 2001-08-31, [missing], 222

1, 2001-09-01, 2001-12-31, [missing], [missing]

1, 2002-01-01, 2002-01-14, [missing], 333

1, 2002-01-15, 2002-01-31, 300, 333

Thanks in advance

Ksharp · Posted 09-11-2012 03:59 AM

your question is indeed not easy.

data A;
input id start : yymmdd10. end : yymmdd10. val_a;
format start end yymmdd10.;
cards;
1 2001-01-01 2001-01-31 100
1 2001-02-01 2001-03-31 200
1 2002-01-01 2002-01-31 300
;
run;

data B;
input id start : yymmdd10. end : yymmdd10. val_b;
format start end yymmdd10.;
cards;
1 2000-01-01 2001-01-14 111
1 2001-01-20 2001-08-31 222
1 2002-01-01 2002-01-14 333
;
run;

data aa(drop=start end);
 set a;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
run;


data bb(drop=start end);
 set b;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
run;

data temp;
 merge aa bb;
 by id date;
 run;
data temp;
 set temp;
 if (val_a ne lag(val_a)) or (val_b ne lag(val_b)) then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=date;
 if last.group then do;
                      end=date;output;
                     end;
 drop date group;
 format start end yymmdd10.;
run;
data want;
 merge want want(firstobs=2 keep=id start rename=(id=_id start=_start));
 output;
 if id=_id and end+1 ne _start then do;
                                      call missing(val_a,val_b);
                                      start=end+1;end=_start-1;output;
                                     end;
 drop _:;
run;

Ksharp

Allan_dk · Posted 09-11-2012 07:07 AM

Thanks.. it works, and with some small ajustments it works on real data.

(in the real data there are multible values val_a1, val_a2..... and often end = [missing]. )

But I was looking for at solution where I did't have to go through step aa and bb. The date ranges is quit big, multible years, and the aa and bb datasets becomes very large.

Astounding · Posted 09-11-2012 10:48 AM

It's a nice solution. First, to address your concerns, use views instead of data sets. Three lines have to change:

data aa (drop=start end) / view=aa;

data bb (drop=start end) / view=bb;

data temp / view=temp;

This probably won't impact the programming speed, but it will cut down on storage space requirements.

Also note, I'm not sure you really need the last step. It looks like an attempt to fill in all time periods, even if they don't all appear in the original data. You have to decide if that's a requirement or not.

Good luck.

Also note, once you use views, you can't use:

data temp;

set temp;

You would have to switch to:

data temp2 / view=temp2;

set temp;

Then use temp2 instead of temp later.

Ksharp · Posted 09-11-2012 10:42 PM

Sorry. I found a problem. If you have multiple ID , you need add one more condition to rectify it .

if (val_a ne lag(val_a)) or (val_b ne lag(val_b)) or id ne lag(id) then group+1;

AUTigers · Posted 09-12-2012 11:56 AM

Here is a way that you don't have to expand your date for each range. however, you still have to go through each A and B dataset and extract the unique START and END. then you can build a time table based upon it before merge with dataset A and B to get the ValA and ValB.

one assumption was made is no single date in your dataset can be both START and END. if you do, you need to give a rule to handle that.

here it goes,

data A;

infile cards dsd;

input ID start :yymmdd10. end :yymmdd10. valA;

format start end yymmddd10.;

cards;

1, 2001-01-01, 2001-01-31, 100

1, 2001-02-01, 2001-03-31, 200

1, 2002-01-01, 2002-01-31, 300

;

run;

data B;

infile cards dsd;

input ID start :yymmdd10. end :yymmdd10. valB;

format start end yymmddd10.;

cards;

1, 2000-01-01, 2001-01-14, 111

1, 2001-01-20, 2001-08-31, 222

1, 2002-01-01, 2002-01-14, 333

;

run;

/*Get unique time point from A*/

data DateA;

set A;

date=start; origin='S';

output;

date=end;origin='E';

output;

keep ID date origin ;

format date yymmddd10.;

run;

/*Get unique time point from B*/

data DateB;

set B;

date=start; origin='S';

output;

date=end;origin='E';

output;

keep ID date origin;

format date yymmddd10.;

run;

data alldate;

set dateA dateB;

run;

proc sort data=alldate nodupkey;

by ID date origin;

run;

/*merge date point from A and B with No by statement.

date -1, if origin of end date is the START in original dataset

date+1 if if origin of start date is the END in original dataset*/

data alldate1;

merge alldate alldate(firstobs=2 rename=(ID=nextID date=NewEND origin=nextori));

if nextori='S' and origin='S' then NewEND=NewEND-1;

else if origin='E' then date=date+1;

if ID~=nextID or NewEND=date then delete;

rename date=NewSTART;

drop origin nextori nextID;

run;

proc sql;

create table want (rename=(newstart=Start newEND=END)as

select fst.*,valB from

(select alldate1.*, valA

from alldate1 left join A on alldate1.ID=A.ID and

(newSTART>=start and NewEND<=END)) As fst left join B

on fst.ID=B.ID and

(fst.newstart>=B.start and fst.newEND<=end)

order by newstart ;

quit;

**********************output**************************************

Obs ID NewSTART NewEND valA valB

1 1 2000-01-01 2000-12-31 . 111

2 1 2001-01-01 2001-01-14 100 111

3 1 2001-01-15 2001-01-20 100 .

4 1 2001-01-20 2001-01-31 100 222

5 1 2001-02-01 2001-03-31 200 222

6 1 2001-04-01 2001-08-31 . 222

7 1 2001-09-01 2002-01-01 . .

8 1 2002-01-01 2002-01-14 300 333

9 1 2002-01-15 2002-01-31 300 .

HTH

Yu

Merge/join date ranges