How do I merge/join two datesets with date ranges?
I have two datasets a and b and they have four variables.
a: id, start, end, val_a
b: id, start, end, val_b
Date ranges in a single dataset for at single id is never overlapping, but there are missing periods.
Example:
A:
1, 2001-01-01. 2001-01-31, 100
1, 2001-02-01. 2001-03-31, 200
1, 2002-01-01. 2002-01-31, 300
B:
1, 2000-01-01, 2001-01-14, 111
1, 2001-01-20, 2001-08-31, 222
1, 2002-01-01, 2002-01-14, 333
Result shold be id, start, end, val_a, val_b
1, 2000-01-01, 2000-12-31, [missing], 111
1, 2001-01-01, 2001-01-14, 100, 111
1, 2001-01-15, 2001-01-19, 100, [missing]
1, 2001-01-20, 2001-01-31, 100, 222
1, 2001-02-01, 2001-03-31, 200, 222
1, 2001-03-01, 2001-08-31, [missing], 222
1, 2001-09-01, 2001-12-31, [missing], [missing]
1, 2002-01-01, 2002-01-14, [missing], 333
1, 2002-01-15, 2002-01-31, 300, 333
Thanks in advance
your question is indeed not easy.
data A; input id start : yymmdd10. end : yymmdd10. val_a; format start end yymmdd10.; cards; 1 2001-01-01 2001-01-31 100 1 2001-02-01 2001-03-31 200 1 2002-01-01 2002-01-31 300 ; run; data B; input id start : yymmdd10. end : yymmdd10. val_b; format start end yymmdd10.; cards; 1 2000-01-01 2001-01-14 111 1 2001-01-20 2001-08-31 222 1 2002-01-01 2002-01-14 333 ; run; data aa(drop=start end); set a; do date=start to end; output; end; format date yymmdd10.; run; data bb(drop=start end); set b; do date=start to end; output; end; format date yymmdd10.; run; data temp; merge aa bb; by id date; run; data temp; set temp; if (val_a ne lag(val_a)) or (val_b ne lag(val_b)) then group+1; run; data want; set temp; by group; retain start; if first.group then start=date; if last.group then do; end=date;output; end; drop date group; format start end yymmdd10.; run; data want; merge want want(firstobs=2 keep=id start rename=(id=_id start=_start)); output; if id=_id and end+1 ne _start then do; call missing(val_a,val_b); start=end+1;end=_start-1;output; end; drop _:; run;
Ksharp
Thanks.. it works, and with some small ajustments it works on real data.
(in the real data there are multible values val_a1, val_a2..... and often end = [missing]. )
But I was looking for at solution where I did't have to go through step aa and bb. The date ranges is quit big, multible years, and the aa and bb datasets becomes very large.
It's a nice solution. First, to address your concerns, use views instead of data sets. Three lines have to change:
data aa (drop=start end) / view=aa;
data bb (drop=start end) / view=bb;
data temp / view=temp;
This probably won't impact the programming speed, but it will cut down on storage space requirements.
Also note, I'm not sure you really need the last step. It looks like an attempt to fill in all time periods, even if they don't all appear in the original data. You have to decide if that's a requirement or not.
Good luck.
Also note, once you use views, you can't use:
data temp;
set temp;
You would have to switch to:
data temp2 / view=temp2;
set temp;
Then use temp2 instead of temp later.
Sorry. I found a problem. If you have multiple ID , you need add one more condition to rectify it .
if (val_a ne lag(val_a)) or (val_b ne lag(val_b)) or id ne lag(id) then group+1;
Here is a way that you don't have to expand your date for each range. however, you still have to go through each A and B dataset and extract the unique START and END. then you can build a time table based upon it before merge with dataset A and B to get the ValA and ValB.
one assumption was made is no single date in your dataset can be both START and END. if you do, you need to give a rule to handle that.
here it goes,
data A;
infile cards dsd;
input ID start :yymmdd10. end :yymmdd10. valA;
format start end yymmddd10.;
cards;
1, 2001-01-01, 2001-01-31, 100
1, 2001-02-01, 2001-03-31, 200
1, 2002-01-01, 2002-01-31, 300
;
run;
data B;
infile cards dsd;
input ID start :yymmdd10. end :yymmdd10. valB;
format start end yymmddd10.;
cards;
1, 2000-01-01, 2001-01-14, 111
1, 2001-01-20, 2001-08-31, 222
1, 2002-01-01, 2002-01-14, 333
;
run;
/*Get unique time point from A*/
data DateA;
set A;
date=start; origin='S';
output;
date=end;origin='E';
output;
keep ID date origin ;
format date yymmddd10.;
run;
/*Get unique time point from B*/
data DateB;
set B;
date=start; origin='S';
output;
date=end;origin='E';
output;
keep ID date origin;
format date yymmddd10.;
run;
data alldate;
set dateA dateB;
run;
proc sort data=alldate nodupkey;
by ID date origin;
run;
/*merge date point from A and B with No by statement.
date -1, if origin of end date is the START in original dataset
date+1 if if origin of start date is the END in original dataset*/
data alldate1;
merge alldate alldate(firstobs=2 rename=(ID=nextID date=NewEND origin=nextori));
if nextori='S' and origin='S' then NewEND=NewEND-1;
else if origin='E' then date=date+1;
if ID~=nextID or NewEND=date then delete;
rename date=NewSTART;
drop origin nextori nextID;
run;
proc sql;
create table want (rename=(newstart=Start newEND=END)as
select fst.*,valB from
(select alldate1.*, valA
from alldate1 left join A on alldate1.ID=A.ID and
(newSTART>=start and NewEND<=END)) As fst left join B
on fst.ID=B.ID and
(fst.newstart>=B.start and fst.newEND<=end)
order by newstart ;
quit;
**********************output**************************************
Obs ID NewSTART NewEND valA valB
1 1 2000-01-01 2000-12-31 . 111
2 1 2001-01-01 2001-01-14 100 111
3 1 2001-01-15 2001-01-20 100 .
4 1 2001-01-20 2001-01-31 100 222
5 1 2001-02-01 2001-03-31 200 222
6 1 2001-04-01 2001-08-31 . 222
7 1 2001-09-01 2002-01-01 . .
8 1 2002-01-01 2002-01-14 300 333
9 1 2002-01-15 2002-01-31 300 .
HTH
Yu
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.