Hello
I have two data sets:
Data set a has columns ID, date, Y.
Each ID has only one row in data set a.
Data set b has columns ID, date, W.
Each ID has multiple rows (with different values of date and W).
I want to merge data set a with data set b (a left join).
Since the dates do not always match between data set a and data set b, I want to merge by the closest date.
If there are two dates in data set b that are equally close to the date in data set a, take the earlier one. For example, if the date in a is 22JAN2021 and b contains both 21JAN2021 and 23JAN2021 (each one day away), take 21JAN2021.
What is the way to perform such a merge?
data a;
format date date9.;
input id date :date9. y;
cards;
1 '22jan2021'd 10
2 '24jan2021'd 20
3 '22jan2021'd 30
;
run;
data b;
format date date9.;
input id date :date9. w;
cards;
1 '21jan2021'd 11
1 '23jan2021'd 12
1 '24jan2021'd 15
1 '18jan2021'd 13
2 '24jan2021'd 22
2 '23jan2021'd 26
2 '21jan2021'd 24
3 '23jan2021'd 35
3 '24jan2021'd 32
;
run;
proc sort data=a;
by id date;
run;
proc sort data=b;
by id date;
run;
data wanted;
merge a b;
by id and closest date; /* pseudocode - not valid SAS syntax */
run;
Here is one way.
Let me know if this works for you. 🙂
data want(drop = dif d ww);
    if _N_ = 1 then do;
        /* Load all of b into a hash object keyed by id; multidata:'Y' keeps
           every row per id. b's date is renamed to d so it does not clash
           with a's date. */
        dcl hash h(dataset : 'b(rename = date = d)', multidata : 'Y');
        h.definekey('id');
        h.definedata('d', 'w');
        h.definedone();
    end;
    set a;
    w = .; d = .; dif = 9999999;
    /* Loop over every b row for this id and remember the w whose date is
       closest. With the strict <, the first row seen at the minimum distance
       wins, so sort b by id and date first to make the earlier date win ties. */
    do while (h.do_over(key : id) = 0);
        if abs(date - d) < dif then do;
            ww = w;
            dif = abs(date - d);
        end;
    end;
    w = ww;
run;
Result:

Obs       date  id   y   w
  1  22JAN2021   1  10  11
  2  24JAN2021   2  20  22
  3  22JAN2021   3  30  35
Another method, if the dataset is too large to fit in memory:
data a;
input id date :date9. y;
format date date9.;
cards;
1 '22jan2021'd 10
2 '24jan2021'd 20
3 '22jan2021'd 30
;
data b;
input id date :date9. w;
format date date9.;
cards;
1 '21jan2021'd 11
1 '23jan2021'd 12
1 '24jan2021'd 15
1 '18jan2021'd 13
2 '24jan2021'd 22
2 '23jan2021'd 26
2 '21jan2021'd 24
3 '23jan2021'd 35
3 '24jan2021'd 32
;
proc sort data=b;
by id date;
run;
data want;
    merge
        a (in=_a)
        b (rename=(date=_date w=_w))
    ;
    by id;
    if _a;
    retain _diff b_date w;
    format b_date date9.;
    /* reset the running minimum distance at the start of each id group */
    if first.id then _diff = 999999;
    _dif = abs(date - _date);
    /* strict lt keeps the first (earlier) of two equally close dates,
       since b is sorted by id and date */
    if _dif lt _diff then do;
        b_date = _date;
        w = _w;
        _diff = _dif;
    end;
    /* output only once per id, after all b rows have been seen */
    if last.id;
    drop _date _w _diff _dif;
run;
Hi @Ronein, not a fancy solution, but we have this in our production code that queries customer transactions. A bit too old school, perhaps, but easy:
data a;
format date date9.;
input id date :date9. y;
cards;
1 '22jan2021'd 10
2 '24jan2021'd 20
3 '22jan2021'd 30
;
run;
data b;
format date date9.;
input id date :date9. w;
cards;
1 '21jan2021'd 11
1 '23jan2021'd 12
1 '24jan2021'd 15
1 '18jan2021'd 13
2 '24jan2021'd 22
2 '23jan2021'd 26
2 '21jan2021'd 24
3 '23jan2021'd 35
3 '24jan2021'd 32
;
run;
proc sql;
    create table want(drop=_:) as
    select *
    from (select a.*, w, b.date as _bd, abs(a.date - b.date) as _diff
          from a a left join b b
          on a.id = b.id
          group by a.id
          having min(_diff) = _diff)
    group by id, _diff
    having min(_bd) = _bd;
quit;
Out of curiosity, I did a compare of the three suggested methods, with some souped-up data:
data a;
do id = 1 to 1000000;
date = rand('integer','01jan2019'd,'31dec2020'd);
y = rand('uniform',0,100);
output;
end;
format date date9.;
run;
data b;
do id = 1 to 1000000;
do i = 1 to rand('integer',1,20);
date = rand('integer','01jan2019'd,'31dec2020'd);
w = rand('uniform',0,100);
output;
end;
end;
format date date9.;
drop i;
run;
proc sort data=b nodupkey;
by id date;
run;
proc sql;
create table want(drop=_:) as
select *
from (select a.*,w,b.date as _bd ,abs(a.date-b.date) as _diff
from a a left join b b
on a.id=b.id
group by a.id
having min(_diff)=_diff)
group by id, _diff
having min(_bd)=_bd;
quit;
data want;
merge
a (in=_a)
b (rename=(date=_date w=_w))
;
by id;
if _a;
retain _diff b_date w;
format b_date date9.;
if first.id then _diff = 999999;
_dif = abs(date - _date);
if _dif lt _diff
then do;
b_date =_date;
w = _w;
_diff = _dif;
end;
if last.id;
drop _date _w _diff _dif;
run;
data want(drop = dif d ww);
if _N_ = 1 then do;
dcl hash h(dataset : 'b(rename = date = d)', multidata : 'Y');
h.definekey('id');
h.definedata('d', 'w');
h.definedone();
end;
set a;
w = .; d = .; dif = 9999999;
do while (h.do_over(key : id) = 0);
if abs(date - d) < dif then do;
ww = w;
dif = abs(date - d);
end;
end;
w = ww;
run;
This is the log (University Edition on a 2015 MacBook Pro), with the echoed source lines omitted (they duplicate the code above) and the German timing notes translated:

data a:
NOTE: The data set WORK.A has 1000000 observations and 3 variables.
NOTE: DATA statement used (total processing time): real time 0.18 seconds, cpu time 0.15 seconds

data b:
NOTE: The data set WORK.B has 10495068 observations and 3 variables.
NOTE: DATA statement used (total processing time): real time 4.47 seconds, cpu time 1.57 seconds

proc sort nodupkey:
NOTE: There were 10495068 observations read from the data set WORK.B.
NOTE: 89968 observations with duplicate key values were deleted.
NOTE: The data set WORK.B has 10405100 observations and 3 variables.
NOTE: PROCEDURE SORT used (total processing time): real time 6.51 seconds, cpu time 8.57 seconds

proc sql:
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 1000000 rows and 4 columns.
NOTE: PROCEDURE SQL used (total processing time): real time 4.37 seconds, cpu time 4.57 seconds

data step merge:
NOTE: There were 1000000 observations read from the data set WORK.A.
NOTE: There were 10405100 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 1000000 observations and 5 variables.
NOTE: DATA statement used (total processing time): real time 3.50 seconds, cpu time 1.51 seconds

hash data step:
NOTE: There were 10405100 observations read from the data set WORK.B.
NOTE: There were 1000000 observations read from the data set WORK.A.
NOTE: The data set WORK.WANT has 1000000 observations and 4 variables.
NOTE: DATA statement used (total processing time): real time 7.11 seconds, cpu time 6.34 seconds
Unsurprisingly, the DATA step MERGE method outperforms the others, IF(!) the data sets are already sorted accordingly.
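As a sanity check on correctness (not part of the benchmark above, which overwrites want each time), one could save each method's output under its own name and compare them with PROC COMPARE. The names want_sql and want_merge below are hypothetical copies of the SQL and MERGE results:

```sas
/* Hedged sketch: assumes the SQL and MERGE outputs were saved as want_sql
   and want_merge, and that the MERGE-only variable b_date is dropped so
   both tables carry the same columns. Both are already sorted by id. */
proc compare base=want_sql compare=want_merge(drop=b_date) criterion=1e-9;
    id id;
run;
```

CRITERION= allows tiny floating-point differences in w and y to pass as equal.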
And I found that additional information is necessary to come up with a "perfect" solution: what if data set b has more than one observation per id and date, or is this impossible because the data is already summarized per day?
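If b can indeed carry several rows per id and date, one hedged option is to collapse b to one row per day first and run the closest-date merge against that summary. The table name b_daily and the choice of the mean as the per-day statistic are assumptions, not from the original thread:

```sas
/* Sketch under the assumption that the mean W per day is an acceptable
   summary; swap mean= for max=, sum=, etc. to match the business rule. */
proc summary data=b nway;
    class id date;
    var w;
    output out=b_daily(drop=_type_ _freq_) mean=w;
run;
```

Any of the three methods above can then use b_daily in place of b.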
Thank you, Sir @Kurt_Bremser, for the extended effort and for offering your time. That's the kind of effort that, in my humble opinion, deserves praise and often goes unnoticed. Kudos!
I have to agree with you that the DATA step will outperform the SQL most of the time. For one, it's a workhorse, and the approach is a one-step peek and pick. On the other hand, as we know, SQL pays a performance penalty for remerging the summary statistics and then subsetting on a condition. There are some exceptions, though: when the query is pushed into the database with SQL pass-through, which is normally the case where I work, the story is very different.
I was just having a phone conversation with Guru Paul D the other day, as we often have our banters, asking for his opinion on what he would deem the right fit. He says there is no right or wrong as long as the business needs are met within the time expected for decision making. It also depends on how much emphasis the company or organization places on due diligence in performance.
At my workplace, at least in credit risk, the focus is mostly on business analysis, and very little emphasis is placed on "high-performance coding" unless some code happens to be so horrendous that it takes forever to complete and becomes a show stopper. Performance is considered a nice-to-have if the delta is only small; if the delta is days, then yes, they would buy all the fancy coding. Otherwise, the attitude is to focus on loss mitigation and risk appetite rather than coding.
Thank you for making the thread interesting. 🙂
Another variant, for fun, since the tables are sorted by id and date:
data want;
    if _n_=1 then do;
        /* obs=0 loads no rows - the hash only serves as a one-slot buffer
           whose structure matches the merged record. Note the closing
           parenthesis on the dataset options, missing in the original post. */
        dcl hash h(dataset:'b(obs=0 rename=(date=_bd))');
        h.definekey('id');
        h.definedata(all:'y');
        h.definedone();
    end;
    _k1=constant('big');
    do until(last.id);
        merge a b(rename=date=_bd);
        by id;
        _k=abs(date-_bd);
        /* remember the record with the smallest date distance for this id */
        if _k<_k1 then do;
            h.replace();
            _k1=_k;
        end;
    end;
    /* retrieve the best record and reset the buffer for the next id */
    _n_=h.find();
    h.clear();
    drop _:;
run;