Ronein
Onyx | Level 15

Hello

I have 2 data sets:

Data set a: columns ID, date, Y

Each ID has only one row in data set a.

Data set b: columns ID, date, W

Each ID has multiple rows (with different values of date and W).

I want to merge data set a with data set b (left join).

Since the dates are not always matched between data set a and data set b, I want to merge by the closest date.

If there are 2 dates in data set b that are equally close to the date in data set a, then the earlier date should be taken.

What is the way to perform such a merge?


data a;
format date date9.;
input id date :date9. y;
cards;
1 22jan2021 10
2 24jan2021 20
3 22jan2021 30
;
run;


data b;
format date date9.;
input id date :date9. w;
cards;
1 21jan2021 11
1 23jan2021 12
1 24jan2021 15
1 18jan2021 13
2 24jan2021 22
2 23jan2021 26
2 21jan2021 24
3 23jan2021 35
3 24jan2021 32
;
run;

proc sort data=a;
by id date;
run;
proc sort data=b;
by id date;
run;

data wanted;
merge a b;
by id; /* pseudocode: want to match within ID on the closest date */
run;
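
For the sample data above, the expected result would be (for ID 1, both 21JAN2021 and 23JAN2021 are one day away from 22JAN2021, so the tie-break rule picks the earlier 21JAN2021):

id  date       y   w
1   22JAN2021  10  11
2   24JAN2021  20  22
3   22JAN2021  30  35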

 

 

 

6 REPLIES
PeterClemmensen
Tourmaline | Level 20

Here is one way. 

 

Let me know if this works for you? 🙂

 

data want(drop = dif d ww);
   if _N_ = 1 then do;
      /* load all of b into a hash object, allowing multiple rows per id */
      dcl hash h(dataset : 'b(rename = (date = d))', multidata : 'Y');
      h.definekey('id');
      h.definedata('d', 'w');
      h.definedone();
   end;

   set a;
   w = .; d = .; dif = 9999999;

   /* walk all b rows for this id and keep the w of the smallest date difference;
      the strict < keeps the first equally close date encountered in b's row order */
   do while (h.do_over(key : id) = 0);
      if abs(date - d) < dif then do;
         ww = w;
         dif = abs(date - d);
      end;
   end;

   w = ww;
run;

 

Result:

 

Obs date       id  y   w 
1   22JAN2021  1   10  11 
2   24JAN2021  2   20  22 
3   22JAN2021  3   30  35 
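
If you also want to keep which date in b was matched, a small extension of the same idea could look like this (just a sketch; the b_date name is my own choice):

data want2(drop = dif d ww dd);
   if _N_ = 1 then do;
      dcl hash h(dataset : 'b(rename = (date = d))', multidata : 'Y');
      h.definekey('id');
      h.definedata('d', 'w');
      h.definedone();
   end;

   set a;
   w = .; d = .; dif = 9999999;

   do while (h.do_over(key : id) = 0);
      if abs(date - d) < dif then do;
         ww = w;
         dd = d;   /* remember the date of the best match as well */
         dif = abs(date - d);
      end;
   end;

   w = ww;
   b_date = dd;
   format b_date date9.;
run;
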
Kurt_Bremser
Super User

Another method, if the dataset is too large to fit in memory:

data a;
input id date :date9. y;
format date date9.;
cards;
1 22jan2021 10
2 24jan2021 20
3 22jan2021 30
;

data b;
input id date :date9. w;
format date date9.;
cards;
1 21jan2021 11
1 23jan2021 12
1 24jan2021 15
1 18jan2021 13
2 24jan2021 22
2 23jan2021 26
2 21jan2021 24
3 23jan2021 35
3 24jan2021 32
;

proc sort data=b;
by id date;
run;

data want;
merge
  a (in=_a)
  b (rename=(date=_date w=_w))
;
by id;
if _a; /* left join: keep only ids present in a */
retain _diff b_date w;
format b_date date9.;
if first.id then _diff = 999999;
_dif = abs(date - _date);
/* strict lt: since b is sorted by date within id, the earlier of equally
   close dates wins the tie */
if _dif lt _diff
then do;
  b_date =_date;
  w = _w;
  _diff = _dif;
end;
if last.id; /* output one row per id, after all its b rows are seen */
drop _date _w _diff _dif;
run;
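
To check that this agrees with the hash approach, a quick sanity check could look like this (assuming the two results were saved as want_hash and want_merge; those names are only for illustration):

proc sort data=want_hash;  by id; run;
proc sort data=want_merge; by id; run;

proc compare base=want_hash compare=want_merge;
  id id;
  var w;
run;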
novinosrin
Tourmaline | Level 20

Hi @Ronein  Not a fancy solution, but we have something like this in our production code that queries customer transactions. A bit too old school, perhaps, but easy:


data a;
format date date9.;
input id date :date9. y;
cards;
1 22jan2021 10
2 24jan2021 20
3 22jan2021 30
;
run;


data b;
format date date9.;
input id date :date9. w;
cards;
1 21jan2021 11
1 23jan2021 12
1 24jan2021 15
1 18jan2021 13
2 24jan2021 22
2 23jan2021 26
2 21jan2021 24
3 23jan2021 35
3 24jan2021 32
;
run;


proc sql;
 create table want(drop=_:) as
 select *
 /* inner query: keep, per id, the b rows whose date is closest to a's date */
 from (select a.*,w,b.date as _bd ,abs(a.date-b.date) as _diff
 from a a left join b b
 on a.id=b.id
 group by a.id
 having min(_diff)=_diff)
 /* outer query: among equally close dates, keep the earlier one */
 group by id, _diff
 having min(_bd)=_bd;
quit;

 

Kurt_Bremser
Super User

Out of curiosity, I did a comparison of the three suggested methods, with some souped-up data:

data a;
do id = 1 to 1000000;
  date = rand('integer','01jan2019'd,'31dec2020'd);
  y = rand('uniform',0,100);
  output;
end;
format date date9.;
run;

data b;
do id = 1 to 1000000;
  do i = 1 to rand('integer',1,20);
    date = rand('integer','01jan2019'd,'31dec2020'd);
    w = rand('uniform',0,100);
    output;
  end;
end;
format date date9.;
drop i;
run;

proc sort data=b nodupkey;
by id date;
run;

proc sql;
 create table want(drop=_:) as
 select *
 from (select a.*,w,b.date as _bd ,abs(a.date-b.date) as _diff
 from a a left join b b
 on a.id=b.id
 group by a.id
 having min(_diff)=_diff)
 group by id, _diff
 having min(_bd)=_bd;
quit;

data want;
merge
  a (in=_a)
  b (rename=(date=_date w=_w))
;
by id;
if _a;
retain _diff b_date w;
format b_date date9.;
if first.id then _diff = 999999;
_dif = abs(date - _date);
if _dif lt _diff
then do;
  b_date =_date;
  w = _w;
  _diff = _dif;
end;
if last.id;
drop _date _w _diff _dif;
run;

data want(drop = dif d ww);
   if _N_ = 1 then do;
      dcl hash h(dataset : 'b(rename = date = d)', multidata : 'Y');
      h.definekey('id');
      h.definedata('d', 'w');
      h.definedone();
   end;

   set a;
   w = .; d = .; dif = 9999999;

   do while (h.do_over(key : id) = 0);
      if abs(date - d) < dif then do;
         ww = w;
         dif = abs(date - d);
      end;
   end;

   w = ww;
run;

This is the log (University Edition on a 2015 MacBook Pro):

 72         
 73         data a;
 74         do id = 1 to 1000000;
 75           date = rand('integer','01jan2019'd,'31dec2020'd);
 76           y = rand('uniform',0,100);
 77           output;
 78         end;
 79         format date date9.;
 80         run;
 
 NOTE: The data set WORK.A has 1000000 observations and 3 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.18 seconds
       cpu time            0.15 seconds
       
 
 81         
 82         data b;
 83         do id = 1 to 1000000;
 84           do i = 1 to rand('integer',1,20);
 85             date = rand('integer','01jan2019'd,'31dec2020'd);
 86             w = rand('uniform',0,100);
 87             output;
 88           end;
 89         end;
 90         format date date9.;
 91         drop i;
 92         run;
 
 NOTE: The data set WORK.B has 10495068 observations and 3 variables.
 NOTE: DATA statement used (Total process time):
       real time           4.47 seconds
       cpu time            1.57 seconds
       
 
 93         
 94         proc sort data=b nodupkey;
 95         by id date;
 96         run;
 
 NOTE: There were 10495068 observations read from the data set WORK.B.
 NOTE: 89968 observations with duplicate key values were deleted.
 NOTE: The data set WORK.B has 10405100 observations and 3 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           6.51 seconds
       cpu time            8.57 seconds
       
 
 97         
 98         proc sql;
 99          create table want(drop=_:) as
 100         select *
 101         from (select a.*,w,b.date as _bd ,abs(a.date-b.date) as _diff
 102         from a a left join b b
 103         on a.id=b.id
 104         group by a.id
 105         having min(_diff)=_diff)
 106         group by id, _diff
 107         having min(_bd)=_bd;
 NOTE: The query requires remerging summary statistics back with the original data.
 NOTE: Table WORK.WANT created, with 1000000 rows and 4 columns.
 
 108        quit;
 NOTE: PROCEDURE SQL used (Total process time):
       real time           4.37 seconds
       cpu time            4.57 seconds
       
 
 109        
 110        data want;
 111        merge
 112          a (in=_a)
 113          b (rename=(date=_date w=_w))
 114        ;
 115        by id;
 116        if _a;
 117        retain _diff b_date w;
 118        format b_date date9.;
 119        if first.id then _diff = 999999;
 120        _dif = abs(date - _date);
 121        if _dif lt _diff
 122        then do;
 123          b_date =_date;
 124          w = _w;
 125          _diff = _dif;
 126        end;
 127        if last.id;
 128        drop _date _w _diff _dif;
 129        run;
 
 NOTE: There were 1000000 observations read from the data set WORK.A.
 NOTE: There were 10405100 observations read from the data set WORK.B.
 NOTE: The data set WORK.WANT has 1000000 observations and 5 variables.
 NOTE: DATA statement used (Total process time):
       real time           3.50 seconds
       cpu time            1.51 seconds
       
 
 130        
 131        data want(drop = dif d ww);
 132           if _N_ = 1 then do;
 133              dcl hash h(dataset : 'b(rename = date = d)', multidata : 'Y');
 134              h.definekey('id');
 135              h.definedata('d', 'w');
 136              h.definedone();
 137           end;
 138        
 139           set a;
 140           w = .; d = .; dif = 9999999;
 141        
 142           do while (h.do_over(key : id) = 0);
 143              if abs(date - d) < dif then do;
 144                 ww = w;
 145                 dif = abs(date - d);
 146              end;
 147           end;
 148        
 149           w = ww;
 150        run;
 
 NOTE: There were 10405100 observations read from the data set WORK.B.
 NOTE: There were 1000000 observations read from the data set WORK.A.
 NOTE: The data set WORK.WANT has 1000000 observations and 4 variables.
 NOTE: DATA statement used (Total process time):
       real time           7.11 seconds
       cpu time            6.34 seconds

Unsurprisingly, the DATA step method outperforms the others, IF(!) the datasets are already sorted accordingly.

 

And I found that additional information is necessary to come up with a "perfect" solution: what if dataset B has more than one observation per id and date, or is that impossible because the data is already summarized per day?
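
A quick way to check for such duplicates, sketched against the sample data posted above:

proc sql;
   /* list any id/date combinations in b that occur more than once */
   select id, date format=date9., count(*) as n
   from b
   group by id, date
   having count(*) > 1;
quit;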

novinosrin
Tourmaline | Level 20

Thank you Sir @Kurt_Bremser for the extended effort and for offering your time. That's the kind of effort that, in my humble opinion, deserves praise and often goes unnoticed. Kudos!

 

I have to agree with you that the DATA step will outperform SQL most of the time. For one, it's a workhorse, and the approach is a one-step peek and pick. SQL, on the other hand, as we know, pays a performance penalty for remerging the summary statistics and then subsetting on a condition. There are some exceptions, though: when the query is pushed into the database with SQL pass-through, which is normally the case here where I work, the story is very different.
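
As a tiny illustration of that remerging penalty, even this simple query against the sample data makes PROC SQL remerge (the log shows the same NOTE about remerging summary statistics back with the original data):

proc sql;
   /* mixing detail columns with MIN() per group forces a remerge */
   select id, date format=date9., w
   from b
   group by id
   having date = min(date);
quit;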

 

I was just having a phone conversation with Guru Paul D the other day, as we often have our banter, asking for his opinion on what he would deem the right fit. He says there is no right or wrong as long as the business needs are met within the expected time for decision making. Also, it depends on how much emphasis the company or organization places on due diligence in performance.

 

At my workplace, at least in credit risk, the strategy is mostly business analysis, and very little emphasis is placed on "high performance coding" unless some code happens to be so horrendous that it takes forever to complete and becomes a show stopper. Performance is considered a nice-to-have if the delta is only small; if the delta is days, then yes, they would buy all the fancy coding. Otherwise, the attitude is to focus on loss mitigation and risk appetite rather than coding.

 

Thank you for making the thread interesting. 🙂

novinosrin
Tourmaline | Level 20

Another variant for fun, since the tables are sorted by ID and date:


data want;
 if _n_=1 then do;
  /* empty hash (obs=0) that will hold the current best match per id */
  dcl hash h(dataset:'b(obs=0 rename=(date=_bd))');
  h.definekey('id');
  h.definedata(all:'y');
  h.definedone();
 end;
 _k1=constant('big');
 /* DoW loop: process one whole id group per data step iteration */
 do until(last.id);
  merge a b(rename=date=_bd);
  by id;
  _k=abs(date-_bd);
  if _k<_k1 then do;
   h.replace();  /* store the closest match so far */
   _k1=_k;
  end;
 end;
 _n_=h.find();   /* retrieve the stored best match into the PDV */
 h.clear();      /* reset the hash for the next id */
 drop _:;
run;
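
For readers new to this pattern: the do until(last.id) "DoW" loop consumes one entire ID group per data step iteration, the hash keeps only the best row seen so far, and h.find() pulls that stored best match back into the PDV before the single implicit output per ID. Since b is sorted by date within ID, the strict < again resolves ties toward the earlier date.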

 

 
