Conducting a fuzzy merge on dates while specifying a minimum difference between dates

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

Conducting a fuzzy merge on dates while specifying a minimum difference between dates

I have been using the following syntax to conduct a fuzzy merge on dates between two tables:

 

proc sql;
  create table want as
    select 
      a.*,
      b.start as r_start format=date9.,
      b.end   as r_end format=date9.,
      b.var1,
      b.var2,
      b.var3
    from 
      a left join b
        on a.id=b.id
      group by a.id, a.start
      having min((a.start-b.start)) = ((a.start-b.start))
    ;
quit;

As an additional step, I need to specify that the minimum value of (a.start-b.start) is the minimum value that is >= 0. Essentially what I am trying to do is identify the closest date to a.start in table b, but with the added caveat that b.start has to have occurred before or on the same day as a.start


Accepted Solutions
Solution
4 weeks ago
Esteemed Advisor
Posts: 5,335

Re: Conducting a fuzzy merge on dates while specifying a minimum difference between dates

[ Edited ]

Note that within a a.start group, the value of a.start is constant. So the condition

 

min((a.start-b.start)) = ((a.start-b.start))

 

is equivalent to

 

min(-b.start) = -b.start

 

which is the same as

 

max(b.start) = b.start

 

you could thus use the query

 

proc sql;
  create table want as
    select 
      a.*,
      b.start as r_start format=date9.,
      b.end   as r_end format=date9.,
      b.var1,
      b.var2,
      b.var3
    from 
      a left join b
        on a.id=b.id and a.start >= b.start
      group by a.id, a.start
      having max(b.start) = b.start
    ;
quit;

 

PG

View solution in original post


All Replies
Respected Advisor
Posts: 4,469

Re: Conducting a fuzzy merge on dates while specifying a minimum difference between dates

[ Edited ]

@djohn051

Couldn't you just filter out these cases during the joinl?

on a.id=b.id and (a.start-b.start) >=0

Above will also filter cases where b.start is missing because there was no matching record in table with alias B (so actually: you could also go for an inner join which would perform better). If you want to include non matching records in your final result then modify the join condition to:

on a.id=b.id and (a.start-b.start >0 or b.start is null)

 

 

Regular Contributor
Regular Contributor
Posts: 249

Re: Conducting a fuzzy merge on dates while specifying a minimum difference between dates

[ Edited ]

EDIT:  I JUST SAW THIS HAS BEEN DISCUSSED AND SOLVED IN ANOTHER THREAD.  DON'T DOUBLE POST UP LIKE THAT.  I'LL LEAVE THIS ANSWER IN CASE IT APPLIES TO SOMEONE ELSE'S PROBLEM.

 

 

 

You could do it less elegantly with a subquery.

 

You don't post any sample data and not real good guidance but given data like

 

 

Data aa;
input id one mydate:mmddyy10. something:$6.;
format mydate yymmdd10.;
datalines;
111 100 6/1/2014  yergle
112 200 7/2/2014  omnipt
113 500 8/4/2014  weezle
;
run;


Data bb;
input id one mydate:mmddyy10. var1:$3. var2:$3. var3:$3.;
format mydate yymmdd10.;
datalines;
111 100 7/1/2014  AAA bbb ccc
111 200 5/30/2014  bbb ccc ddd
111 500 5/15/2014  ddd eee fff
112 100 7/3/2014  AAA xxx sss
112 200 7/4/2014  AAA zzz qqq
112 500 7/5/2014  ABC vvv bbb
113 100 6/1/2014  AAA nnn mmm
113 200 6/2/2014  AAA eee ttt
113 500 6/4/2014  ABC fff ggg
;
run;

if I understand correctly you would want data from BB records with id of 111 and var values of bbb ccc ddd and 113 ABC fff ggg

 

A subquery could find the closest record and then you can join back.

 

proc sql;
	select bb.*, aa.something
	from (
		select bb.id, max(bb.mydate) as myccdate
		from bb 
		inner join aa on bb.id = aa.id
		where bb.mydate <= aa.mydate
		group by bb.id
		) as cc
	left join bb on bb.id = cc.id and bb.mydate = cc.myccdate
	left join aa on bb.id = aa.id;
quit;

  This produces

 

 

                                        The SAS System     
                         id       one      mydate  var1  var2  var3  something
                   
                        111       200  2014-05-30  bbb   ccc   ddd   yergle
                        113       500  2014-06-04  ABC   fff   ggg   weezle


 

 

 

Maybe this is completely off base.  Maybe it will be useful.   

Solution
4 weeks ago
Esteemed Advisor
Posts: 5,335

Re: Conducting a fuzzy merge on dates while specifying a minimum difference between dates

[ Edited ]

Note that within a a.start group, the value of a.start is constant. So the condition

 

min((a.start-b.start)) = ((a.start-b.start))

 

is equivalent to

 

min(-b.start) = -b.start

 

which is the same as

 

max(b.start) = b.start

 

you could thus use the query

 

proc sql;
  create table want as
    select 
      a.*,
      b.start as r_start format=date9.,
      b.end   as r_end format=date9.,
      b.var1,
      b.var2,
      b.var3
    from 
      a left join b
        on a.id=b.id and a.start >= b.start
      group by a.id, a.start
      having max(b.start) = b.start
    ;
quit;

 

PG
Occasional Contributor
Posts: 7

Re: Conducting a fuzzy merge on dates while specifying a minimum difference between dates

This worked perfectly. I was too focused on trying to integrate everything on the having line that I didn't think to remove all of the cases where time b > time a during the actual join. 

 

Many thanks!

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 121 views
  • 0 likes
  • 4 in conversation