SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Merging two databases based on fuzzy dates (similar dates)

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

Merging two databases based on fuzzy dates (similar dates)

I will try to explain this the best that I can.

 

I currently have two databases: Database A and Database B - Both of these databases have the same people in them (id numbers), but the amount of observations is different (longfile). Database A features dates (start and end date) that I would like to keep. These dates are for a timeframe that was worked. Database B features some variables with details about the work (Var_1 Var_2 Var_3) and also features dates (start and end date), but these dates were tentative and the ones in Database A are the ones I want to keep. 

 

What I would like to do, is for each observation in Database A, try to identify the case from Database B with the most similar start and end date (for the same ID number) and then merge the line together, taking the variables about work details from Database B and the start and end date variables from Database A. Database A is to be used as the spine so for example:

 

Database A:

  • ID1 01/01/2000 31/01/2000
  • ID1 25/02/2001 20/10/2001

Database B:

  • ID1 03/01/2000 01/02/2000 Var_1 Var_2 Var_3
  • ID1 25/04/2000 28/05/2000 Var_1 Var_2 Var_3
  • ID1 25/07/2000 28/08/2000 Var_1 Var_2 Var_3
  • ID1 20/02/2001 17/10/2001 Var_1 Var_2 Var_3

Merged Database:

  • Rows 1 and 4 have the same ID# as in Database A and have the most similar dates. Dates from Database A are kept and var_1-var_3 are added. The dataset should look as follows:
    • ID1 01/01/2000 31/01/2000 Var_1 Var_2 Var_3
    • ID1 25/02/2001 20/10/2001 Var_1 Var_2 Var_3

Accepted Solutions
Solution
‎10-19-2017 09:56 AM
Respected Advisor
Posts: 4,674

Re: Merging two databases based on fuzzy dates (similar dates)

[ Edited ]

@djohn051

Great that you've provided sample data but please provide such sample data in the future via a data step as done below so we don't have to do this for you.

data a;
  input id $ start :ddmmyy10. end :ddmmyy10.;
  format start end date9.;
datalines;
ID1 01/01/2000 31/01/2000
ID1 25/02/2001 20/10/2001
;
run;

data b;
  input id $ start :ddmmyy10. end :ddmmyy10. (var1 var2 var3) ($);
  format start end date9.;
datalines;
ID1 03/01/2000 01/02/2000 Var_1 Var_2 Var_3
ID1 25/04/2000 28/05/2000 Var_1 Var_2 Var_3
ID1 25/07/2000 28/08/2000 Var_1 Var_2 Var_3
ID1 20/02/2001 17/10/2001 Var_1 Var_2 Var_3
;
run;

As for terminology: These are tables not databases. A database is the container for objects like tables so it's a level higher up.

 

As for your question code as below could do:

proc sql;
  create table want as
    select 
      a.*,
      b.start as r_start format=date9.,
      b.end   as r_end format=date9.,
      b.var1,
      b.var2,
      b.var3
    from 
      a left join b
        on a.id=b.id
      group by a.id, a.start
      having min(abs(a.start-b.start)+abs(a.end-b.end)) = (abs(a.start-b.start)+abs(a.end-b.end))
    ;
quit;

Please note: It is possible that more than one row from table B meets the criterion in the having clause so you might have to come up with additional filtering criteria to always get to a single row match to table B.

View solution in original post


All Replies
Solution
‎10-19-2017 09:56 AM
Respected Advisor
Posts: 4,674

Re: Merging two databases based on fuzzy dates (similar dates)

[ Edited ]

@djohn051

Great that you've provided sample data but please provide such sample data in the future via a data step as done below so we don't have to do this for you.

data a;
  input id $ start :ddmmyy10. end :ddmmyy10.;
  format start end date9.;
datalines;
ID1 01/01/2000 31/01/2000
ID1 25/02/2001 20/10/2001
;
run;

data b;
  input id $ start :ddmmyy10. end :ddmmyy10. (var1 var2 var3) ($);
  format start end date9.;
datalines;
ID1 03/01/2000 01/02/2000 Var_1 Var_2 Var_3
ID1 25/04/2000 28/05/2000 Var_1 Var_2 Var_3
ID1 25/07/2000 28/08/2000 Var_1 Var_2 Var_3
ID1 20/02/2001 17/10/2001 Var_1 Var_2 Var_3
;
run;

As for terminology: These are tables not databases. A database is the container for objects like tables so it's a level higher up.

 

As for your question code as below could do:

proc sql;
  create table want as
    select 
      a.*,
      b.start as r_start format=date9.,
      b.end   as r_end format=date9.,
      b.var1,
      b.var2,
      b.var3
    from 
      a left join b
        on a.id=b.id
      group by a.id, a.start
      having min(abs(a.start-b.start)+abs(a.end-b.end)) = (abs(a.start-b.start)+abs(a.end-b.end))
    ;
quit;

Please note: It is possible that more than one row from table B meets the criterion in the having clause so you might have to come up with additional filtering criteria to always get to a single row match to table B.

Esteemed Advisor
Posts: 5,482

Re: Merging two databases based on fuzzy dates (similar dates)

Added note:

 

@Patrick's minimum distance criteria is equivalent to:


min(sumabs(a.start-b.start, a.end-b.end)) = sumabs(a.start-b.start, a.end-b.end)

 

but it could also be:


min(euclid(a.start-b.start, a.end-b.end)) = euclid(a.start-b.start, a.end-b.end)

 

both of which try to match start and end times. You could also try to match the time interval centers with:

min(abs(a.start + a.end - b.start - b.end)) = abs(a.start + a.end - b.start - b.end)

PG
Occasional Contributor
Posts: 7

Re: Merging two databases based on fuzzy dates (similar dates)

First, thank you for your help. For the most part, this has been working, with the exception of what you had mentioned, when two observations (with the same SN) appear in table a, but only once in table b, they will both be included in table want, but both will correspond to the same dates and variables as the one case from table b

 

 

One additional question, in the following line:

 

      having min(abs(a.start-b.start)+abs(a.end-b.end)) = (abs(a.start-b.start)+abs(a.end-b.end))

 what is the purpose of the syntax after the equals (=) sign?

 

Many thanks!

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 327 views
  • 3 likes
  • 3 in conversation