topic Re: Merging two databases based on fuzzy dates (similar dates) in SAS Data Management

Merging two databases based on fuzzy dates (similar dates)

djohn051 — Sat, 14 Oct 2017 03:29:02 GMT

I will try to explain this the best that I can.

I currently have two databases: Database A and Database B - Both of these databases have the same people in them (id numbers), but the amount of observations is different (longfile). Database A features dates (start and end date) that I would like to keep. These dates are for a timeframe that was worked. Database B features some variables with details about the work (Var_1 Var_2 Var_3) and also features dates (start and end date), but these dates were tentative and the ones in Database A are the ones I want to keep.

What I would like to do, is for each observation in Database A, try to identify the case from Database B with the most similar start and end date (for the same ID number) and then merge the line together, taking the variables about work details from Database B and the start and end date variables from Database A. Database A is to be used as the spine so for example:

Database A:

ID1 01/01/2000 31/01/2000
ID1 25/02/2001 20/10/2001

Database B:

ID1 03/01/2000 01/02/2000 Var_1 Var_2 Var_3
ID1 25/04/2000 28/05/2000 Var_1 Var_2 Var_3
ID1 25/07/2000 28/08/2000 Var_1 Var_2 Var_3
ID1 20/02/2001 17/10/2001 Var_1 Var_2 Var_3

Merged Database:

Rows 1 and 4 have the same ID# as in Database A and have the most similar dates. Dates from Database A are kept and var_1-var_3 are added. The dataset should look as follows:
- ID1 01/01/2000 31/01/2000 Var_1 Var_2 Var_3
- ID1 25/02/2001 20/10/2001 Var_1 Var_2 Var_3

Re: Merging two databases based on fuzzy dates (similar dates)

Patrick — Sun, 15 Oct 2017 00:37:16 GMT

@djohn051

Great that you've provided sample data but please provide such sample data in the future via a data step as done below so we don't have to do this for you.

data a;
  input id $ start :ddmmyy10. end :ddmmyy10.;
  format start end date9.;
datalines;
ID1 01/01/2000 31/01/2000
ID1 25/02/2001 20/10/2001
;
run;

data b;
  input id $ start :ddmmyy10. end :ddmmyy10. (var1 var2 var3) ($);
  format start end date9.;
datalines;
ID1 03/01/2000 01/02/2000 Var_1 Var_2 Var_3
ID1 25/04/2000 28/05/2000 Var_1 Var_2 Var_3
ID1 25/07/2000 28/08/2000 Var_1 Var_2 Var_3
ID1 20/02/2001 17/10/2001 Var_1 Var_2 Var_3
;
run;

As for terminology: These are tables not databases. A database is the container for objects like tables so it's a level higher up.

As for your question code as below could do:

proc sql;
  create table want as
    select 
      a.*,
      b.start as r_start format=date9.,
      b.end   as r_end format=date9.,
      b.var1,
      b.var2,
      b.var3
    from 
      a left join b
        on a.id=b.id
      group by a.id, a.start
      having min(abs(a.start-b.start)+abs(a.end-b.end)) = (abs(a.start-b.start)+abs(a.end-b.end))
    ;
quit;

Please note: It is possible that more than one row from table B meets the criterion in the having clause so you might have to come up with additional filtering criteria to always get to a single row match to table B.

Re: Merging two databases based on fuzzy dates (similar dates)

PGStats — Sun, 15 Oct 2017 05:09:56 GMT

Added note:

@Patrick's minimum distance criteria is equivalent to:

min(sumabs(a.start-b.start, a.end-b.end)) = sumabs(a.start-b.start, a.end-b.end)

but it could also be:

min(euclid(a.start-b.start, a.end-b.end)) = euclid(a.start-b.start, a.end-b.end)

both of which try to match start and end times. You could also try to match the time interval centers with:

min(abs(a.start + a.end - b.start - b.end)) = abs(a.start + a.end - b.start - b.end)

Re: Merging two databases based on fuzzy dates (similar dates)

djohn051 — Thu, 19 Oct 2017 17:53:32 GMT

First, thank you for your help. For the most part, this has been working, with the exception of what you had mentioned, when two observations (with the same SN) appear in table a, but only once in table b, they will both be included in table want, but both will correspond to the same dates and variables as the one case from table b

One additional question, in the following line:

      having min(abs(a.start-b.start)+abs(a.end-b.end)) = (abs(a.start-b.start)+abs(a.end-b.end))

what is the purpose of the syntax after the equals (=) sign?

Many thanks!