Re: Merge datasets with fuzzy matches

TingSern · Posted 04-11-2017 02:23 AM

Hi,

Supposed I have two SAS datasets - call these JOBS and DATA. They have two common variables TIME and NAME.

I sort JOBS and DATA by TIME and then NAME.

So, I code -

Data Combined;

Merge JOBS DATA;

by TIME NAME;

Run;

This assumes that the TIME matches exactly for both JOBS and DATA.

My problem is that TIME for both JOBS and DATA are not "EQUAL" - but can be off by up to 10 seconds, let's say.

There is another variable in both datasets (NAME) - that can be used to match as well.

For example - in JOBS -

TIME NAME PET

120005 ABC CAT

120007 ABD DOG

120403 ABC COW

In DATA -

TIME NAME ADDRESS

120003 ABC London

120008 ABD Paris

120400 ABC Warsaw

Combined output -

TIME NAME PET ADDRESS

120005 ABC CAT London

120007 ABD DOG Paris

120403 ABC COW Warsaw

Is there a way to do this?

Thank you.

PeterClemmensen · Posted 04-11-2017 02:51 AM

Many ways to do this but here is one 🙂

data JOBS;
input TIME NAME $ PET $;
datalines;
120005 ABC CAT
120007 ABD DOG
120403 ABC COW
;

data DATA;
input TIME NAME $ ADRESS $;
datalines;
120003 ABC London
120008 ABD Paris
120400 ABC Warsaw
;


proc sort data = JOBS;by TIME NAME;run;
proc sort data = DATA;by TIME NAME;run;

data want;
	format TIME NAME PET ADRESS;
	merge JOBS(rename=(TIME=JOBS_TIME NAME = JOBS_NAME)) 
		  DATA(rename=(TIME=DATA_TIME));

	if (JOBS_TIME - DATA_TIME) <= 10 then do;
		TIME = JOBS_TIME;
		if NAME = JOBS_NAME then output;
	end;
	else if (DATA_TIME - JOBS_TIME) <= 10 then do;
		TIME = DATA_TIME;
		if NAME = JOBS_NAME then output;
	end;
	
	drop DATA_TIME JOBS_TIME JOBS_NAME;
run;

The DATA to DATA Step Macro
Blog: SASnrd

TingSern · Posted 04-11-2017 03:08 AM

Thank you, very much.

I will test it out.

TS

PeterClemmensen · Posted 04-11-2017 03:09 AM

Just edited the code a bit 🙂 Test that instead

The DATA to DATA Step Macro
Blog: SASnrd

RW9 · Posted 04-11-2017 05:23 AM

The question here really is what is going to be your fixed logic. For instance if you have several rows which are all within 10 seconds of multiple timepoints, what then, is it a many to many merge it that case, i.e. you clause would be:

from HAVE1 A

full join HAVE2 B

on A.NAME=B.NAME

and A.TIME-10 <= B.TIME <= A.TIME+10

However if you want 1 row per item, tehn it becomes more difficult, would you take the first time occuring within that group in the window, or the last/mean etc.In your example, 120005 is within 10 seconds of both 120003 and 120008. You seem to take the lowest, but do confirm. If so, then maybe something like:

proc sql;
  create table WANT as
  select A.*,
         B.ADDRESS
  from   HAVE1 A
  left join (select NAME,ADDRESS from (select * from HAVE2 where TIME-10 <= A.TIME <= TIME+10) group by NAME,ADDRESS having TIME=min(TIME)) B
  on     A.NAME=B.NAME;
quit;

Also, Post test data in the form of a datastep!

Ksharp · Posted 04-11-2017 09:00 AM


data JOBS;
input TIME NAME $ PET $;
datalines;
120005 ABC CAT
120007 ABD DOG
120403 ABC COW
;

data DATA;
input TIME NAME $ ADRESS $;
datalines;
120003 ABC London
120008 ABD Paris
120400 ABC Warsaw
;

proc sql;
select a.*,b.adress
 from jobs as a left jion data as b
  on a.name=b.name and b.time between a.time-10 and a.time+10;
quit;

Merge datasets with fuzzy matches