DATA Step, Macro, Functions and more

Merge datasets with fuzzy matches

Reply
Occasional Contributor
Posts: 6

Merge datasets with fuzzy matches

Hi,

 

Supposed I have two SAS datasets - call these JOBS and DATA.  They have two common variables TIME and NAME.

 

I sort JOBS and DATA by TIME and then NAME.

 

So, I code -

 

Data Combined;

  Merge JOBS DATA;

  by TIME NAME;

Run;

 

This assumes that the TIME matches exactly for both JOBS and DATA.

 

My problem is that TIME for both JOBS and DATA are not "EQUAL" - but can be off by up to 10 seconds, let's say.

 

There is another variable in both datasets (NAME) - that can be used to match as well. 

 

For example - in JOBS -

TIME       NAME    PET

120005   ABC       CAT

120007   ABD       DOG

120403   ABC       COW

 

In DATA -

TIME        NAME     ADDRESS

120003    ABC        London

120008    ABD        Paris

120400    ABC        Warsaw

 

Combined output -

 

TIME        NAME      PET           ADDRESS

120005    ABC         CAT           London

120007    ABD         DOG          Paris

120403    ABC         COW          Warsaw

 

Is there a way to do this? 

 

Thank you.

PROC Star
Posts: 733

Re: Merge datasets with fuzzy matches

[ Edited ]

Many ways to do this but here is one Smiley Happy

 

data JOBS;
input TIME NAME $ PET $;
datalines;
120005 ABC CAT
120007 ABD DOG
120403 ABC COW
;

data DATA;
input TIME NAME $ ADRESS $;
datalines;
120003 ABC London
120008 ABD Paris
120400 ABC Warsaw
;


proc sort data = JOBS;by TIME NAME;run;
proc sort data = DATA;by TIME NAME;run;

data want;
	format TIME NAME PET ADRESS;
	merge JOBS(rename=(TIME=JOBS_TIME NAME = JOBS_NAME)) 
		  DATA(rename=(TIME=DATA_TIME));

	if (JOBS_TIME - DATA_TIME) <= 10 then do;
		TIME = JOBS_TIME;
		if NAME = JOBS_NAME then output;
	end;
	else if (DATA_TIME - JOBS_TIME) <= 10 then do;
		TIME = DATA_TIME;
		if NAME = JOBS_NAME then output;
	end;
	
	drop DATA_TIME JOBS_TIME JOBS_NAME;
run;
Occasional Contributor
Posts: 6

Re: Merge datasets with fuzzy matches

Thank you, very much. 

 

I will test it out. 

 

TS

PROC Star
Posts: 733

Re: Merge datasets with fuzzy matches

Just edited the code a bit Smiley Happy Test that instead

Super User
Super User
Posts: 7,942

Re: Merge datasets with fuzzy matches

The question here really is what is going to be your fixed logic.  For instance if you have several rows which are all within 10 seconds of multiple timepoints, what then, is it a many to many merge it that case, i.e. you clause would be:

from HAVE1 A

full join HAVE2 B

on    A.NAME=B.NAME

and  A.TIME-10 <= B.TIME <= A.TIME+10

 

However if you want 1 row per item, tehn it becomes more difficult, would you take the first time occuring within that group in the window, or the last/mean etc.In your example, 120005 is within 10 seconds of both 120003 and 120008.  You seem to take the lowest, but do confirm.  If so, then maybe something like:

proc sql;
  create table WANT as
  select A.*,
         B.ADDRESS
  from   HAVE1 A
  left join (select NAME,ADDRESS from (select * from HAVE2 where TIME-10 <= A.TIME <= TIME+10) group by NAME,ADDRESS having TIME=min(TIME)) B
  on     A.NAME=B.NAME;
quit;
           

Also, Post test data in the form of a datastep!

Super User
Posts: 10,020

Re: Merge datasets with fuzzy matches


data JOBS;
input TIME NAME $ PET $;
datalines;
120005 ABC CAT
120007 ABD DOG
120403 ABC COW
;

data DATA;
input TIME NAME $ ADRESS $;
datalines;
120003 ABC London
120008 ABD Paris
120400 ABC Warsaw
;

proc sql;
select a.*,b.adress
 from jobs as a left jion data as b
  on a.name=b.name and b.time between a.time-10 and a.time+10;
quit;
Ask a Question
Discussion stats
  • 5 replies
  • 149 views
  • 0 likes
  • 4 in conversation