04-11-2017 02:23 AM
Supposed I have two SAS datasets - call these JOBS and DATA. They have two common variables TIME and NAME.
I sort JOBS and DATA by TIME and then NAME.
So, I code -
Merge JOBS DATA;
by TIME NAME;
This assumes that the TIME matches exactly for both JOBS and DATA.
My problem is that TIME for both JOBS and DATA are not "EQUAL" - but can be off by up to 10 seconds, let's say.
There is another variable in both datasets (NAME) - that can be used to match as well.
For example - in JOBS -
TIME NAME PET
120005 ABC CAT
120007 ABD DOG
120403 ABC COW
In DATA -
TIME NAME ADDRESS
120003 ABC London
120008 ABD Paris
120400 ABC Warsaw
Combined output -
TIME NAME PET ADDRESS
120005 ABC CAT London
120007 ABD DOG Paris
120403 ABC COW Warsaw
Is there a way to do this?
04-11-2017 02:51 AM - edited 04-11-2017 03:09 AM
Many ways to do this but here is one
data JOBS; input TIME NAME $ PET $; datalines; 120005 ABC CAT 120007 ABD DOG 120403 ABC COW ; data DATA; input TIME NAME $ ADRESS $; datalines; 120003 ABC London 120008 ABD Paris 120400 ABC Warsaw ; proc sort data = JOBS;by TIME NAME;run; proc sort data = DATA;by TIME NAME;run; data want; format TIME NAME PET ADRESS; merge JOBS(rename=(TIME=JOBS_TIME NAME = JOBS_NAME)) DATA(rename=(TIME=DATA_TIME)); if (JOBS_TIME - DATA_TIME) <= 10 then do; TIME = JOBS_TIME; if NAME = JOBS_NAME then output; end; else if (DATA_TIME - JOBS_TIME) <= 10 then do; TIME = DATA_TIME; if NAME = JOBS_NAME then output; end; drop DATA_TIME JOBS_TIME JOBS_NAME; run;
04-11-2017 05:23 AM
The question here really is what is going to be your fixed logic. For instance if you have several rows which are all within 10 seconds of multiple timepoints, what then, is it a many to many merge it that case, i.e. you clause would be:
from HAVE1 A
full join HAVE2 B
and A.TIME-10 <= B.TIME <= A.TIME+10
However if you want 1 row per item, tehn it becomes more difficult, would you take the first time occuring within that group in the window, or the last/mean etc.In your example, 120005 is within 10 seconds of both 120003 and 120008. You seem to take the lowest, but do confirm. If so, then maybe something like:
proc sql; create table WANT as select A.*, B.ADDRESS from HAVE1 A left join (select NAME,ADDRESS from (select * from HAVE2 where TIME-10 <= A.TIME <= TIME+10) group by NAME,ADDRESS having TIME=min(TIME)) B on A.NAME=B.NAME; quit;
Also, Post test data in the form of a datastep!
04-11-2017 09:00 AM
data JOBS; input TIME NAME $ PET $; datalines; 120005 ABC CAT 120007 ABD DOG 120403 ABC COW ; data DATA; input TIME NAME $ ADRESS $; datalines; 120003 ABC London 120008 ABD Paris 120400 ABC Warsaw ; proc sql; select a.*,b.adress from jobs as a left jion data as b on a.name=b.name and b.time between a.time-10 and a.time+10; quit;