The following sql join does what I want, but I'd like to see what the hash equivalent in a data step would look like:
data have1;
informat dt1 anydtdtm19.;
format dt1 datetime19.;
input recnum1 dt1;
cards;
1 10/01/2012:00:00:10
2 10/01/2012:00:00:30
3 10/01/2012:00:00:50
4 10/01/2012:00:01:10
5 10/01/2012:00:01:30
;
data have2;
informat dt2 anydtdtm19.;
format dt2 datetime19.;
input recnum2 dt2;
cards;
1 10/01/2012:00:00:15
2 10/01/2012:00:00:16
3 10/01/2012:00:00:35
4 10/01/2012:00:00:37
5 10/01/2012:00:00:55
6 10/01/2012:00:01:18
7 10/01/2012:00:01:45
;
proc sql;
create table want as
select *, abs(b.dt2-a.dt1) as diff
from have1 a, have2 b
group by a.dt1
having calculated diff = min(calculated diff)
;
quit;
Hi Art,
Now I finally ditch Hiter():
data want;
if _n_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2', multidata:'y');
h.definekey('group2');
h.definedata(all:'y');
h.definedone();
/* declare hiter hi('h');*/
end;
set have1;
_rc=h.find(key:group1);
diff=dt2+dt1;;
do _rc=0 by 0 while (_rc=0);
_d=abs(dt2-dt1);
if _d<diff then do;
_num=recnum2;
diff=_d;
_dt=dt2;
end;
_rc=h.find_next();
end;
recnum2=_num;
dt2=_dt;
drop _:;
run;
Haikuo
Hi Art,
It can be done, but it won't be pretty:
data want;
if _n_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2');
h.definekey('recnum2');
h.definedata(all:'y');
h.definedone();
declare hiter hi('h');
end;
set have1;
_rc=hi.first();
diff=dt2+dt1;;
do _rc=0 by 0 while (_rc=0);
_d=abs(dt2-dt1);
if _d<diff then do;
_num=recnum2;
diff=_d;
_dt=dt2;
end;
_rc=hi.next();
end;
recnum2=_num;
dt2=_dt;
drop _:;
run;
Haikuo
I'm running your code now. I'll let you know if it works.
I'd be curious to know how it compares with (tested with your example data only) :
data want(keep=dt1 recNum1 dtC recNumC);
retain recNumR dtR recNum2 dt2;
format dtC datetime19.;
set have1;
do while (dt2 < dt1 and not have2end);
set have2 end=have2end;
if dt2 < dt1 then do;
recNumR = recNum2;
dtR = dt2;
end;
end;
if dt1 + dt1 - dt2 < dtR then do;
dtC = dtR;
recNumC = recNumR;
end;
else do;
dtC = dt2;
recNumC = recNum2;
end;
run;
PG
: I'll test it as soon as I get a chance this weekend
: Beauty is in the eye of the beholder! From two hours (for the proc sql version), to just over 4 minutes for your version, is extremely pretty to me!
An extra credit question for both approaches. What if I need to expand the task to accommodate multiple groups and need the match to be group specific (i.e., in the sql code had an extra line: on gorup1=group2)? e.g.:
data have1;
informat dt1 anydtdtm19.;
format dt1 datetime19.;
input group1 recnum1 dt1;
cards;
1 1 10/01/2012:00:00:10
1 2 10/01/2012:00:00:30
1 3 10/01/2012:00:00:50
1 4 10/01/2012:00:01:10
1 5 10/01/2012:00:01:30
2 1 10/01/2012:00:00:11
2 2 10/01/2012:00:00:31
2 3 10/01/2012:00:00:51
2 4 10/01/2012:00:01:11
2 5 10/01/2012:00:01:31
3 1 10/01/2012:00:00:12
3 2 10/01/2012:00:00:32
3 3 10/01/2012:00:00:52
3 4 10/01/2012:00:01:12
3 5 10/01/2012:00:01:32
;
data have2;
informat dt2 anydtdtm19.;
format dt2 datetime19.;
input group2 recnum2 dt2;
cards;
1 1 10/01/2012:00:00:15
1 2 10/01/2012:00:00:16
1 3 10/01/2012:00:00:35
1 4 10/01/2012:00:00:37
1 5 10/01/2012:00:00:55
1 6 10/01/2012:00:01:18
1 7 10/01/2012:00:01:45
2 1 10/01/2012:00:00:16
2 2 10/01/2012:00:00:17
2 3 10/01/2012:00:00:36
2 4 10/01/2012:00:00:38
2 5 10/01/2012:00:00:56
2 6 10/01/2012:00:01:19
2 7 10/01/2012:00:01:46
3 1 10/01/2012:00:00:15
3 2 10/01/2012:00:00:16
3 3 10/01/2012:00:00:35
3 4 10/01/2012:00:00:37
3 5 10/01/2012:00:00:55
3 6 10/01/2012:00:01:18
3 7 10/01/2012:00:01:45
;
Hi Art,
Would you please post a simple question so I would be able to help?
Hi Art,
Now I finally ditch Hiter():
data want;
if _n_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2', multidata:'y');
h.definekey('group2');
h.definedata(all:'y');
h.definedone();
/* declare hiter hi('h');*/
end;
set have1;
_rc=h.find(key:group1);
diff=dt2+dt1;;
do _rc=0 by 0 while (_rc=0);
_d=abs(dt2-dt1);
if _d<diff then do;
_num=recnum2;
diff=_d;
_dt=dt2;
end;
_rc=h.find_next();
end;
recnum2=_num;
dt2=_dt;
drop _:;
run;
Haikuo
: You outdid yourself. This one only took 7.5 minutes to run a set of files that were four times larger than the first file. I'm impressed!
: Amazingly fast, even faster than Haikuo's code, but had an error. The last record is assigned to the wrong group. Both records 4 and 5 should have been assigned to have2's record number 6, but only one was.
Hi Art, I havent lost all hope. Getting to the wrong answer the fastest and (I assume) with the smallest memory usage is already something! I think I fixed it :
data want (keep=recNum1 dt1 recNumC dtC);
retain recNumR dtR recNum2 dt2;
format dtC dtR datetime19.;
set have1;
do while (dt2 < dt1 and not have2end);
if dtR <= dt2 then do;
recNumR = recNum2;
dtR = dt2;
end;
set have2 end=have2end;
end;
if dt1 + dt1 - dt2 < dtR then do;
dtC = dtR;
recNumC = recNumR;
end;
else do;
dtC = dt2;
recNumC = recNum2;
end;
run;
PG
: I'd have given you another helpful mark, but I've already used up my allowable ratings for this question. I'm leaving Haikuo's response as being the correct one, as the thread did in fact ask for a hash solution.
However, that said, your suggested approach was definitely the best way to go and the one I actually ended up using. On the actual data, the original proc sql approach was excessive, as it took eight hours and ate up over 500 gigabytes of disk space during the process.
Haikuo's approach was a significant improvement, only taking 7.5 minutes and only using up the disk space needed to accommodate the raw data*2.
I went with a variant of your code, though, and obtained the same results, only taking a total of 4.5 seconds, and only using up the disk space needed to accommodate the raw data*2. In short, I'm extremely grateful that you suggested the sequential approach, as it was CLEARLY the best way to solve the problem.
Thanks Art. That's an astonishing improvement. - PG
That's an astonishing piece of code! It's a pattern that I've never seen before, nor thought of, but it's brilliant.
It's interesting that Art's original SQL is so clean, I would have thought this more difficult to express in SQL. But you can certainly see that it needs to go into a full-blown Cartesian to solve it, so I'm not surprised the duration is very long.
The hash solution is much faster, but at the expense of i) using up memory for the entire table, so at some volume the solution will fail; and ii) it still implements a cartesian-style double DOW loop.
In yours, for the simple cost of sorting both datasets, you've turned it into roughly a nobs(a) + nobs(b), instead of nobs(a) * nobs(b). Also, there are no memory implications, so you could scale this till you run out of disk.
Brilliant! My hat's off to you today.
BTW , how many records were in your real have1 and have2, just to get an idea of volumetrics?
Tom
: The real data had over 100,000 records per file, each file contained over 300 variables, and there were four by groups. The variant of Pierre's code that I actually implemented sorted the files, did a proc transpose on have2, loaded the have2 datetimes into an array at the start of each by group, and then used Pierre's sequential search to obtain the closest values from the array.
As soon as I saw Pierre's code I realized that a sequential search was all that was needed. In short, I totally agree with your summation.
Arthur,
Acutally you can use a simple array to get it, No reason for Hash Table .
data have1; informat dt1 anydtdtm19.; format dt1 datetime19.; input recnum1 dt1; cards; 1 10/01/2012:00:00:10 2 10/01/2012:00:00:30 3 10/01/2012:00:00:50 4 10/01/2012:00:01:10 5 10/01/2012:00:01:30 ; data have2; informat dt2 anydtdtm19.; format dt2 datetime19.; input recnum2 dt2; cards; 1 10/01/2012:00:00:15 2 10/01/2012:00:00:16 3 10/01/2012:00:00:35 4 10/01/2012:00:00:37 5 10/01/2012:00:00:55 6 10/01/2012:00:01:18 7 10/01/2012:00:01:45 ; data want; array a{10000} _temporary_; array b{10000} _temporary_; do until(last); set have2 end=last; n+1; a{n}=dt2; b{n}=recnum2; end; do until(_last); set have1 end=_last; temp=99999; do i=1 to n; if abs(a{i}-dt1) lt temp then do;ii=i;min=abs(a{i}-dt1);end; temp=abs(a{i}-dt1); end; dt2=a{ii}; recnum2=b{ii}; output; end; drop i n ii temp; run;
Ksharp
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.