Solved: Re: What would be hash equivalent for this sql join?

art297 · Posted 11-30-2012 04:45 PM

The following sql join does what I want, but I'd like to see what the hash equivalent in a data step would look like:

data have1;

informat dt1 anydtdtm19.;

format dt1 datetime19.;

input recnum1 dt1;

cards;

1 10/01/2012:00:00:10

2 10/01/2012:00:00:30

3 10/01/2012:00:00:50

4 10/01/2012:00:01:10

5 10/01/2012:00:01:30

;

data have2;

informat dt2 anydtdtm19.;

format dt2 datetime19.;

input recnum2 dt2;

cards;

1 10/01/2012:00:00:15

2 10/01/2012:00:00:16

3 10/01/2012:00:00:35

4 10/01/2012:00:00:37

5 10/01/2012:00:00:55

6 10/01/2012:00:01:18

7 10/01/2012:00:01:45

;

proc sql;

create table want as

select *, abs(b.dt2-a.dt1) as diff

from have1 a, have2 b

group by a.dt1

having calculated diff = min(calculated diff)

;

quit;

Haikuo · Posted 11-30-2012 10:04 PM

Hi Art,

Now I finally ditch Hiter():

data want;

if _n_=1 then do;

if 0 then set have2;

declare hash h(dataset:'have2', multidata:'y');

h.definekey('group2');

h.definedata(all:'y');

h.definedone();

/* declare hiter hi('h');*/

end;

set have1;

_rc=h.find(key:group1);

diff=dt2+dt1;;

do _rc=0 by 0 while (_rc=0);

_d=abs(dt2-dt1);

if _d<diff then do;

_num=recnum2;

diff=_d;

_dt=dt2;

end;

_rc=h.find_next();

end;

recnum2=_num;

dt2=_dt;

drop _:;

run;

Haikuo

View solution in original post

Haikuo · Posted 11-30-2012 06:06 PM

Hi Art,

It can be done, but it won't be pretty:

data want;

if _n_=1 then do;

if 0 then set have2;

declare hash h(dataset:'have2');

h.definekey('recnum2');

h.definedata(all:'y');

h.definedone();

declare hiter hi('h');

end;

set have1;

_rc=hi.first();

diff=dt2+dt1;;

do _rc=0 by 0 while (_rc=0);

_d=abs(dt2-dt1);

if _d<diff then do;

_num=recnum2;

diff=_d;

_dt=dt2;

end;

_rc=hi.next();

end;

recnum2=_num;

dt2=_dt;

drop _:;

run;

Haikuo

art297 · Posted 11-30-2012 08:01 PM

I'm running your code now. I'll let you know if it works.

PGStats · Posted 11-30-2012 08:11 PM

I'd be curious to know how it compares with (tested with your example data only) :

data want(keep=dt1 recNum1 dtC recNumC);

retain recNumR dtR recNum2 dt2;

format dtC datetime19.;

set have1;

do while (dt2 < dt1 and not have2end);

set have2 end=have2end;

if dt2 < dt1 then do;

recNumR = recNum2;

dtR = dt2;

end;

if dt1 + dt1 - dt2 < dtR then do;

dtC = dtR;

recNumC = recNumR;

end;

else do;

dtC = dt2;

recNumC = recNum2;

end;

run;

PG

art297 · Posted 11-30-2012 09:29 PM

: I'll test it as soon as I get a chance this weekend

: Beauty is in the eye of the beholder! From two hours (for the proc sql version), to just over 4 minutes for your version, is extremely pretty to me!

An extra credit question for both approaches. What if I need to expand the task to accommodate multiple groups and need the match to be group specific (i.e., in the sql code had an extra line: on gorup1=group2)? e.g.:

data have1;

informat dt1 anydtdtm19.;

format dt1 datetime19.;

input group1 recnum1 dt1;

cards;

1 1 10/01/2012:00:00:10

1 2 10/01/2012:00:00:30

1 3 10/01/2012:00:00:50

1 4 10/01/2012:00:01:10

1 5 10/01/2012:00:01:30

2 1 10/01/2012:00:00:11

2 2 10/01/2012:00:00:31

2 3 10/01/2012:00:00:51

2 4 10/01/2012:00:01:11

2 5 10/01/2012:00:01:31

3 1 10/01/2012:00:00:12

3 2 10/01/2012:00:00:32

3 3 10/01/2012:00:00:52

3 4 10/01/2012:00:01:12

3 5 10/01/2012:00:01:32

;

data have2;

informat dt2 anydtdtm19.;

format dt2 datetime19.;

input group2 recnum2 dt2;

cards;

1 1 10/01/2012:00:00:15

1 2 10/01/2012:00:00:16

1 3 10/01/2012:00:00:35

1 4 10/01/2012:00:00:37

1 5 10/01/2012:00:00:55

1 6 10/01/2012:00:01:18

1 7 10/01/2012:00:01:45

2 1 10/01/2012:00:00:16

2 2 10/01/2012:00:00:17

2 3 10/01/2012:00:00:36

2 4 10/01/2012:00:00:38

2 5 10/01/2012:00:00:56

2 6 10/01/2012:00:01:19

2 7 10/01/2012:00:01:46

3 1 10/01/2012:00:00:15

3 2 10/01/2012:00:00:16

3 3 10/01/2012:00:00:35

3 4 10/01/2012:00:00:37

3 5 10/01/2012:00:00:55

3 6 10/01/2012:00:01:18

3 7 10/01/2012:00:01:45

;

Linlin · Posted 11-30-2012 09:45 PM

Hi Art,

Would you please post a simple question so I would be able to help?

art297 · Posted 11-30-2012 10:18 PM

: How often to I EVER post questions other than questions about what someone was asking for? Bet you could have answered this one if you had tried!

Haikuo · Posted 11-30-2012 10:04 PM

Hi Art,

Now I finally ditch Hiter():

data want;

if _n_=1 then do;

if 0 then set have2;

declare hash h(dataset:'have2', multidata:'y');

h.definekey('group2');

h.definedata(all:'y');

h.definedone();

/* declare hiter hi('h');*/

end;

set have1;

_rc=h.find(key:group1);

diff=dt2+dt1;;

do _rc=0 by 0 while (_rc=0);

_d=abs(dt2-dt1);

if _d<diff then do;

_num=recnum2;

diff=_d;

_dt=dt2;

end;

_rc=h.find_next();

end;

recnum2=_num;

dt2=_dt;

drop _:;

run;

Haikuo

art297 · Posted 12-01-2012 01:13 AM

: You outdid yourself. This one only took 7.5 minutes to run a set of files that were four times larger than the first file. I'm impressed!

: Amazingly fast, even faster than Haikuo's code, but had an error. The last record is assigned to the wrong group. Both records 4 and 5 should have been assigned to have2's record number 6, but only one was.

PGStats · Posted 12-01-2012 03:58 PM

Hi Art, I havent lost all hope. Getting to the wrong answer the fastest and (I assume) with the smallest memory usage is already something! I think I fixed it :

data want (keep=recNum1 dt1 recNumC dtC);

retain recNumR dtR recNum2 dt2;

format dtC dtR datetime19.;

set have1;

do while (dt2 < dt1 and not have2end);

if dtR <= dt2 then do;

recNumR = recNum2;

dtR = dt2;

end;

set have2 end=have2end;

end;

if dt1 + dt1 - dt2 < dtR then do;

dtC = dtR;

recNumC = recNumR;

end;

else do;

dtC = dt2;

recNumC = recNum2;

end;

run;

PG

art297 · Posted 12-02-2012 10:26 AM

: I'd have given you another helpful mark, but I've already used up my allowable ratings for this question. I'm leaving Haikuo's response as being the correct one, as the thread did in fact ask for a hash solution.

However, that said, your suggested approach was definitely the best way to go and the one I actually ended up using. On the actual data, the original proc sql approach was excessive, as it took eight hours and ate up over 500 gigabytes of disk space during the process.

Haikuo's approach was a significant improvement, only taking 7.5 minutes and only using up the disk space needed to accommodate the raw data*2.

I went with a variant of your code, though, and obtained the same results, only taking a total of 4.5 seconds, and only using up the disk space needed to accommodate the raw data*2. In short, I'm extremely grateful that you suggested the sequential approach, as it was CLEARLY the best way to solve the problem.

PGStats · Posted 12-02-2012 11:05 AM

Thanks Art. That's an astonishing improvement. - PG

PG

TomKari · Posted 12-02-2012 11:52 AM

Hi,

That's an astonishing piece of code! It's a pattern that I've never seen before, nor thought of, but it's brilliant.

It's interesting that Art's original SQL is so clean, I would have thought this more difficult to express in SQL. But you can certainly see that it needs to go into a full-blown Cartesian to solve it, so I'm not surprised the duration is very long.

The hash solution is much faster, but at the expense of i) using up memory for the entire table, so at some volume the solution will fail; and ii) it still implements a cartesian-style double DOW loop.

In yours, for the simple cost of sorting both datasets, you've turned it into roughly a nobs(a) + nobs(b), instead of nobs(a) * nobs(b). Also, there are no memory implications, so you could scale this till you run out of disk.

Brilliant! My hat's off to you today.

BTW , how many records were in your real have1 and have2, just to get an idea of volumetrics?

Tom

art297 · Posted 12-02-2012 12:06 PM

: The real data had over 100,000 records per file, each file contained over 300 variables, and there were four by groups. The variant of Pierre's code that I actually implemented sorted the files, did a proc transpose on have2, loaded the have2 datetimes into an array at the start of each by group, and then used Pierre's sequential search to obtain the closest values from the array.

As soon as I saw Pierre's code I realized that a sequential search was all that was needed. In short, I totally agree with your summation.

Ksharp · Posted 12-03-2012 01:22 AM

Arthur,

Acutally you can use a simple array to get it, No reason for Hash Table .

data have1;
  informat dt1 anydtdtm19.;
  format dt1 datetime19.;
  input recnum1 dt1;
  cards;
1 10/01/2012:00:00:10
2 10/01/2012:00:00:30
3 10/01/2012:00:00:50
4 10/01/2012:00:01:10
5 10/01/2012:00:01:30
;
 
data have2;
  informat dt2 anydtdtm19.;
  format dt2 datetime19.;
  input recnum2 dt2;
  cards;
1 10/01/2012:00:00:15
2 10/01/2012:00:00:16
3 10/01/2012:00:00:35
4 10/01/2012:00:00:37
5 10/01/2012:00:00:55
6 10/01/2012:00:01:18
7 10/01/2012:00:01:45
;
data want;
 array a{10000} _temporary_;
 array b{10000} _temporary_;

 do until(last);
  set have2 end=last;
  n+1;
  a{n}=dt2;
  b{n}=recnum2;
 end;

do until(_last);
set have1 end=_last;
temp=99999;
 do i=1 to n;
  if abs(a{i}-dt1) lt temp then do;ii=i;min=abs(a{i}-dt1);end;
  temp=abs(a{i}-dt1);
 end;
dt2=a{ii};
recnum2=b{ii};
 output;
end;
drop i n ii  temp;
run;

Ksharp

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away