DATA Step, Macro, Functions and more

What would be hash equivalent for this sql join?

Accepted Solution Solved
Reply
PROC Star
Posts: 7,363
Accepted Solution

What would be hash equivalent for this sql join?

The following sql join does what I want, but I'd like to see what the hash equivalent in a data step would look like:

data have1;

  informat dt1 anydtdtm19.;

  format dt1 datetime19.;

  input recnum1 dt1;

  cards;

1 10/01/2012:00:00:10

2 10/01/2012:00:00:30

3 10/01/2012:00:00:50

4 10/01/2012:00:01:10

5 10/01/2012:00:01:30

;

data have2;

  informat dt2 anydtdtm19.;

  format dt2 datetime19.;

  input recnum2 dt2;

  cards;

1 10/01/2012:00:00:15

2 10/01/2012:00:00:16

3 10/01/2012:00:00:35

4 10/01/2012:00:00:37

5 10/01/2012:00:00:55

6 10/01/2012:00:01:18

7 10/01/2012:00:01:45

;

proc sql;

  create table want as

    select *, abs(b.dt2-a.dt1) as diff

      from have1 a, have2 b

          group by a.dt1

            having calculated diff = min(calculated diff)

  ;

quit;


Accepted Solutions
Solution
‎11-30-2012 10:04 PM
Respected Advisor
Posts: 3,124

Re: What would be hash equivalent for this sql join?

Hi Art,

Now I finally ditch Hiter():

data want;

  if _n_=1 then do;

  if 0 then set have2;

declare hash h(dataset:'have2', multidata:'y');

h.definekey('group2');

h.definedata(all:'y');

  h.definedone();

/*  declare hiter hi('h');*/

  end;

  set have1;

  _rc=h.find(key:group1);

  diff=dt2+dt1;;

do _rc=0 by 0 while (_rc=0);

_d=abs(dt2-dt1);

  if _d<diff then do;

  _num=recnum2;

  diff=_d;

_dt=dt2;

end;

_rc=h.find_next();

end;

  recnum2=_num;

dt2=_dt;

drop _:;

run;

Haikuo

View solution in original post


All Replies
Respected Advisor
Posts: 3,124

Re: What would be hash equivalent for this sql join?

Hi Art,

It can  be done, but it won't be pretty:

data want;

  if _n_=1 then do;

  if 0 then set have2;

declare hash h(dataset:'have2');

h.definekey('recnum2');

h.definedata(all:'y');

  h.definedone();

  declare hiter hi('h');

  end;

  set have1;

  _rc=hi.first();

diff=dt2+dt1;;

do _rc=0 by 0 while (_rc=0);

_d=abs(dt2-dt1);

  if _d<diff then do;

  _num=recnum2;

diff=_d;

_dt=dt2;

end;

_rc=hi.next();

end;

  recnum2=_num;

dt2=_dt;

drop _:;

  run;

Haikuo

PROC Star
Posts: 7,363

Re: What would be hash equivalent for this sql join?

I'm running your code now.  I'll let you know if it works.

Respected Advisor
Posts: 4,649

Re: What would be hash equivalent for this sql join?

I'd be curious to know how it compares with (tested with your example data only) :

data want(keep=dt1 recNum1 dtC recNumC);

retain recNumR dtR recNum2 dt2;

format dtC datetime19.;

set have1;

do while (dt2 < dt1 and not have2end);

     set have2 end=have2end;

     if dt2 < dt1 then do;

           recNumR = recNum2;

           dtR = dt2;

           end;

     end;

if dt1 + dt1 - dt2 < dtR then do;

     dtC = dtR;

     recNumC = recNumR;

     end;

else do;

     dtC = dt2;

     recNumC = recNum2;

     end;

run;

PG

PG
PROC Star
Posts: 7,363

Re: What would be hash equivalent for this sql join?

: I'll test it as soon as I get a chance this weekend

: Beauty is in the eye of the beholder!  From two hours (for the proc sql version), to just over 4 minutes for your version, is extremely pretty to me!

An extra credit question for both approaches.  What if I need to expand the task to accommodate multiple groups and need the match to be group specific (i.e., in the sql code had an extra line: on gorup1=group2)?  e.g.:

data have1;

  informat dt1 anydtdtm19.;

  format dt1 datetime19.;

  input group1 recnum1 dt1;

  cards;

1 1 10/01/2012:00:00:10

1 2 10/01/2012:00:00:30

1 3 10/01/2012:00:00:50

1 4 10/01/2012:00:01:10

1 5 10/01/2012:00:01:30

2 1 10/01/2012:00:00:11

2 2 10/01/2012:00:00:31

2 3 10/01/2012:00:00:51

2 4 10/01/2012:00:01:11

2 5 10/01/2012:00:01:31

3 1 10/01/2012:00:00:12

3 2 10/01/2012:00:00:32

3 3 10/01/2012:00:00:52

3 4 10/01/2012:00:01:12

3 5 10/01/2012:00:01:32

;

data have2;

  informat dt2 anydtdtm19.;

  format dt2 datetime19.;

  input group2 recnum2 dt2;

  cards;

1 1 10/01/2012:00:00:15

1 2 10/01/2012:00:00:16

1 3 10/01/2012:00:00:35

1 4 10/01/2012:00:00:37

1 5 10/01/2012:00:00:55

1 6 10/01/2012:00:01:18

1 7 10/01/2012:00:01:45

2 1 10/01/2012:00:00:16

2 2 10/01/2012:00:00:17

2 3 10/01/2012:00:00:36

2 4 10/01/2012:00:00:38

2 5 10/01/2012:00:00:56

2 6 10/01/2012:00:01:19

2 7 10/01/2012:00:01:46

3 1 10/01/2012:00:00:15

3 2 10/01/2012:00:00:16

3 3 10/01/2012:00:00:35

3 4 10/01/2012:00:00:37

3 5 10/01/2012:00:00:55

3 6 10/01/2012:00:01:18

3 7 10/01/2012:00:01:45

;

Super Contributor
Posts: 1,636

Re: What would be hash equivalent for this sql join?

Hi Art,

Would you  please post a simple question so I would be able to help?

PROC Star
Posts: 7,363

Re: What would be hash equivalent for this sql join?

: How often to I EVER post questions other than questions about what someone was asking for?  Bet you could have answered this one if you had tried!

Solution
‎11-30-2012 10:04 PM
Respected Advisor
Posts: 3,124

Re: What would be hash equivalent for this sql join?

Hi Art,

Now I finally ditch Hiter():

data want;

  if _n_=1 then do;

  if 0 then set have2;

declare hash h(dataset:'have2', multidata:'y');

h.definekey('group2');

h.definedata(all:'y');

  h.definedone();

/*  declare hiter hi('h');*/

  end;

  set have1;

  _rc=h.find(key:group1);

  diff=dt2+dt1;;

do _rc=0 by 0 while (_rc=0);

_d=abs(dt2-dt1);

  if _d<diff then do;

  _num=recnum2;

  diff=_d;

_dt=dt2;

end;

_rc=h.find_next();

end;

  recnum2=_num;

dt2=_dt;

drop _:;

run;

Haikuo

PROC Star
Posts: 7,363

Re: What would be hash equivalent for this sql join?

: You outdid yourself.  This one only took 7.5 minutes to run a set of files that were four times larger than the first file.  I'm impressed!

: Amazingly fast, even faster than Haikuo's code, but had an error.  The last record is assigned to the wrong group.  Both records 4 and 5 should have been assigned to have2's record number 6, but only one was.

Respected Advisor
Posts: 4,649

Re: What would be hash equivalent for this sql join?

Hi Art, I havent lost all hope. Getting to the wrong answer the fastest and (I assume) with the smallest memory usage is already something! I think I fixed it :

data want (keep=recNum1 dt1 recNumC dtC);

retain recNumR dtR recNum2 dt2;

format dtC dtR datetime19.;

set have1;

do while (dt2 < dt1 and not have2end);

     if dtR <= dt2 then do;

          recNumR = recNum2;

          dtR = dt2;

          end;

     set have2 end=have2end;

     end;

if dt1 + dt1 - dt2 < dtR then do;

     dtC = dtR;

     recNumC = recNumR;

     end;

else do;

     dtC = dt2;

     recNumC = recNum2;

     end;

run;

PG

PG
PROC Star
Posts: 7,363

Re: What would be hash equivalent for this sql join?

: I'd have given you another helpful mark, but I've already used up my allowable ratings for this question.  I'm leaving Haikuo's response as being the correct one, as the thread did in fact ask for a hash solution.

However, that said, your suggested approach was definitely the best way to go and the one I actually ended up using.  On the actual data, the original proc sql approach was excessive, as it took eight hours and ate up over 500 gigabytes of disk space during the process.

Haikuo's approach was a significant improvement, only taking 7.5 minutes and only using up the disk space needed to accommodate the raw data*2.

I went with a variant of your code, though, and obtained the same results, only taking a total of 4.5 seconds, and only using up the disk space needed to accommodate the raw data*2.  In short, I'm extremely grateful that you suggested the sequential approach, as it was CLEARLY the best way to solve the problem.

Respected Advisor
Posts: 4,649

Re: What would be hash equivalent for this sql join?

Thanks Art. That's an astonishing improvement.  - PG

PG
PROC Star
Posts: 1,093

Re: What would be hash equivalent for this sql join?

Hi,

That's an astonishing piece of code! It's a pattern that I've never seen before, nor thought of, but it's brilliant.

It's interesting that Art's original SQL is so clean, I would have thought this more difficult to express in SQL. But you can certainly see that it needs to go into a full-blown Cartesian to solve it, so I'm not surprised the duration is very long.

The hash solution is much faster, but at the expense of i) using up memory for the entire table, so at some volume the solution will fail; and ii) it still implements a cartesian-style double DOW loop.

In yours, for the simple cost of sorting both datasets, you've turned it into roughly a nobs(a) + nobs(b), instead of nobs(a) * nobs(b). Also, there are no memory implications, so you could scale this till you run out of disk.

Brilliant! My hat's off to you today.

BTW , how many records were in your real have1 and have2, just to get an idea of volumetrics?

Tom

PROC Star
Posts: 7,363

Re: What would be hash equivalent for this sql join?

: The real data had over 100,000 records per file, each file contained over 300 variables, and there were four by groups.  The variant of Pierre's code that I actually implemented sorted the files, did a proc transpose on have2, loaded the have2 datetimes into an array at the start of each by group, and then used Pierre's sequential search to obtain the closest values from the array.

As soon as I saw Pierre's code I realized that a sequential search was all that was needed.  In short, I totally agree with your summation.

Super User
Posts: 9,681

Re: What would be hash equivalent for this sql join?

Arthur,

Acutally you can use a simple array to get it, No reason for Hash Table .

data have1;
  informat dt1 anydtdtm19.;
  format dt1 datetime19.;
  input recnum1 dt1;
  cards;
1 10/01/2012:00:00:10
2 10/01/2012:00:00:30
3 10/01/2012:00:00:50
4 10/01/2012:00:01:10
5 10/01/2012:00:01:30
;
 
data have2;
  informat dt2 anydtdtm19.;
  format dt2 datetime19.;
  input recnum2 dt2;
  cards;
1 10/01/2012:00:00:15
2 10/01/2012:00:00:16
3 10/01/2012:00:00:35
4 10/01/2012:00:00:37
5 10/01/2012:00:00:55
6 10/01/2012:00:01:18
7 10/01/2012:00:01:45
;
data want;
 array a{10000} _temporary_;
 array b{10000} _temporary_;

 do until(last);
  set have2 end=last;
  n+1;
  a{n}=dt2;
  b{n}=recnum2;
 end;

do until(_last);
set have1 end=_last;
temp=99999;
 do i=1 to n;
  if abs(a{i}-dt1) lt temp then do;ii=i;min=abs(a{i}-dt1);end;
  temp=abs(a{i}-dt1);
 end;
dt2=a{ii};
recnum2=b{ii};
 output;
end;
drop i n ii  temp;
run;

Ksharp

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 519 views
  • 6 likes
  • 6 in conversation