art297
Opal | Level 21

The following sql join does what I want, but I'd like to see what the hash equivalent in a data step would look like:

data have1;
  informat dt1 anydtdtm19.;
  format dt1 datetime19.;
  input recnum1 dt1;
  cards;
1 10/01/2012:00:00:10
2 10/01/2012:00:00:30
3 10/01/2012:00:00:50
4 10/01/2012:00:01:10
5 10/01/2012:00:01:30
;

data have2;
  informat dt2 anydtdtm19.;
  format dt2 datetime19.;
  input recnum2 dt2;
  cards;
1 10/01/2012:00:00:15
2 10/01/2012:00:00:16
3 10/01/2012:00:00:35
4 10/01/2012:00:00:37
5 10/01/2012:00:00:55
6 10/01/2012:00:01:18
7 10/01/2012:00:01:45
;

proc sql;
  create table want as
    select *, abs(b.dt2-a.dt1) as diff
      from have1 a, have2 b
        group by a.dt1
          having calculated diff = min(calculated diff)
  ;
quit;
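For reference, on this sample data the query should pair have1 record 1 with have2 record 1 (diff 5 seconds), record 2 with record 3 (5), record 3 with record 5 (5), record 4 with record 6 (8), and record 5 with record 6 (12).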

Haikuo
Onyx | Level 15

Hi Art,

It can be done, but it won't be pretty:

data want;
  if _n_=1 then do;
    if 0 then set have2;            /* define have2's variables in the PDV */
    declare hash h(dataset:'have2');
    h.definekey('recnum2');
    h.definedata(all:'y');
    h.definedone();
    declare hiter hi('h');          /* iterator to walk the whole table */
  end;
  set have1;
  _rc=hi.first();
  diff=dt2+dt1;                     /* initialize to a value larger than any possible difference */
  do _rc=0 by 0 while (_rc=0);      /* loop over every have2 record */
    _d=abs(dt2-dt1);
    if _d<diff then do;             /* keep the closest match seen so far */
      _num=recnum2;
      diff=_d;
      _dt=dt2;
    end;
    _rc=hi.next();
  end;
  recnum2=_num;
  dt2=_dt;
  drop _:;
run;

Haikuo

art297
Opal | Level 21

I'm running your code now.  I'll let you know if it works.

PGStats
Opal | Level 21

I'd be curious to know how it compares with this (tested with your example data only):

data want(keep=dt1 recNum1 dtC recNumC);
retain recNumR dtR recNum2 dt2;
format dtC datetime19.;
set have1;
/* advance through have2 until dt2 reaches or passes dt1,
   remembering the last dt2 that was still below dt1 */
do while (dt2 < dt1 and not have2end);
     set have2 end=have2end;
     if dt2 < dt1 then do;
           recNumR = recNum2;
           dtR = dt2;
           end;
     end;
/* pick whichever neighbour (below or above) is closer to dt1 */
if dt1 + dt1 - dt2 < dtR then do;
     dtC = dtR;
     recNumC = recNumR;
     end;
else do;
     dtC = dt2;
     recNumC = recNum2;
     end;
run;

PG

art297
Opal | Level 21

: I'll test it as soon as I get a chance this weekend.

: Beauty is in the eye of the beholder!  Going from two hours (for the proc sql version) to just over 4 minutes for your version is extremely pretty to me!

An extra-credit question for both approaches: what if I need to expand the task to accommodate multiple groups, and need the match to be group-specific (i.e., the sql code would have an extra line: on group1=group2)?  e.g. (a rough SQL sketch follows the sample data):

data have1;
  informat dt1 anydtdtm19.;
  format dt1 datetime19.;
  input group1 recnum1 dt1;
  cards;
1 1 10/01/2012:00:00:10
1 2 10/01/2012:00:00:30
1 3 10/01/2012:00:00:50
1 4 10/01/2012:00:01:10
1 5 10/01/2012:00:01:30
2 1 10/01/2012:00:00:11
2 2 10/01/2012:00:00:31
2 3 10/01/2012:00:00:51
2 4 10/01/2012:00:01:11
2 5 10/01/2012:00:01:31
3 1 10/01/2012:00:00:12
3 2 10/01/2012:00:00:32
3 3 10/01/2012:00:00:52
3 4 10/01/2012:00:01:12
3 5 10/01/2012:00:01:32
;

data have2;
  informat dt2 anydtdtm19.;
  format dt2 datetime19.;
  input group2 recnum2 dt2;
  cards;
1 1 10/01/2012:00:00:15
1 2 10/01/2012:00:00:16
1 3 10/01/2012:00:00:35
1 4 10/01/2012:00:00:37
1 5 10/01/2012:00:00:55
1 6 10/01/2012:00:01:18
1 7 10/01/2012:00:01:45
2 1 10/01/2012:00:00:16
2 2 10/01/2012:00:00:17
2 3 10/01/2012:00:00:36
2 4 10/01/2012:00:00:38
2 5 10/01/2012:00:00:56
2 6 10/01/2012:00:01:19
2 7 10/01/2012:00:01:46
3 1 10/01/2012:00:00:15
3 2 10/01/2012:00:00:16
3 3 10/01/2012:00:00:35
3 4 10/01/2012:00:00:37
3 5 10/01/2012:00:00:55
3 6 10/01/2012:00:01:18
3 7 10/01/2012:00:01:45
;
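Roughly, the group-specific SQL I'm describing would look something like this (untested sketch; presumably the GROUP BY also needs group1 added so the minimum is taken within each group):

proc sql;
  create table want as
    select *, abs(b.dt2-a.dt1) as diff
      from have1 a inner join have2 b
        on a.group1=b.group2
      group by a.group1, a.dt1
        having calculated diff = min(calculated diff)
  ;
quit;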

Linlin
Lapis Lazuli | Level 10

Hi Art,

Would you please post a simple question so I would be able to help?

art297
Opal | Level 21

: How often do I EVER post questions other than questions about what someone was asking for?  Bet you could have answered this one if you had tried!

Haikuo
Onyx | Level 15 (Accepted Solution)

Hi Art,

Now I finally ditch hiter():

data want;
  if _n_=1 then do;
    if 0 then set have2;                            /* define have2's variables in the PDV */
    declare hash h(dataset:'have2', multidata:'y'); /* allow multiple records per key */
    h.definekey('group2');
    h.definedata(all:'y');
    h.definedone();
/*  declare hiter hi('h');*/
  end;
  set have1;
  _rc=h.find(key:group1);           /* position on the first record of the matching group */
  diff=dt2+dt1;                     /* initialize to a value larger than any possible difference */
  do _rc=0 by 0 while (_rc=0);      /* walk the matching group's records */
    _d=abs(dt2-dt1);
    if _d<diff then do;             /* keep the closest match seen so far */
      _num=recnum2;
      diff=_d;
      _dt=dt2;
    end;
    _rc=h.find_next();
  end;
  recnum2=_num;
  dt2=_dt;
  drop _:;
run;

Haikuo

art297
Opal | Level 21

: You outdid yourself.  This one only took 7.5 minutes to run on a set of files that were four times larger than the first file.  I'm impressed!

: Amazingly fast, even faster than Haikuo's code, but it had an error.  The last record is assigned to the wrong group.  Both records 4 and 5 should have been assigned to have2's record number 6, but only one was.

PGStats
Opal | Level 21

Hi Art, I haven't lost all hope. Getting to the wrong answer the fastest and (I assume) with the smallest memory usage is already something! I think I fixed it:

data want (keep=recNum1 dt1 recNumC dtC);
retain recNumR dtR recNum2 dt2;
format dtC dtR datetime19.;
set have1;
/* advance through have2 until dt2 reaches or passes dt1,
   keeping the current record as the candidate from below before reading the next one */
do while (dt2 < dt1 and not have2end);
     if dtR <= dt2 then do;
          recNumR = recNum2;
          dtR = dt2;
          end;
     set have2 end=have2end;
     end;
/* pick whichever neighbour (below or above) is closer to dt1 */
if dt1 + dt1 - dt2 < dtR then do;
     dtC = dtR;
     recNumC = recNumR;
     end;
else do;
     dtC = dt2;
     recNumC = recNum2;
     end;
run;

PG

art297
Opal | Level 21

: I'd have given you another helpful mark, but I've already used up my allowable ratings for this question.  I'm leaving Haikuo's response marked as the correct one, as the thread did in fact ask for a hash solution.

That said, your suggested approach was definitely the best way to go and the one I actually ended up using.  On the actual data, the original proc sql approach was excessive: it took eight hours and ate up over 500 gigabytes of disk space along the way.

Haikuo's approach was a significant improvement, taking only 7.5 minutes and using only the disk space needed to hold twice the raw data.

The variant of your code that I went with obtained the same results in a total of 4.5 seconds, again using only the disk space needed to hold twice the raw data.  In short, I'm extremely grateful that you suggested the sequential approach, as it was CLEARLY the best way to solve the problem.

PGStats
Opal | Level 21

Thanks Art. That's an astonishing improvement.  - PG

TomKari
Onyx | Level 15

Hi,

That's an astonishing piece of code! It's a pattern that I've never seen before, nor thought of, but it's brilliant.

It's interesting that Art's original SQL is so clean; I would have thought this more difficult to express in SQL. But you can certainly see that it needs a full-blown Cartesian product to solve, so I'm not surprised the run time is very long.

The hash solution is much faster, but at the expense of i) holding the entire table in memory, so at some volume the solution will fail, and ii) still implementing a Cartesian-style double DOW loop.

In yours, for the simple cost of sorting both datasets, you've turned it into roughly nobs(a) + nobs(b) reads instead of nobs(a) * nobs(b) comparisons (for two 100,000-row files, that's on the order of 200,000 sequential reads rather than 10 billion pair comparisons). Also, there are no memory implications, so you could scale this until you run out of disk.

Brilliant! My hat's off to you today.

BTW, how many records were in your real have1 and have2, just to get an idea of volumetrics?

Tom

art297
Opal | Level 21

: The real data had over 100,000 records per file, each file contained over 300 variables, and there were four BY groups.  The variant of Pierre's code that I actually implemented sorted the files, did a proc transpose on have2, loaded the have2 datetimes into an array at the start of each BY group, and then used Pierre's sequential search to obtain the closest values from the array (a rough sketch of that approach follows below).

As soon as I saw Pierre's code I realized that a sequential search was all that was needed.  In short, I totally agree with your summation.
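Not the actual production code, but an untested sketch of that approach (the dataset name have2_t and the c1-cN columns are illustrative; for simplicity it returns only the closest dt2 rather than recnum2, and it uses a plain closest-value scan of the array instead of Pierre's early-exit search):

proc sort data=have1; by group1 dt1; run;
proc sort data=have2; by group2 dt2; run;

/* one row per group, with that group's dt2 values as columns c1-cN */
proc transpose data=have2 out=have2_t(drop=_name_) prefix=c;
  by group2;
  var dt2;
run;

data want;
  merge have1 have2_t(rename=(group2=group1));
  by group1;
  array c{*} c:;                    /* the transposed dt2 values for the current group */
  diff=constant('big');             /* 'closest so far' distance */
  do i=1 to dim(c) while (not missing(c{i}));
    if abs(c{i}-dt1) < diff then do;
      diff=abs(c{i}-dt1);
      dt2=c{i};
    end;
  end;
  format dt2 datetime19.;
  drop i c:;
run;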

Ksharp
Super User

Arthur,

Actually, you can use a simple array to get it. No need for a hash table.

data have1;
  informat dt1 anydtdtm19.;
  format dt1 datetime19.;
  input recnum1 dt1;
  cards;
1 10/01/2012:00:00:10
2 10/01/2012:00:00:30
3 10/01/2012:00:00:50
4 10/01/2012:00:01:10
5 10/01/2012:00:01:30
;
 
data have2;
  informat dt2 anydtdtm19.;
  format dt2 datetime19.;
  input recnum2 dt2;
  cards;
1 10/01/2012:00:00:15
2 10/01/2012:00:00:16
3 10/01/2012:00:00:35
4 10/01/2012:00:00:37
5 10/01/2012:00:00:55
6 10/01/2012:00:01:18
7 10/01/2012:00:01:45
;
data want;
 array a{10000} _temporary_;        /* dt2 values from have2 */
 array b{10000} _temporary_;        /* recnum2 values from have2 */

 /* load all of have2 into the temporary arrays */
 do until(last);
  set have2 end=last;
  n+1;
  a{n}=dt2;
  b{n}=recnum2;
 end;

 /* for each have1 record, scan the arrays for the closest dt2 */
 do until(_last);
  set have1 end=_last;
  temp=constant('big');             /* 'closest so far' distance */
  do i=1 to n;
   if abs(a{i}-dt1) lt temp then do;
    ii=i;
    temp=abs(a{i}-dt1);
   end;
  end;
  dt2=a{ii};
  recnum2=b{ii};
  output;
 end;
 drop i n ii temp;
run;

Ksharp
