Solved: Re: Cartesian Product using the HASH

venkibhu14 · Posted 11-05-2021 12:18 PM

Hi All,

Can someone provide me a code for a cartesian product using HASH!

Thanks,

Venkat.

sbxkoenk · Posted 11-05-2021 12:23 PM

Hello,

See my solution to this question :

Macro is taking a long time to run - Risk set sampling
Posted 03-24-2021 11:44 AM
https://communities.sas.com/t5/SAS-Programming/Macro-is-taking-a-long-time-to-run-Risk-set-sampling/...

I'm doing a Cartesian product over there using a hash table (instead of the 'traditional' SQL-way to do the same).

Cheers,
Koen

View solution in original post

sbxkoenk · Posted 11-05-2021 12:23 PM

Hello,

See my solution to this question :

Macro is taking a long time to run - Risk set sampling
Posted 03-24-2021 11:44 AM
https://communities.sas.com/t5/SAS-Programming/Macro-is-taking-a-long-time-to-run-Risk-set-sampling/...

I'm doing a Cartesian product over there using a hash table (instead of the 'traditional' SQL-way to do the same).

Cheers,
Koen

venkibhu14 · Posted 11-05-2021 01:12 PM

Hi Koen,

Thanks a lot for a quick reply. Its working...:)

yabwon · Posted 11-05-2021 02:12 PM

if you need cartesian product in a datastep you can do it without using hash:

data x;
input x;
cards;
1
2
3
;
run;

data y;
input y;
cards;
10
20
30
;
run;

data x_times_y;
  set x;
  do point= 1 to nobs;
    set y nobs=nobs point=point;
    output;
  end;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

PeterClemmensen · Posted 11-05-2021 02:43 PM

Another Hash Approach

data x;
input x;
cards;
1
2
3
;
run;

data y;
input y;
cards;
10
20
30
;
run;

data want;
   if _N_ = 1 then do;
      dcl hash h(dataset : "y");
      h.definekey("y");
      h.definedata(all : "Y");
      h.definedone();
      dcl hiter i("h");
   end;
   
   set x;
   y = .;
   
   do while (i.next() = 0);
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

yabwon · Posted 11-05-2021 03:21 PM

why not just:

   if _N_ = 1 then do;
      if 0 then set y;
      dcl hash h(dataset : "y");
      h.defineKey(all : "Y");
      h.defineDone();
      dcl hiter i("h");
   end;

?

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

PeterClemmensen · Posted 11-05-2021 03:30 PM

@yabwon Just figured if y had more variables. Obviously you'd have to prepare the PDV 🙂

Hope everything is well.

The DATA to DATA Step Macro
Blog: SASnrd

yabwon · Posted 11-05-2021 05:04 PM

One is the PDV, but the second is that you don't need defineData() and save some memory.

B-)

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ChrisNZ · Posted 11-05-2021 05:15 PM

You don't save memory. If definedata() is not used, the key list is used to fill the data portion. One of the many ways SAS hash tables waste memory.

High-Performance SAS Coding - Third Edition

yabwon · Posted 11-06-2021 03:26 AM

Hi Chris (@ChrisNZ),

Yes, you are absolutely right. Sorry for my mistake.

I read about this memory "issue" in Paul Dorfman's(@hashman) and Don Henderson's (@DonH) book but yesterday, when I wrote that one, I "mixed directions" in my head and didn't do the test to confirm (I did test below so others could see it too).

Bottom lines:

1) Peter's solution is more optimal (@PeterClemmensen),

2) if you need a hash table only for .check()-ing and have a long key then put one singe variable as data portion, e.g. _N_ to save some memory (shorter than 8 bytes won't help [last two examples below]).

Bart

[edit] P.S. It seems to be a good one for a SAS Ballot Idea: "optimise hash table memory usage".

1    options msglevel=I fullstimer;
2    data test;
3      do x = 1 to 1e7;
4        y = x;
5        z = x;
6        t = "a";
7        output;
8      end;
9    run;

NOTE: The data set WORK.TEST has 10000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.38 seconds
      user cpu time       0.23 seconds
      system cpu time     0.15 seconds
      memory              394.50k
      OS Memory           12532.00k

10
11
12   data _null_;
13     if 0 then set test;
14     dcl hash h(dataset : "test");
15     h.defineKey(all : "Y"); /* no data portion */
16     h.defineDone();
17     dcl hiter i("h");
18
19     stop;
20   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           4.23 seconds
      user cpu time       3.70 seconds
      system cpu time     0.51 seconds
      memory              1180490.53k
      OS Memory           1191940.00k

21
22   data _null_;
23     if 0 then set test;
24     dcl hash h(dataset : "test");
25     h.defineKey("x");
26     h.defineData(all : "Y");
27     h.defineDone();
28     dcl hiter i("h");
29
30     stop;
31   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.86 seconds
      user cpu time       3.43 seconds
      system cpu time     0.43 seconds
      memory              983883.81k
      OS Memory           995396.00k

32
33
34   data _null_;
35     if 0 then set test;
36     dcl hash h(dataset : "test");
37     h.defineKey(all : "Y");
38     h.defineData("z"); /* z is 8 bytes */
39     h.defineDone();
40
41     stop;
42   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.90 seconds
      user cpu time       3.51 seconds
      system cpu time     0.39 seconds
      memory              983857.75k
      OS Memory           995396.00k

43
44   data _null_;
45     if 0 then set test;
46     dcl hash h(dataset : "test");
47     h.defineKey(all : "Y");
48     h.defineData("t"); /* t is 1 byte */
49     h.defineDone();
50
51     stop;
52   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.79 seconds
      user cpu time       3.45 seconds
      system cpu time     0.34 seconds
      memory              983857.75k
      OS Memory           995396.00k

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ChrisNZ · Posted 11-06-2021 03:40 AM

Nice! Have you compared to proc sql?

High-Performance SAS Coding - Third Edition

yabwon · Posted 11-06-2021 03:55 AM

Do you mean:

select x.*,y.* from x,y;

vs. data step with hash, vs. data step with point?

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

yabwon · Posted 11-06-2021 05:43 AM

Tests for small tables (10K obs,) are below. I don't have space for bigger sets on my laptop.

In terms of "real time" it is: 1) SQL (even done by SQXJSL = step loop join) , 2) hash, 3) point (even when whole table is in ram).

In terms of "memory" it is: 1) point, 2) hash, 3) SQL.

In terms of "OS memory" everything was less than 30MB so...

Bart

1    options msglevel=I fullstimer;
2    data x;
3      do x = 1 to 1e4;
4        x1 = x;
5        x2 = x;
6        x3 = "a";
7        output;
8      end;
9    run;

NOTE: The data set WORK.X has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              395.03k
      OS Memory           24060.00k

10
11   data y;
12     do y = 1 to 1e4;
13       y1 = y;
14       y2 = y;
15       y3 = "b";
16       output;
17     end;
18   run;

NOTE: The data set WORK.Y has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              395.03k
      OS Memory           24060.00k

19
20   proc sql;
21   create table test1 as
22   select x.*, y.* from x,y
23   ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns.

24   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           8.60 seconds
      user cpu time       6.51 seconds
      system cpu time     1.82 seconds
      memory              5633.68k
      OS Memory           29184.00k

25   data test2;
26     set x;
27     do point= 1 to nobs;
28       set y nobs=nobs point=point;
29       output;
30     end;
31   run;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST2 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           22.25 seconds
      user cpu time       15.43 seconds
      system cpu time     6.62 seconds
      memory              930.18k
      OS Memory           24572.00k

32   data test3;
33      if _N_ = 1 then do;
34         if 0 then set y;
35         dcl hash h(dataset : "y");
36         h.definekey("y");
37         h.definedata(all : "Y");
38         h.definedone();
39         dcl hiter i("h");
40      end;
41
42      set x;
43
44      do while (i.next() = 0);
45         output;
46      end;
47   run;

NOTE: There were 10000 observations read from the data set WORK.Y.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST3 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           13.32 seconds
      user cpu time       11.21 seconds
      system cpu time     2.03 seconds
      memory              2446.78k
      OS Memory           26068.00k

48   sasfile y load;
NOTE: The file WORK.Y.DATA has been loaded into memory by the SASFILE statement.
49   data test4;
50     set x;
51     do point= 1 to nobs;
52       set y nobs=nobs point=point;
53       output;
54     end;
55   run;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST4 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           22.38 seconds
      user cpu time       16.56 seconds
      system cpu time     5.81 seconds
      memory              713.75k
      OS Memory           25408.00k

56   sasfile y close;
NOTE: The file WORK.Y.DATA has been closed by the SASFILE statement.

SQL "under the hood":

1    proc sql feedback _tree _method;
2    create table test1 as
3    select x.*, y.* from x,y
4    ;
NOTE: Statement transforms to:

        select X.x, X.x1, X.x2, X.x3, Y.y, Y.y1, Y.y2, Y.y3
          from WORK.X, WORK.Y;

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjsl
              sqxsrc( WORK.X )
              sqxsrc( WORK.Y )

Tree as planned.
                               /-SYM-V-(x.x:1 flag=00000001)
                     /-OBJ----|
                    |         |--SYM-V-(x.x1:2 flag=00000001)
                    |         |--SYM-V-(x.x2:3 flag=00000001)
                    |         |--SYM-V-(x.x3:4 flag=00000001)
                    |         |--SYM-V-(y.y:1 flag=00000001)
                    |         |--SYM-V-(y.y1:2 flag=00000001)
                    |         |--SYM-V-(y.y2:3 flag=00000001)
                    |          \-SYM-V-(y.y3:4 flag=00000001)
           /-JOIN---|
          |         |                              /-SYM-V-(x.x:1 flag=00000001)
          |         |                    /-OBJ----|
          |         |                   |         |--SYM-V-(x.x1:2 flag=00000001)
          |         |                   |         |--SYM-V-(x.x2:3 flag=00000001)
          |         |                   |          \-SYM-V-(x.x3:4 flag=00000001)
          |         |          /-SRC----|
          |         |         |          \-TABL[WORK].x opt=''
          |          \-FROM---|
          |                   |                    /-SYM-V-(y.y:1 flag=00000001)
          |                   |          /-OBJ----|
          |                   |         |         |--SYM-V-(y.y1:2 flag=00000001)
          |                   |         |         |--SYM-V-(y.y2:3 flag=00000001)
          |                   |         |          \-SYM-V-(y.y3:4 flag=00000001)
          |                    \-SRC----|
          |                              \-TABL[WORK].y opt=''
 --SSEL---|


NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns.

5    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           8.52 seconds
      user cpu time       6.43 seconds
      system cpu time     1.87 seconds
      memory              5647.87k
      OS Memory           30464.00k

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ChrisNZ · Posted 11-06-2021 08:59 PM

Very comprehensive benchmark, thank you. I'd have been disappointed if SQL was slower.

High-Performance SAS Coding - Third Edition

hashman · Posted 11-06-2021 11:21 PM

Hey @yabwon / @ChrisNZ / @DonH:

Forming a CP is a matter of reading X one record at a time and pairing it with each record from Y sequentially. Sequential read is a not POINT='s forte, nor it is a hash object's forte, as the iterator has a significant underlying software overhead. Both rather excel at searching for a given obs number or key-value. The most natural way of speeding up repeated sequential scans is a temp array because it reads from memory and does it fast. Hence, for example - using Bart's test data (and not worrying about hardcoding):

data test_arr ;                                                                                                                                                                                                                                                 
  array ya  [10000]    _temporary_ ;                                                                                                                                                                                                                            
  array y1a [10000]    _temporary_ ;                                                                                                                                                                                                                            
  array y2a [10000]    _temporary_ ;                                                                                                                                                                                                                            
  array y3a [10000] $1 _temporary_ ;                                                                                                                                                                                                                            
  if _n_ = 1 then do until (z) ;                                                                                                                                                                                                                                
    set y end = z;                                                                                                                                                                                                                                              
    ya [_n_] = y  ;                                                                                                                                                                                                                                             
    y1a[_n_] = y1 ;                                                                                                                                                                                                                                             
    y2a[_n_] = y2 ;                                                                                                                                                                                                                                             
    y3a[_n_] = y3 ;                                                                                                                                                                                                                                             
  end ;                                                                                                                                                                                                                                                         
  set x ;                                                                                                                                                                                                                                                       
  do _n_ = 1 to 10000 ;                                                                                                                                                                                                                                         
    y  = ya [_n_] ;                                                                                                                                                                                                                                             
    y1 = y1a[_n_] ;                                                                                                                                                                                                                                             
    y2 = y2a[_n_] ;                                                                                                                                                                                                                                             
    y3 = y3a[_n_] ;                                                                                                                                                                                                                                             
    output ;                                                                                                                                                                                                                                                    
  end ;                                                                                                                                                                                                                                                         
run ;

actually outperforms SQL by about 20 percent in my tests on the same laptop I'm typing this. Which kind of makes me ideate that SQL does kind of the same sort of thing behind-the-scenes.

Kind regards

Paul D.

Registration is open

SAS Training: Just a Click Away