BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
venkibhu14
Calcite | Level 5

Hi All,

Can someone provide me a code for a cartesian product using HASH!

Thanks,

Venkat.

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

Hello,

 

See my solution to this question :

Macro is taking a long time to run - Risk set sampling
Posted 03-24-2021 11:44 AM
https://communities.sas.com/t5/SAS-Programming/Macro-is-taking-a-long-time-to-run-Risk-set-sampling/...

 

I'm doing a Cartesian product over there using a hash table (instead of the 'traditional' SQL-way to do the same).

 

Cheers,
Koen

View solution in original post

21 REPLIES 21
sbxkoenk
SAS Super FREQ

Hello,

 

See my solution to this question :

Macro is taking a long time to run - Risk set sampling
Posted 03-24-2021 11:44 AM
https://communities.sas.com/t5/SAS-Programming/Macro-is-taking-a-long-time-to-run-Risk-set-sampling/...

 

I'm doing a Cartesian product over there using a hash table (instead of the 'traditional' SQL-way to do the same).

 

Cheers,
Koen

venkibhu14
Calcite | Level 5

Hi Koen, 

Thanks a lot for a quick reply. Its working...:)

yabwon
Amethyst | Level 16

if you need cartesian product in a datastep you can do it without using hash:

data x;
input x;
cards;
1
2
3
;
run;

data y;
input y;
cards;
10
20
30
;
run;

data x_times_y;
  set x;
  do point= 1 to nobs;
    set y nobs=nobs point=point;
    output;
  end;
run;

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



PeterClemmensen
Tourmaline | Level 20

Another Hash Approach

 

data x;
input x;
cards;
1
2
3
;
run;

data y;
input y;
cards;
10
20
30
;
run;

data want;
   if _N_ = 1 then do;
      dcl hash h(dataset : "y");
      h.definekey("y");
      h.definedata(all : "Y");
      h.definedone();
      dcl hiter i("h");
   end;
   
   set x;
   y = .;
   
   do while (i.next() = 0);
      output;
   end;
run;
yabwon
Amethyst | Level 16

why not just:

   if _N_ = 1 then do;
      if 0 then set y;
      dcl hash h(dataset : "y");
      h.defineKey(all : "Y");
      h.defineDone();
      dcl hiter i("h");
   end;

?

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



PeterClemmensen
Tourmaline | Level 20

@yabwon Just figured if y had more variables. Obviously you'd have to prepare the PDV 🙂

 

Hope everything is well.

yabwon
Amethyst | Level 16

One is the PDV, but the second is that you don't need defineData() and save some memory.

 

B-)

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ChrisNZ
Tourmaline | Level 20

You don't save memory. If definedata() is not used, the key list is used to fill the data portion. One of the many ways SAS hash tables waste memory.

yabwon
Amethyst | Level 16

Hi Chris (@ChrisNZ),

 

Yes, you are absolutely right. Sorry for my mistake.

I read about this memory "issue" in Paul Dorfman's(@hashman) and Don Henderson's  (@DonH)  book but yesterday, when I wrote that one, I "mixed directions" in my head and didn't do the test to confirm (I did test below so others could see it too). 

 

Bottom lines:

1) Peter's solution is more optimal (@PeterClemmensen),

2) if you need a hash table only for .check()-ing and have a long key then put one singe variable as data portion, e.g. _N_ to save some memory (shorter than 8 bytes won't help [last two examples below]).

 

Bart

 

[edit] P.S. It seems to be a good one for a SAS Ballot Idea: "optimise hash table memory usage".

 

 

1    options msglevel=I fullstimer;
2    data test;
3      do x = 1 to 1e7;
4        y = x;
5        z = x;
6        t = "a";
7        output;
8      end;
9    run;

NOTE: The data set WORK.TEST has 10000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.38 seconds
      user cpu time       0.23 seconds
      system cpu time     0.15 seconds
      memory              394.50k
      OS Memory           12532.00k

10
11
12   data _null_;
13     if 0 then set test;
14     dcl hash h(dataset : "test");
15     h.defineKey(all : "Y"); /* no data portion */
16     h.defineDone();
17     dcl hiter i("h");
18
19     stop;
20   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           4.23 seconds
      user cpu time       3.70 seconds
      system cpu time     0.51 seconds
      memory              1180490.53k
      OS Memory           1191940.00k

21
22   data _null_;
23     if 0 then set test;
24     dcl hash h(dataset : "test");
25     h.defineKey("x");
26     h.defineData(all : "Y");
27     h.defineDone();
28     dcl hiter i("h");
29
30     stop;
31   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.86 seconds
      user cpu time       3.43 seconds
      system cpu time     0.43 seconds
      memory              983883.81k
      OS Memory           995396.00k

32
33
34   data _null_;
35     if 0 then set test;
36     dcl hash h(dataset : "test");
37     h.defineKey(all : "Y");
38     h.defineData("z"); /* z is 8 bytes */
39     h.defineDone();
40
41     stop;
42   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.90 seconds
      user cpu time       3.51 seconds
      system cpu time     0.39 seconds
      memory              983857.75k
      OS Memory           995396.00k

43
44   data _null_;
45     if 0 then set test;
46     dcl hash h(dataset : "test");
47     h.defineKey(all : "Y");
48     h.defineData("t"); /* t is 1 byte */
49     h.defineDone();
50
51     stop;
52   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
      real time           3.79 seconds
      user cpu time       3.45 seconds
      system cpu time     0.34 seconds
      memory              983857.75k
      OS Memory           995396.00k

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ChrisNZ
Tourmaline | Level 20

Nice!  Have you compared to proc sql?

yabwon
Amethyst | Level 16

Do you mean: 

select x.*,y.* from x,y;

vs. data step with hash, vs. data step with point?

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



yabwon
Amethyst | Level 16

Tests for small tables (10K obs,) are below. I don't have space for bigger sets on my laptop.

 

In terms of "real time" it is: 1) SQL (even done by SQXJSL = step loop join) , 2) hash, 3) point (even when whole table is in ram).

 

In terms of "memory" it is: 1) point, 2) hash, 3) SQL.

 

In terms of "OS memory" everything was less than 30MB so... 

 

Bart

 

1    options msglevel=I fullstimer;
2    data x;
3      do x = 1 to 1e4;
4        x1 = x;
5        x2 = x;
6        x3 = "a";
7        output;
8      end;
9    run;

NOTE: The data set WORK.X has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              395.03k
      OS Memory           24060.00k

10
11   data y;
12     do y = 1 to 1e4;
13       y1 = y;
14       y2 = y;
15       y3 = "b";
16       output;
17     end;
18   run;

NOTE: The data set WORK.Y has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              395.03k
      OS Memory           24060.00k

19
20   proc sql;
21   create table test1 as
22   select x.*, y.* from x,y
23   ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns.

24   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           8.60 seconds
      user cpu time       6.51 seconds
      system cpu time     1.82 seconds
      memory              5633.68k
      OS Memory           29184.00k

25   data test2;
26     set x;
27     do point= 1 to nobs;
28       set y nobs=nobs point=point;
29       output;
30     end;
31   run;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST2 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           22.25 seconds
      user cpu time       15.43 seconds
      system cpu time     6.62 seconds
      memory              930.18k
      OS Memory           24572.00k

32   data test3;
33      if _N_ = 1 then do;
34         if 0 then set y;
35         dcl hash h(dataset : "y");
36         h.definekey("y");
37         h.definedata(all : "Y");
38         h.definedone();
39         dcl hiter i("h");
40      end;
41
42      set x;
43
44      do while (i.next() = 0);
45         output;
46      end;
47   run;

NOTE: There were 10000 observations read from the data set WORK.Y.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST3 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           13.32 seconds
      user cpu time       11.21 seconds
      system cpu time     2.03 seconds
      memory              2446.78k
      OS Memory           26068.00k

48   sasfile y load;
NOTE: The file WORK.Y.DATA has been loaded into memory by the SASFILE statement.
49   data test4;
50     set x;
51     do point= 1 to nobs;
52       set y nobs=nobs point=point;
53       output;
54     end;
55   run;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST4 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           22.38 seconds
      user cpu time       16.56 seconds
      system cpu time     5.81 seconds
      memory              713.75k
      OS Memory           25408.00k

56   sasfile y close;
NOTE: The file WORK.Y.DATA has been closed by the SASFILE statement.

SQL "under the hood":

1    proc sql feedback _tree _method;
2    create table test1 as
3    select x.*, y.* from x,y
4    ;
NOTE: Statement transforms to:

        select X.x, X.x1, X.x2, X.x3, Y.y, Y.y1, Y.y2, Y.y3
          from WORK.X, WORK.Y;

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjsl
              sqxsrc( WORK.X )
              sqxsrc( WORK.Y )

Tree as planned.
                               /-SYM-V-(x.x:1 flag=00000001)
                     /-OBJ----|
                    |         |--SYM-V-(x.x1:2 flag=00000001)
                    |         |--SYM-V-(x.x2:3 flag=00000001)
                    |         |--SYM-V-(x.x3:4 flag=00000001)
                    |         |--SYM-V-(y.y:1 flag=00000001)
                    |         |--SYM-V-(y.y1:2 flag=00000001)
                    |         |--SYM-V-(y.y2:3 flag=00000001)
                    |          \-SYM-V-(y.y3:4 flag=00000001)
           /-JOIN---|
          |         |                              /-SYM-V-(x.x:1 flag=00000001)
          |         |                    /-OBJ----|
          |         |                   |         |--SYM-V-(x.x1:2 flag=00000001)
          |         |                   |         |--SYM-V-(x.x2:3 flag=00000001)
          |         |                   |          \-SYM-V-(x.x3:4 flag=00000001)
          |         |          /-SRC----|
          |         |         |          \-TABL[WORK].x opt=''
          |          \-FROM---|
          |                   |                    /-SYM-V-(y.y:1 flag=00000001)
          |                   |          /-OBJ----|
          |                   |         |         |--SYM-V-(y.y1:2 flag=00000001)
          |                   |         |         |--SYM-V-(y.y2:3 flag=00000001)
          |                   |         |          \-SYM-V-(y.y3:4 flag=00000001)
          |                    \-SRC----|
          |                              \-TABL[WORK].y opt=''
 --SSEL---|


NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns.

5    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           8.52 seconds
      user cpu time       6.43 seconds
      system cpu time     1.87 seconds
      memory              5647.87k
      OS Memory           30464.00k

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ChrisNZ
Tourmaline | Level 20

Very comprehensive benchmark, thank you. I'd have been disappointed if SQL was slower.

hashman
Ammonite | Level 13

Hey @yabwon / @ChrisNZ / @DonH:

 

Forming a CP is a matter of reading X one record at a time and pairing it with each record from Y sequentially. Sequential read is a not POINT='s forte, nor it is a hash object's forte, as the iterator has a significant underlying software overhead. Both rather excel at searching for a given obs number or key-value. The most natural way of speeding up repeated sequential scans is a temp array because it reads from memory and does it fast. Hence, for example - using Bart's test data (and not worrying about hardcoding):

 

data test_arr ;                                                                                                                                                                                                                                                 
  array ya  [10000]    _temporary_ ;                                                                                                                                                                                                                            
  array y1a [10000]    _temporary_ ;                                                                                                                                                                                                                            
  array y2a [10000]    _temporary_ ;                                                                                                                                                                                                                            
  array y3a [10000] $1 _temporary_ ;                                                                                                                                                                                                                            
  if _n_ = 1 then do until (z) ;                                                                                                                                                                                                                                
    set y end = z;                                                                                                                                                                                                                                              
    ya [_n_] = y  ;                                                                                                                                                                                                                                             
    y1a[_n_] = y1 ;                                                                                                                                                                                                                                             
    y2a[_n_] = y2 ;                                                                                                                                                                                                                                             
    y3a[_n_] = y3 ;                                                                                                                                                                                                                                             
  end ;                                                                                                                                                                                                                                                         
  set x ;                                                                                                                                                                                                                                                       
  do _n_ = 1 to 10000 ;                                                                                                                                                                                                                                         
    y  = ya [_n_] ;                                                                                                                                                                                                                                             
    y1 = y1a[_n_] ;                                                                                                                                                                                                                                             
    y2 = y2a[_n_] ;                                                                                                                                                                                                                                             
    y3 = y3a[_n_] ;                                                                                                                                                                                                                                             
    output ;                                                                                                                                                                                                                                                    
  end ;                                                                                                                                                                                                                                                         
run ;

actually outperforms SQL by about 20 percent in my tests on the same laptop I'm typing this. Which kind of makes me ideate that SQL does kind of the same sort of thing behind-the-scenes.

 

Kind regards

Paul D.          

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 4575 views
  • 27 likes
  • 8 in conversation