Solved: Re: Cartesian Product using the HASH - Page 2

ChrisNZ · Posted 11-07-2021 08:13 PM

Yep, arrays are fastest.

data TEST_FETCH ;                                                                                                                                                                                                                                                 
  set X end=LASTOBS;   
  if 0 then set Y;
  retain DSID;
  if _N_=1 then do; DSID=open('Y'); call set (DSID); end;
  else RC=rewind(DSID);
  do _N_ = 1 to 10e3;                                                                                                                                                                                                                                         
    RC=fetch(DSID);
    output ;                                                                                                                                                                                                                                                    
  end ;   
  if LASTOBS then RC=close(DSID); 
run ;

in 1m20s, compared to 45s for the array logic on my old server.

FETCH is still several times faster than POINT=.

High-Performance SAS Coding - Third Edition

DonH · Posted 11-07-2021 09:00 PM

Lots of great techniques have been discussed here.
When Paul and I worked together we would regularly have what can best be described as yeah, but conversations. So here is my yeah, but point.
The issue from my perspective is pretty straightforward, performance issues depend on a lot of factors and assuming that the observed results for a given set of data (or combinations of data tables) applies across the board is questionable at best.
Lots of factors play into this. When creating a cartesian product, the size of the data sets probably matters - both in terms of the number of rows as well as the number of columns (as well as the total length of the columns).
When I did performance evaluations, I tried my best to use data sets that looked like the data for the application at hand. All of the approaches presented here are worthy of evaluation. But they need to be evaluated in the context of the particular sets of data at issue.

hashman · Posted 11-08-2021 10:10 PM

Don, I remember those "yeah, but"s quite well and rather fondly.

To what you've butted here, I have but a buttless yeah.

Kind regards

Paul D

yabwon · Posted 11-08-2021 03:21 AM

Frankly I'm not surprised that FETCH works faster than POINT=.

After all it is much more "SET-ish" [from doc.: "Reads the next non-deleted observation from a SAS data set"] than random pointing which seems to be more like FETCHOBS().

I put below log for all codes in one more "summary" run so we can see comparison on the same machine.

🙂

Bart

1    options msglevel=I fullstimer;
2    data x;
3     do x = 1 to 1e4;
4       x1 = x;
5       x2 = x;
6       x3 = "a";
7       output;
8     end;
9    run;

NOTE: The data set WORK.X has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              393.71k
      OS Memory           16376.00k

10
11
12   data y;
13    do y = 1 to 1e4;
14      y1 = y;
15      y2 = y;
16      y3 = "b";
17      output;
18    end;
19   run;

NOTE: The data set WORK.Y has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              393.71k
      OS Memory           16376.00k

20
21
22
23   proc sql;
24   create table test1 as
25   select x.*, y.* from x,y
26   ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns.

27   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           8.99 seconds
      user cpu time       6.64 seconds
      system cpu time     2.31 seconds
      memory              5634.90k
      OS Memory           21500.00k

30
31   data test2;
32    set x;
33    do point= 1 to nobs;
34      set y nobs=nobs point=point;
35      output;
36    end;
37   run;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST2 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           24.86 seconds
      user cpu time       18.00 seconds
      system cpu time     7.04 seconds
      memory              929.56k
      OS Memory           16376.00k

40
41   data test3;
42     if _N_ = 1 then do;
43        if 0 then set y;
44        dcl hash h(dataset : "y");
45        h.definekey("y");
46        h.definedata(all : "Y");
47        h.definedone();
48        dcl hiter i("h");
49     end;
50
51     set x;
52
53     do while (i.next() = 0);
54        output;
55     end;
56   run;

NOTE: There were 10000 observations read from the data set WORK.Y.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST3 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           14.17 seconds
      user cpu time       11.82 seconds
      system cpu time     2.12 seconds
      memory              2446.90k
      OS Memory           17360.00k

59
60   sasfile y load;
NOTE: The file WORK.Y.DATA has been loaded into memory by the SASFILE statement.
61   data test4;
62    set x;
63    do point= 1 to nobs;
64      set y nobs=nobs point=point;
65      output;
66    end;
67   run;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST4 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           25.56 seconds
      user cpu time       18.50 seconds
      system cpu time     6.87 seconds
      memory              714.81k
      OS Memory           16700.00k

68   sasfile y close;
NOTE: The file WORK.Y.DATA has been closed by the SASFILE statement.

71
72   data test_arr ;
73     array ya  [10000]    _temporary_ ;
74     array y1a [10000]    _temporary_ ;
75     array y2a [10000]    _temporary_ ;
76     array y3a [10000] $1 _temporary_ ;
77     if _n_ = 1 then do until (z) ;
78       set y end = z;
79       ya [_n_] = y  ;
80       y1a[_n_] = y1 ;
81       y2a[_n_] = y2 ;
82       y3a[_n_] = y3 ;
83     end ;
84     set x ;
85     do _n_ = 1 to 10000 ;
86       y  = ya [_n_] ;
87       y1 = y1a[_n_] ;
88       y2 = y2a[_n_] ;
89       y3 = y3a[_n_] ;
90       output ;
91     end ;
92   run ;

NOTE: There were 10000 observations read from the data set WORK.Y.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST_ARR has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           7.56 seconds
      user cpu time       4.92 seconds
      system cpu time     2.28 seconds
      memory              1757.28k
      OS Memory           16924.00k

95
96   data TEST_FETCH ;
97     set X end=LASTOBS;
98     if 0 then set Y;
99     retain DSID;
100    if _N_=1 then do; DSID=open('Y'); call set (DSID); end;
101    else RC=rewind(DSID);
102    do _N_ = 1 to 1e4;
103      RC=fetch(DSID);
104      output ;
105    end ;
106    if LASTOBS then RC=close(DSID);
107    drop dsid rc;
108  run ;

NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST_FETCH has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           13.32 seconds
      user cpu time       10.42 seconds
      system cpu time     2.70 seconds
      memory              1033.56k
      OS Memory           16376.00k

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

DonH · Posted 11-06-2021 11:03 AM

Correct. And there is also a minimum size for the combination of the key and data portions of the table. IIRC that size also depends on the OS. I don’t remember the details, but I’m sure @hashman does 😀.

ChrisNZ · Posted 11-06-2021 08:58 PM

The size depends on the architecture (32 or 64 bits).

On 64-bit platforms, 48 bytes per item minimum (including 8 bytes each for key and data), and item increments of 16 bytes, with 8 byte increment for each of key and data.

High-Performance SAS Coding - Third Edition

Ksharp · Posted 11-06-2021 06:39 AM

Peter.C
You need consider duplicated key as well !

dcl hash h(dataset : "y", multidata:'y' );

Re: Cartesian Product using the HASH

Re: Cartesian Product using the HASH

Re: Cartesian Product using the HASH

Re: Cartesian Product using the HASH

Re: Cartesian Product using the HASH

Re: Cartesian Product using the HASH

Re: Cartesian Product using the HASH

Registration is open

SAS Training: Just a Click Away