Yep, arrays are fastest.
data TEST_FETCH ;
set X end=LASTOBS;
if 0 then set Y;
retain DSID;
if _N_=1 then do; DSID=open('Y'); call set (DSID); end;
else RC=rewind(DSID);
do _N_ = 1 to 10e3;
RC=fetch(DSID);
output ;
end ;
if LASTOBS then RC=close(DSID);
run ;
in 1m20s, compared to 45s for the array logic on my old server.
FETCH is still several times faster than POINT=.
Lots of great techniques have been discussed here.
When Paul and I worked together we would regularly have what can best be described as yeah, but conversations. So here is my yeah, but point.
The issue from my perspective is pretty straightforward, performance issues depend on a lot of factors and assuming that the observed results for a given set of data (or combinations of data tables) applies across the board is questionable at best.
Lots of factors play into this. When creating a cartesian product, the size of the data sets probably matters - both in terms of the number of rows as well as the number of columns (as well as the total length of the columns).
When I did performance evaluations, I tried my best to use data sets that looked like the data for the application at hand. All of the approaches presented here are worthy of evaluation. But they need to be evaluated in the context of the particular sets of data at issue.
Don, I remember those "yeah, but"s quite well and rather fondly.
To what you've butted here, I have but a buttless yeah.
Kind regards
Paul D
Frankly I'm not surprised that FETCH works faster than POINT=.
After all it is much more "SET-ish" [from doc.: "Reads the next non-deleted observation from a SAS data set"] than random pointing which seems to be more like FETCHOBS().
I put below log for all codes in one more "summary" run so we can see comparison on the same machine.
🙂
Bart
1 options msglevel=I fullstimer;
2 data x;
3 do x = 1 to 1e4;
4 x1 = x;
5 x2 = x;
6 x3 = "a";
7 output;
8 end;
9 run;
NOTE: The data set WORK.X has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 393.71k
OS Memory 16376.00k
10
11
12 data y;
13 do y = 1 to 1e4;
14 y1 = y;
15 y2 = y;
16 y3 = "b";
17 output;
18 end;
19 run;
NOTE: The data set WORK.Y has 10000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 393.71k
OS Memory 16376.00k
20
21
22
23 proc sql;
24 create table test1 as
25 select x.*, y.* from x,y
26 ;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns.
27 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 8.99 seconds
user cpu time 6.64 seconds
system cpu time 2.31 seconds
memory 5634.90k
OS Memory 21500.00k
30
31 data test2;
32 set x;
33 do point= 1 to nobs;
34 set y nobs=nobs point=point;
35 output;
36 end;
37 run;
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST2 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 24.86 seconds
user cpu time 18.00 seconds
system cpu time 7.04 seconds
memory 929.56k
OS Memory 16376.00k
40
41 data test3;
42 if _N_ = 1 then do;
43 if 0 then set y;
44 dcl hash h(dataset : "y");
45 h.definekey("y");
46 h.definedata(all : "Y");
47 h.definedone();
48 dcl hiter i("h");
49 end;
50
51 set x;
52
53 do while (i.next() = 0);
54 output;
55 end;
56 run;
NOTE: There were 10000 observations read from the data set WORK.Y.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST3 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 14.17 seconds
user cpu time 11.82 seconds
system cpu time 2.12 seconds
memory 2446.90k
OS Memory 17360.00k
59
60 sasfile y load;
NOTE: The file WORK.Y.DATA has been loaded into memory by the SASFILE statement.
61 data test4;
62 set x;
63 do point= 1 to nobs;
64 set y nobs=nobs point=point;
65 output;
66 end;
67 run;
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST4 has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 25.56 seconds
user cpu time 18.50 seconds
system cpu time 6.87 seconds
memory 714.81k
OS Memory 16700.00k
68 sasfile y close;
NOTE: The file WORK.Y.DATA has been closed by the SASFILE statement.
71
72 data test_arr ;
73 array ya [10000] _temporary_ ;
74 array y1a [10000] _temporary_ ;
75 array y2a [10000] _temporary_ ;
76 array y3a [10000] $1 _temporary_ ;
77 if _n_ = 1 then do until (z) ;
78 set y end = z;
79 ya [_n_] = y ;
80 y1a[_n_] = y1 ;
81 y2a[_n_] = y2 ;
82 y3a[_n_] = y3 ;
83 end ;
84 set x ;
85 do _n_ = 1 to 10000 ;
86 y = ya [_n_] ;
87 y1 = y1a[_n_] ;
88 y2 = y2a[_n_] ;
89 y3 = y3a[_n_] ;
90 output ;
91 end ;
92 run ;
NOTE: There were 10000 observations read from the data set WORK.Y.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST_ARR has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 7.56 seconds
user cpu time 4.92 seconds
system cpu time 2.28 seconds
memory 1757.28k
OS Memory 16924.00k
95
96 data TEST_FETCH ;
97 set X end=LASTOBS;
98 if 0 then set Y;
99 retain DSID;
100 if _N_=1 then do; DSID=open('Y'); call set (DSID); end;
101 else RC=rewind(DSID);
102 do _N_ = 1 to 1e4;
103 RC=fetch(DSID);
104 output ;
105 end ;
106 if LASTOBS then RC=close(DSID);
107 drop dsid rc;
108 run ;
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.TEST_FETCH has 100000000 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 13.32 seconds
user cpu time 10.42 seconds
system cpu time 2.70 seconds
memory 1033.56k
OS Memory 16376.00k
Correct. And there is also a minimum size for the combination of the key and data portions of the table. IIRC that size also depends on the OS. I don’t remember the details, but I’m sure @hashman does 😀.
The size depends on the architecture (32 or 64 bits).
On 64-bit platforms, 48 bytes per item minimum (including 8 bytes each for key and data), and item increments of 16 bytes, with 8 byte increment for each of key and data.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.