Yep, arrays are fastest.
data TEST_FETCH ;
set X end=LASTOBS;
if 0 then set Y;
retain DSID;
if _N_=1 then do; DSID=open('Y'); call set (DSID); end;
else RC=rewind(DSID);
do _N_ = 1 to 10e3;
RC=fetch(DSID);
output ;
end ;
if LASTOBS then RC=close(DSID);
run ;
in 1m20s, compared to 45s for the array logic on my old server.
FETCH is still several times faster than POINT=.
Lots of great techniques have been discussed here.
When Paul and I worked together we would regularly have what can best be described as yeah, but conversations. So here is my yeah, but point.
The issue from my perspective is pretty straightforward, performance issues depend on a lot of factors and assuming that the observed results for a given set of data (or combinations of data tables) applies across the board is questionable at best.
Lots of factors play into this. When creating a cartesian product, the size of the data sets probably matters - both in terms of the number of rows as well as the number of columns (as well as the total length of the columns).
When I did performance evaluations, I tried my best to use data sets that looked like the data for the application at hand. All of the approaches presented here are worthy of evaluation. But they need to be evaluated in the context of the particular sets of data at issue.
Don, I remember those "yeah, but"s quite well and rather fondly.
To what you've butted here, I have but a buttless yeah.
Kind regards
Paul D
Frankly I'm not surprised that FETCH works faster than POINT=.
After all it is much more "SET-ish" [from doc.: "Reads the next non-deleted observation from a SAS data set"] than random pointing which seems to be more like FETCHOBS().
I put below log for all codes in one more "summary" run so we can see comparison on the same machine.
🙂
Bart
1 options msglevel=I fullstimer; 2 data x; 3 do x = 1 to 1e4; 4 x1 = x; 5 x2 = x; 6 x3 = "a"; 7 output; 8 end; 9 run; NOTE: The data set WORK.X has 10000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 393.71k OS Memory 16376.00k 10 11 12 data y; 13 do y = 1 to 1e4; 14 y1 = y; 15 y2 = y; 16 y3 = "b"; 17 output; 18 end; 19 run; NOTE: The data set WORK.Y has 10000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 393.71k OS Memory 16376.00k 20 21 22 23 proc sql; 24 create table test1 as 25 select x.*, y.* from x,y 26 ; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns. 27 quit; NOTE: PROCEDURE SQL used (Total process time): real time 8.99 seconds user cpu time 6.64 seconds system cpu time 2.31 seconds memory 5634.90k OS Memory 21500.00k 30 31 data test2; 32 set x; 33 do point= 1 to nobs; 34 set y nobs=nobs point=point; 35 output; 36 end; 37 run; NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST2 has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 24.86 seconds user cpu time 18.00 seconds system cpu time 7.04 seconds memory 929.56k OS Memory 16376.00k 40 41 data test3; 42 if _N_ = 1 then do; 43 if 0 then set y; 44 dcl hash h(dataset : "y"); 45 h.definekey("y"); 46 h.definedata(all : "Y"); 47 h.definedone(); 48 dcl hiter i("h"); 49 end; 50 51 set x; 52 53 do while (i.next() = 0); 54 output; 55 end; 56 run; NOTE: There were 10000 observations read from the data set WORK.Y. NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST3 has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 14.17 seconds user cpu time 11.82 seconds system cpu time 2.12 seconds memory 2446.90k OS Memory 17360.00k 59 60 sasfile y load; NOTE: The file WORK.Y.DATA has been loaded into memory by the SASFILE statement. 61 data test4; 62 set x; 63 do point= 1 to nobs; 64 set y nobs=nobs point=point; 65 output; 66 end; 67 run; NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST4 has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 25.56 seconds user cpu time 18.50 seconds system cpu time 6.87 seconds memory 714.81k OS Memory 16700.00k 68 sasfile y close; NOTE: The file WORK.Y.DATA has been closed by the SASFILE statement. 71 72 data test_arr ; 73 array ya [10000] _temporary_ ; 74 array y1a [10000] _temporary_ ; 75 array y2a [10000] _temporary_ ; 76 array y3a [10000] $1 _temporary_ ; 77 if _n_ = 1 then do until (z) ; 78 set y end = z; 79 ya [_n_] = y ; 80 y1a[_n_] = y1 ; 81 y2a[_n_] = y2 ; 82 y3a[_n_] = y3 ; 83 end ; 84 set x ; 85 do _n_ = 1 to 10000 ; 86 y = ya [_n_] ; 87 y1 = y1a[_n_] ; 88 y2 = y2a[_n_] ; 89 y3 = y3a[_n_] ; 90 output ; 91 end ; 92 run ; NOTE: There were 10000 observations read from the data set WORK.Y. NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST_ARR has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 7.56 seconds user cpu time 4.92 seconds system cpu time 2.28 seconds memory 1757.28k OS Memory 16924.00k 95 96 data TEST_FETCH ; 97 set X end=LASTOBS; 98 if 0 then set Y; 99 retain DSID; 100 if _N_=1 then do; DSID=open('Y'); call set (DSID); end; 101 else RC=rewind(DSID); 102 do _N_ = 1 to 1e4; 103 RC=fetch(DSID); 104 output ; 105 end ; 106 if LASTOBS then RC=close(DSID); 107 drop dsid rc; 108 run ; NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST_FETCH has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 13.32 seconds user cpu time 10.42 seconds system cpu time 2.70 seconds memory 1033.56k OS Memory 16376.00k
Correct. And there is also a minimum size for the combination of the key and data portions of the table. IIRC that size also depends on the OS. I don’t remember the details, but I’m sure @hashman does 😀.
The size depends on the architecture (32 or 64 bits).
On 64-bit platforms, 48 bytes per item minimum (including 8 bytes each for key and data), and item increments of 16 bytes, with 8 byte increment for each of key and data.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.