Hi All,
Can someone provide me a code for a cartesian product using HASH!
Thanks,
Venkat.
Hello,
See my solution to this question :
Macro is taking a long time to run - Risk set sampling
Posted 03-24-2021 11:44 AM
https://communities.sas.com/t5/SAS-Programming/Macro-is-taking-a-long-time-to-run-Risk-set-sampling/...
I'm doing a Cartesian product over there using a hash table (instead of the 'traditional' SQL-way to do the same).
Cheers,
Koen
Hello,
See my solution to this question :
Macro is taking a long time to run - Risk set sampling
Posted 03-24-2021 11:44 AM
https://communities.sas.com/t5/SAS-Programming/Macro-is-taking-a-long-time-to-run-Risk-set-sampling/...
I'm doing a Cartesian product over there using a hash table (instead of the 'traditional' SQL-way to do the same).
Cheers,
Koen
Hi Koen,
Thanks a lot for a quick reply. Its working...:)
if you need cartesian product in a datastep you can do it without using hash:
data x;
input x;
cards;
1
2
3
;
run;
data y;
input y;
cards;
10
20
30
;
run;
data x_times_y;
set x;
do point= 1 to nobs;
set y nobs=nobs point=point;
output;
end;
run;
Bart
Another Hash Approach
data x;
input x;
cards;
1
2
3
;
run;
data y;
input y;
cards;
10
20
30
;
run;
data want;
if _N_ = 1 then do;
dcl hash h(dataset : "y");
h.definekey("y");
h.definedata(all : "Y");
h.definedone();
dcl hiter i("h");
end;
set x;
y = .;
do while (i.next() = 0);
output;
end;
run;
why not just:
if _N_ = 1 then do;
if 0 then set y;
dcl hash h(dataset : "y");
h.defineKey(all : "Y");
h.defineDone();
dcl hiter i("h");
end;
?
Bart
@yabwon Just figured if y had more variables. Obviously you'd have to prepare the PDV 🙂
Hope everything is well.
One is the PDV, but the second is that you don't need defineData() and save some memory.
B-)
You don't save memory. If definedata() is not used, the key list is used to fill the data portion. One of the many ways SAS hash tables waste memory.
Hi Chris (@ChrisNZ),
Yes, you are absolutely right. Sorry for my mistake.
I read about this memory "issue" in Paul Dorfman's(@hashman) and Don Henderson's (@DonH) book but yesterday, when I wrote that one, I "mixed directions" in my head and didn't do the test to confirm (I did test below so others could see it too).
Bottom lines:
1) Peter's solution is more optimal (@PeterClemmensen),
2) if you need a hash table only for .check()-ing and have a long key then put one singe variable as data portion, e.g. _N_ to save some memory (shorter than 8 bytes won't help [last two examples below]).
Bart
[edit] P.S. It seems to be a good one for a SAS Ballot Idea: "optimise hash table memory usage".
1 options msglevel=I fullstimer; 2 data test; 3 do x = 1 to 1e7; 4 y = x; 5 z = x; 6 t = "a"; 7 output; 8 end; 9 run; NOTE: The data set WORK.TEST has 10000000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.38 seconds user cpu time 0.23 seconds system cpu time 0.15 seconds memory 394.50k OS Memory 12532.00k 10 11 12 data _null_; 13 if 0 then set test; 14 dcl hash h(dataset : "test"); 15 h.defineKey(all : "Y"); /* no data portion */ 16 h.defineDone(); 17 dcl hiter i("h"); 18 19 stop; 20 run; NOTE: There were 10000000 observations read from the data set WORK.TEST. NOTE: DATA statement used (Total process time): real time 4.23 seconds user cpu time 3.70 seconds system cpu time 0.51 seconds memory 1180490.53k OS Memory 1191940.00k 21 22 data _null_; 23 if 0 then set test; 24 dcl hash h(dataset : "test"); 25 h.defineKey("x"); 26 h.defineData(all : "Y"); 27 h.defineDone(); 28 dcl hiter i("h"); 29 30 stop; 31 run; NOTE: There were 10000000 observations read from the data set WORK.TEST. NOTE: DATA statement used (Total process time): real time 3.86 seconds user cpu time 3.43 seconds system cpu time 0.43 seconds memory 983883.81k OS Memory 995396.00k 32 33 34 data _null_; 35 if 0 then set test; 36 dcl hash h(dataset : "test"); 37 h.defineKey(all : "Y"); 38 h.defineData("z"); /* z is 8 bytes */ 39 h.defineDone(); 40 41 stop; 42 run; NOTE: There were 10000000 observations read from the data set WORK.TEST. NOTE: DATA statement used (Total process time): real time 3.90 seconds user cpu time 3.51 seconds system cpu time 0.39 seconds memory 983857.75k OS Memory 995396.00k 43 44 data _null_; 45 if 0 then set test; 46 dcl hash h(dataset : "test"); 47 h.defineKey(all : "Y"); 48 h.defineData("t"); /* t is 1 byte */ 49 h.defineDone(); 50 51 stop; 52 run; NOTE: There were 10000000 observations read from the data set WORK.TEST. NOTE: DATA statement used (Total process time): real time 3.79 seconds user cpu time 3.45 seconds system cpu time 0.34 seconds memory 983857.75k OS Memory 995396.00k
Nice! Have you compared to proc sql?
Do you mean:
select x.*,y.* from x,y;
vs. data step with hash, vs. data step with point?
Bart
Tests for small tables (10K obs,) are below. I don't have space for bigger sets on my laptop.
In terms of "real time" it is: 1) SQL (even done by SQXJSL = step loop join) , 2) hash, 3) point (even when whole table is in ram).
In terms of "memory" it is: 1) point, 2) hash, 3) SQL.
In terms of "OS memory" everything was less than 30MB so...
Bart
1 options msglevel=I fullstimer; 2 data x; 3 do x = 1 to 1e4; 4 x1 = x; 5 x2 = x; 6 x3 = "a"; 7 output; 8 end; 9 run; NOTE: The data set WORK.X has 10000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 395.03k OS Memory 24060.00k 10 11 data y; 12 do y = 1 to 1e4; 13 y1 = y; 14 y2 = y; 15 y3 = "b"; 16 output; 17 end; 18 run; NOTE: The data set WORK.Y has 10000 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.01 seconds system cpu time 0.00 seconds memory 395.03k OS Memory 24060.00k 19 20 proc sql; 21 create table test1 as 22 select x.*, y.* from x,y 23 ; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns. 24 quit; NOTE: PROCEDURE SQL used (Total process time): real time 8.60 seconds user cpu time 6.51 seconds system cpu time 1.82 seconds memory 5633.68k OS Memory 29184.00k 25 data test2; 26 set x; 27 do point= 1 to nobs; 28 set y nobs=nobs point=point; 29 output; 30 end; 31 run; NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST2 has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 22.25 seconds user cpu time 15.43 seconds system cpu time 6.62 seconds memory 930.18k OS Memory 24572.00k 32 data test3; 33 if _N_ = 1 then do; 34 if 0 then set y; 35 dcl hash h(dataset : "y"); 36 h.definekey("y"); 37 h.definedata(all : "Y"); 38 h.definedone(); 39 dcl hiter i("h"); 40 end; 41 42 set x; 43 44 do while (i.next() = 0); 45 output; 46 end; 47 run; NOTE: There were 10000 observations read from the data set WORK.Y. NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST3 has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 13.32 seconds user cpu time 11.21 seconds system cpu time 2.03 seconds memory 2446.78k OS Memory 26068.00k 48 sasfile y load; NOTE: The file WORK.Y.DATA has been loaded into memory by the SASFILE statement. 49 data test4; 50 set x; 51 do point= 1 to nobs; 52 set y nobs=nobs point=point; 53 output; 54 end; 55 run; NOTE: There were 10000 observations read from the data set WORK.X. NOTE: The data set WORK.TEST4 has 100000000 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 22.38 seconds user cpu time 16.56 seconds system cpu time 5.81 seconds memory 713.75k OS Memory 25408.00k 56 sasfile y close; NOTE: The file WORK.Y.DATA has been closed by the SASFILE statement.
SQL "under the hood":
1 proc sql feedback _tree _method; 2 create table test1 as 3 select x.*, y.* from x,y 4 ; NOTE: Statement transforms to: select X.x, X.x1, X.x2, X.x3, Y.y, Y.y1, Y.y2, Y.y3 from WORK.X, WORK.Y; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: SQL execution methods chosen are: sqxcrta sqxjsl sqxsrc( WORK.X ) sqxsrc( WORK.Y ) Tree as planned. /-SYM-V-(x.x:1 flag=00000001) /-OBJ----| | |--SYM-V-(x.x1:2 flag=00000001) | |--SYM-V-(x.x2:3 flag=00000001) | |--SYM-V-(x.x3:4 flag=00000001) | |--SYM-V-(y.y:1 flag=00000001) | |--SYM-V-(y.y1:2 flag=00000001) | |--SYM-V-(y.y2:3 flag=00000001) | \-SYM-V-(y.y3:4 flag=00000001) /-JOIN---| | | /-SYM-V-(x.x:1 flag=00000001) | | /-OBJ----| | | | |--SYM-V-(x.x1:2 flag=00000001) | | | |--SYM-V-(x.x2:3 flag=00000001) | | | \-SYM-V-(x.x3:4 flag=00000001) | | /-SRC----| | | | \-TABL[WORK].x opt='' | \-FROM---| | | /-SYM-V-(y.y:1 flag=00000001) | | /-OBJ----| | | | |--SYM-V-(y.y1:2 flag=00000001) | | | |--SYM-V-(y.y2:3 flag=00000001) | | | \-SYM-V-(y.y3:4 flag=00000001) | \-SRC----| | \-TABL[WORK].y opt='' --SSEL---| NOTE: Table WORK.TEST1 created, with 100000000 rows and 8 columns. 5 quit; NOTE: PROCEDURE SQL used (Total process time): real time 8.52 seconds user cpu time 6.43 seconds system cpu time 1.87 seconds memory 5647.87k OS Memory 30464.00k
Very comprehensive benchmark, thank you. I'd have been disappointed if SQL was slower.
Hey @yabwon / @ChrisNZ / @DonH:
Forming a CP is a matter of reading X one record at a time and pairing it with each record from Y sequentially. Sequential read is a not POINT='s forte, nor it is a hash object's forte, as the iterator has a significant underlying software overhead. Both rather excel at searching for a given obs number or key-value. The most natural way of speeding up repeated sequential scans is a temp array because it reads from memory and does it fast. Hence, for example - using Bart's test data (and not worrying about hardcoding):
data test_arr ;
array ya [10000] _temporary_ ;
array y1a [10000] _temporary_ ;
array y2a [10000] _temporary_ ;
array y3a [10000] $1 _temporary_ ;
if _n_ = 1 then do until (z) ;
set y end = z;
ya [_n_] = y ;
y1a[_n_] = y1 ;
y2a[_n_] = y2 ;
y3a[_n_] = y3 ;
end ;
set x ;
do _n_ = 1 to 10000 ;
y = ya [_n_] ;
y1 = y1a[_n_] ;
y2 = y2a[_n_] ;
y3 = y3a[_n_] ;
output ;
end ;
run ;
actually outperforms SQL by about 20 percent in my tests on the same laptop I'm typing this. Which kind of makes me ideate that SQL does kind of the same sort of thing behind-the-scenes.
Kind regards
Paul D.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.