On my last post, a commenter asked about performing a left join with CAS versus foundation SAS ("Compute"). So I figured I'd give it a shot. I used the Orion Star's Order Fact and Customer Dim tables for the join and manipulated the data to achieve various record counts. The code is included below so you can see the actual tests. The results are summarized in the following table.
CAS vs Compute (Foundation SAS) Left Join Execution Time
Order Fact Row Count | Customer Dim Row Count | CAS | Foundation |
---|---|---|---|
96,118,569 | 9,085,354 | 15.2 | 06:28.0 |
238,868,919 | 22,578,454 | 38.3 | 08:18.4 |
476,786,169 | 45,066,954 | 01:49.1 | 21:03.2 |
714,703,419 | 67,555,454 | 08:40.4 | ****** |
952,620,669 | 90,043,954 | 11:35.4 | ****** |
1,190,537,919 | 112,532,454 | 15:06.6 | ****** |
******SAS Foundation ran out of space on the test since it is limited to a single server.
Note the following:
There was also a concern about CAS potentially thrashing at data volumes higher than available memory. I explored this a bit and didn't see much evidence of thrashing. For example in final test, the combined size of the order fact and customer dim tables was 264GB which is well over the combined available memory for the three worker nodes, 186GB. Despite this the performance curve seems more or less linear -- 714M rows at 8:40, 952M rows at 11:35, and 1.19B rows at 15:06.
If you want to the code from these tests so you can try something similar in your environment, use the code provided below. You'll have to customize it to fit your data but it should give you a huge headstart.
CAS Left Join Test code:
options cashost="sas-cas-server-default-client" casport=5570;
CAS mySession SESSOPTS=(metrics=true);
CASLIB _all_ assign;
libname geldat cvp "/gelcontent/data/OrionStarData/orstar";
libname sasdm "/gelcontent/data";
/* Remove the formats and the Ceda*/
data sasdm.custDimUTF;
format Customer_Country $4. Customer_Gender $2.;
set geldat.customer_dim;
run;
data sasdm.orderFactUTF;
format Order_Type 12.;
set geldat.order_fact;
run;
/* Load the tables to CAS */
proc casutil ;
load casdata="orderFactUTF.sas7bdat" incaslib="dm" casout="orderFactUTF" outcaslib="dm" copies=0
importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;
proc casutil ;
load casdata="custDimUTF.sas7bdat" incaslib="dm" casout="custDimUTF" outcaslib="dm" copies=0
importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;
/* Increase the row count */
data dm.custDimBig(copies=0 duplicate=yes replace=yes) ;
set dm.custDimUTF;
do i = 0 to 1250;
customer_id = customer_id + (i * 100000);
output;
end;
run;
data dm.orderFactBig(copies=0 replace=yes) ;
set dm.orderFactUTF;
do i = 0 to 1250;
customer_id = customer_id + (i * 100000);
output;
end;
run;
proc cas;
table.tableDetails / caslib="dm" table="custDimBig";
table.tableDetails / caslib="dm" table="orderFactBig";
quit;
/* Run the test */
/* hash join */
data dm.joinHash(copies=0 replace=yes ) ;
length customer_id 8 customer_country $4 customer_gender $2;
if _N_=1 then do ;
declare hash h(dataset: "dm.custDimBig") ;
h.defineKey('customer_id') ;
h.defineData('Customer_Country','Customer_Gender') ;
h.defineDone() ;
end ;
set dm.orderFactBig;
rc=h.find() ;
run ;
proc cas;
table.fetch / to=500 table={caslib="dm" name="joinHash"};
quit;
Foundation SAS Left Join Test code:
libname geldat cvp "/gelcontent/data/OrionStarData/orstar";
libname sasdm "/gelcontent/data";
/* Remove the formats and the Ceda*/
data sasdm.custDimUTF;
format Customer_Country $4. Customer_Gender $2.;
set geldat.customer_dim;
run;
data sasdm.orderFactUTF;
format Order_Type 12.;
set geldat.order_fact;
run;
/* Increase the row count */
data sasdm.custDimBig ;
set sasdm.custDimUTF;
do i = 0 to 250;
customer_id = customer_id + (i * 100000);
output;
end;
run;
data sasdm.orderFactBig;
set sasdm.orderFactUTF;
do i = 0 to 250;
customer_id = customer_id + (i * 100000);
output;
end;
run;
/* Run the test */
/* sql join */
proc sql noprint;
create table sasdm.sqlJoin as
select a.*, b.customer_country, b.customer_gender
from sasdm.orderFactBig a left join sasdm.custDimBig b
on a.customer_id = b.customer_id;
quit;
proc print data=sasdm.joinHash(obs=500);
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.