BookmarkSubscribeRSS Feed

CAS is Fast - Left Join

Started ‎08-26-2022 by
Modified ‎08-26-2022 by
Views 923

On my last post, a commenter asked about performing a left join with CAS versus foundation SAS ("Compute"). So I figured I'd give it a shot. I used the Orion Star's Order Fact and Customer Dim tables for the join and manipulated the data to achieve various record counts. The code is included below so you can see the actual tests. The results are summarized in the following table.


CAS vs Compute (Foundation SAS) Left Join Execution Time

Order Fact Row Count Customer Dim Row Count CAS Foundation
96,118,569 9,085,354 15.2 06:28.0
238,868,919 22,578,454 38.3 08:18.4
476,786,169 45,066,954 01:49.1 21:03.2
714,703,419 67,555,454 08:40.4 ******
952,620,669 90,043,954 11:35.4 ******
1,190,537,919 112,532,454 15:06.6 ******

 

******SAS Foundation ran out of space on the test since it is limited to a single server.

 

Note the following:

  • For CAS, we used a hash object which was shown to be the fastest left join CAS method previously.
  • For Foundation SAS (aka "Compute"), we used a Proc SQL Left Join. The hash object was preferred but failed due to physical server limitations at low volumes. Likewise a DATA step MERGE approach also ran out of space rather quickly because of the need to sort the data which temporarily creates two copies doubling the space required.
  • CAS easily outperformed foundation SAS in every data volume tested.
  • CAS was able to process much larger data volumes due to its multi-server nature. Compute was unable to process the equivalent data volumes simply because they wouldn't fit on a single server.

 

There was also a concern about CAS potentially thrashing at data volumes higher than available memory. I explored this a bit and didn't see much evidence of thrashing. For example in final test, the combined size of the order fact and customer dim tables was 264GB which is well over the combined available memory for the three worker nodes, 186GB. Despite this the performance curve seems more or less linear -- 714M rows at 8:40, 952M rows at 11:35, and 1.19B rows at 15:06.

 

Want to see the code?

 

If you want to the code from these tests so you can try something similar in your environment, use the code provided below.  You'll have to customize it to fit your data but it should give you a huge headstart.

 

CAS Left Join Test code:

 

options cashost="sas-cas-server-default-client" casport=5570;
CAS mySession SESSOPTS=(metrics=true);
CASLIB _all_ assign;

libname geldat cvp "/gelcontent/data/OrionStarData/orstar";
libname sasdm "/gelcontent/data";

/* Remove the formats and the Ceda*/
data sasdm.custDimUTF;
format Customer_Country $4. Customer_Gender $2.;
set geldat.customer_dim;
run;

data sasdm.orderFactUTF;
format Order_Type 12.;
set geldat.order_fact;
run;

/* Load the tables to CAS */
proc casutil ;
   load casdata="orderFactUTF.sas7bdat" incaslib="dm" casout="orderFactUTF" outcaslib="dm" copies=0
   importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;

proc casutil ;
   load casdata="custDimUTF.sas7bdat" incaslib="dm" casout="custDimUTF" outcaslib="dm" copies=0
   importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;


/* Increase the row count */
data dm.custDimBig(copies=0 duplicate=yes replace=yes) ;
     set dm.custDimUTF;
  do i = 0 to 1250;
     customer_id = customer_id + (i * 100000);
     output;
  end;
run;

data dm.orderFactBig(copies=0 replace=yes) ;
     set dm.orderFactUTF;
  do i = 0 to 1250;
     customer_id = customer_id + (i * 100000);
     output;
  end;
run;

proc cas;
table.tableDetails / caslib="dm" table="custDimBig";
table.tableDetails / caslib="dm" table="orderFactBig";
quit;

/* Run the test */

/* hash join */

data dm.joinHash(copies=0 replace=yes ) ;
length customer_id 8 customer_country $4 customer_gender $2;

   if _N_=1 then do ;
      declare hash h(dataset: "dm.custDimBig") ;
      h.defineKey('customer_id') ;
      h.defineData('Customer_Country','Customer_Gender') ;
      h.defineDone() ;
   end ;
   set dm.orderFactBig;
   rc=h.find() ;
run ;

proc cas;

table.fetch / to=500 table={caslib="dm" name="joinHash"};

quit;

 

Foundation SAS Left Join Test code:

 

libname geldat cvp "/gelcontent/data/OrionStarData/orstar";
libname sasdm "/gelcontent/data";

/* Remove the formats and the Ceda*/
data sasdm.custDimUTF;
format Customer_Country $4. Customer_Gender $2.;
set geldat.customer_dim;
run;

data sasdm.orderFactUTF;
format Order_Type 12.;
set geldat.order_fact;
run;

/* Increase the row count */
data sasdm.custDimBig ;
     set sasdm.custDimUTF;
  do i = 0 to 250;
     customer_id = customer_id + (i * 100000);
     output;
  end;
run;

data sasdm.orderFactBig;
     set sasdm.orderFactUTF;
  do i = 0 to 250;
     customer_id = customer_id + (i * 100000);
     output;
  end;
run;

/* Run the test */

/* sql join */

proc sql noprint;
create table sasdm.sqlJoin as
select a.*, b.customer_country, b.customer_gender
  from sasdm.orderFactBig a left join sasdm.custDimBig b
    on a.customer_id = b.customer_id;
quit;

proc print data=sasdm.joinHash(obs=500);
run;

 

Version history
Last update:
‎08-26-2022 07:37 AM
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags