CAS Table Lookup (Left Outer Join) on SAS Viya

5 Likes

Inspired by Nicolas Robert's CAS Data Manipulation series, I'll focus this post on performing lookups in CAS. By "lookup," I mean a non-Cartesian, left join of a big table with a small table. Basically augmenting a transaction/analytical table with some reference data like adding a product description to an order table.

Sample Lookup

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Like Nicolas' other scenarios, CAS gives us numerous ways to perform lookups. Let's explore some of the options here and see which one works best. .

The Tests

This post’s test results were produced on a 4 worker node CAS system.

The data used in the tests consists of:

A 14 Million row Sales table (our "big" table that we're augmenting)

Three "lookup" tables

a Facility table with 130 rows/unique IDs

A Product table with 779 rows/unique IDs

a Vendor table with 148,567 rows/unique IDs

.

The Lookup Techniques

The following techniques were considered and tested on all three lookup tables against the sales table:

DS2 Merge

DATA Step Merge

FedSQL Join

Format Join

Hash Object Lookup

While the code changed slightly for each different lookup table (e.g. productSKU is used to lookup productMake and facility is used to lookup facilityStateProv, ...), the basic structure of each lookup technique is shown below. In each case, we are looking up the productMake field from the insightProduct table by the productSKU field to add it to the insightToySales table.

.

DS2 Merge

proc ds2 sessref=mySession ;
   thread join_th / overwrite=yes ;
      method run() ;
         merge dm.insightToySales(in=a) 
               dm.insightProduct (in=b keep=(productSKU productMake)) /RETAIN;
         by productSKU ;
        if a;
      end ;
   endthread ;
   data dm.joinDS2(copies=0) / overwrite=yes ;
      dcl thread join_th t ;
      method run() ;
         set from t threads=200 ;
      end ;
   enddata ;
run ;
quit ;

DATA Step Merge

data dm.JoinMerge;
merge dm.insightToySales (in=a) dm.insightProduct (in=b keep=productSKU productMake);
   by productSKU;
   if a;
run;

FedSQL Left Join

proc fedsql sessref=mySession _method;
create table dm.joinSQL {options replace=true} as
select a.*, b.productMake
from  dm.insightToySales a left join  dm.insightProduct b
  on a.productSKU = b.productSKU;
quit;

Format

proc format cntlin=dm.insightProduct (rename=(productSKU=start productMake=Label))
            sessref=mySession casfmtlib="USERFORMATS2";
run;

data dm.joinformat(copies=0 replace=yes);
  set  DM.INSIGHTTOYSALES;
  productMake = put(productSKU,$Product.);
run;

Hash Object

data dm.joinHash(copies=0 replace=yes ) ;
length productSKU $6 productMake $16;

   if _N_=1 then do ;
      declare hash h(dataset: "dm.insightProduct") ;
      h.defineKey('productSKU') ;
      h.defineData('productMake') ;
      h.defineDone() ;
      call missing(productSKU,productMake) ;
   end ;
   set dm.insightToySales ;
   rc=h.find() ;
run ;

Results

.

Conclusions and Observations

While the format join marginally beat the hash object on the smallest lookup table, the hash object seems to be the best lookup technique for low cardinality lookups.
Loading the format increased substantially for the Vendor table which had by far the most rows. The format load went from 0.19 seconds for the 779 row product table to 5+ seconds for the 148,567 row Vendor table, while the format application (DATA Step with the put() statement) took roughly the same time for both.
The FedSQL Join Optimizer chose the hash join method for all three lookup tables. This makes sense since all of the lookup tables are relatively small.
Indexing should not improve any technique's performance since CAS indexes are only used for subsetting operations and not join optimization.
At lower data volumes (1.4 million rows instead of 14 million), all of the techniques performed more-or-less the same.
All of the lookup techniques ran 100% in CAS and in a multi-threaded manner.
All of the lookup techniques, except the format, can easily accommodate returning multiple fields from the lookup table. For example, the hash defineData method can accommodate multiple fields. Returning multiple field values using a single format requires some clever string manipulation.
All of the lookup techniques can accommodate performing lookups to different tables in a single pass of the transaction/analytical table. To do this with the hash object, you define multiple hashes within the driver DATA Step. To do it with the format, you load the formats in separate steps and then apply them all within the same DATA Step.
The format technique requires some pre-planning as you need to add a fmtName field with the format name value (e.g. '$Product') to your lookup table.
Remember that PROC SQL does not run in CAS and if you use it or you use non-CAS-enabled DATA Step functionality, your CAS data will be pulled from CAS into the SAS Compute server and your performance will be far worse than anything shown above.
Remember also that, to get good performance, you must minimize disk I/O and unnecessary passes of the data. So where possible, you'll want to do as much in a single step as possible.
We set the threads parameter in the DS2 SET FROM statement to 200 so that it would default to the highest number possible on the environment. In our case, it was adjusted down to 4.

.

Find more articles from SAS Global Enablement and Learning here.