Inspired by Nicolas Robert's CAS Data Manipulation series, I'll focus this post on performing lookups in CAS. By "lookup," I mean a non-Cartesian, left join of a big table with a small table. Basically augmenting a transaction/analytical table with some reference data like adding a product description to an order table.
Sample Lookup
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
Like Nicolas' other scenarios, CAS gives us numerous ways to perform lookups. Let's explore some of the options here and see which one works best. .
This post’s test results were produced on a 4 worker node CAS system.
The data used in the tests consists of:
- A 14 Million row Sales table (our "big" table that we're augmenting)
- Three "lookup" tables
- a Facility table with 130 rows/unique IDs
- A Product table with 779 rows/unique IDs
- a Vendor table with 148,567 rows/unique IDs
.
The following techniques were considered and tested on all three lookup tables against the sales table:
- DS2 Merge
- DATA Step Merge
- FedSQL Join
- Format Join
- Hash Object Lookup
While the code changed slightly for each different lookup table (e.g. productSKU is used to lookup productMake and facility is used to lookup facilityStateProv, ...), the basic structure of each lookup technique is shown below. In each case, we are looking up the productMake field from the insightProduct table by the productSKU field to add it to the insightToySales table.
.
proc ds2 sessref=mySession ;
thread join_th / overwrite=yes ;
method run() ;
merge dm.insightToySales(in=a)
dm.insightProduct (in=b keep=(productSKU productMake)) /RETAIN;
by productSKU ;
if a;
end ;
endthread ;
data dm.joinDS2(copies=0) / overwrite=yes ;
dcl thread join_th t ;
method run() ;
set from t threads=200 ;
end ;
enddata ;
run ;
quit ;
data dm.JoinMerge;
merge dm.insightToySales (in=a) dm.insightProduct (in=b keep=productSKU productMake);
by productSKU;
if a;
run;
proc fedsql sessref=mySession _method;
create table dm.joinSQL {options replace=true} as
select a.*, b.productMake
from dm.insightToySales a left join dm.insightProduct b
on a.productSKU = b.productSKU;
quit;
proc format cntlin=dm.insightProduct (rename=(productSKU=start productMake=Label))
sessref=mySession casfmtlib="USERFORMATS2";
run;
data dm.joinformat(copies=0 replace=yes);
set DM.INSIGHTTOYSALES;
productMake = put(productSKU,$Product.);
run;
data dm.joinHash(copies=0 replace=yes ) ;
length productSKU $6 productMake $16;
if _N_=1 then do ;
declare hash h(dataset: "dm.insightProduct") ;
h.defineKey('productSKU') ;
h.defineData('productMake') ;
h.defineDone() ;
call missing(productSKU,productMake) ;
end ;
set dm.insightToySales ;
rc=h.find() ;
run ;
.
.
Find more articles from SAS Global Enablement and Learning here.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.