Hi Mark,
In this article on lookups, I can propose a hybrid approach:
For big tables like Customers and Orders, sort + DATA step merge is the fastest technique and can stay below a 2 GB memory limit.
To avoid multiple DATA step merge steps, you can add two hash tables to look up Country and Product during the DATA step merge of Orders and Customers.
That way you get the best of both worlds: fast on-disk lookups via the DATA step merge for the bigger tables, and in-memory hash tables for the lookups against the smaller tables.
I also pay attention to the lengths of the two variables Product and Name: since the CATX function defaults to a length of 200 bytes, this could increase I/O.
So I initialize Product to $85 and Name to $50 with LENGTH statements, so as not to waste any disk space when writing them out.
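To see that default in action, here is a small illustrative DATA step (the variable names and values are mine): a variable assigned from CATX without a LENGTH statement gets length $200, while a declared one keeps its declared length. VLENGTH reports the compile-time length:

```sas
data _null_;
   x = catx(' by ', 'Widget', 'Acme');   /* no LENGTH statement: x defaults to $200 */
   length y $85;
   y = catx(' by ', 'Widget', 'Acme');   /* y keeps the declared $85 */
   len_x = vlength(x);                   /* 200 */
   len_y = vlength(y);                   /* 85  */
   put len_x= len_y=;
run;
```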
Total elapsed time for this on my computer is between 00:15 and 00:20 seconds, the same as FedSQL and sometimes even faster.
With the FULLSTIMER system option, you can also monitor memory usage alongside elapsed time and CPU time.
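FULLSTIMER is a system option, so a single statement turns it on for the rest of the session (the exact statistics printed in the log vary by host operating system):

```sas
options fullstimer;
/* Every subsequent step now writes memory and OS-level resource
   usage to the log, in addition to real time and CPU time */
```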
With PROC FEDSQL's _METHOD option, you can see that it uses a hash join in the back end. Also impressive: FedSQL reports using only a limited amount of RAM (around 130 MB).
My sort + merge uses up to 1520 MB of RAM for the sort (my configuration has MEMSIZE set to 8G and SORTSIZE set to 4G); the DATA step merge itself uses only 2 MB of RAM.
The hash-complete method uses only 103 MB of memory and completes in 00:43 seconds.
PROC SQL's _METHOD option shows that the lookups use four sorts in the back end; it can take up to 00:46 seconds while having up to 3150 MB of memory at its disposal to perform the sorts and the merge join.
In more constrained environments, where less memory is available or where disk I/O does not come from fast SSDs, the performance of this step can degrade to more than three minutes.
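For reference, _METHOD goes on the PROC FEDSQL statement itself, and the chosen plan (hash join, sort-merge, ...) is then written to the log. A minimal sketch with a hypothetical two-table join on CustomerID, not the full query from the article:

```sas
proc fedsql _method;
   create table work.fedsql_check as
   select o.*, c.FirstName
   from work.orders o
   inner join lookup.Customers c
      on o.CustomerID = c.CustomerID;
quit;
/* Look for the hash join node in the plan printed to the log */
```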
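PROC SQL accepts the same diagnostic keyword; a sketch (again a simplified two-table join, not the article's full query) of how the execution tree is requested:

```sas
proc sql _method;
   create table work.sql_check as
   select o.*, c.FirstName
   from work.orders o, lookup.Customers c
   where o.CustomerID = c.CustomerID;
quit;
/* The log shows the execution tree, e.g. an sqxjm (sort-merge join)
   node fed by sqxsort (sort) nodes */
```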
/******************************************************************************
DATA step merge - Hash Hybrid Final
******************************************************************************/
/******************************************************************************
Preliminaries: Clear out the WORK library
******************************************************************************/
proc datasets library=work kill nolist nodetails;
run; quit;
%let StartTime=%qsysfunc(time());
proc sort data=lookup.orders out=orders;
   by CustomerID;
run;
data dsmergehash_final;
   if _n_=1 then do;
      if 0 then do;
         /* Get the lookup table variables into the PDV */
         set lookup.countrycodes;
         set lookup.Products(keep=ProductID Item Manufacturer);
         /*set lookup.Customers(keep=CustomerID FirstName MI LastName);*/
      end;
      /* Initialize the CountryCodes hash */
      declare hash codes(dataset:'lookup.countrycodes');
      codes.definekey('Country');
      codes.definedata('Country_Name');
      codes.definedone();
      /* Initialize the Products hash */
      declare hash hProd(dataset:'lookup.Products');
      hProd.definekey('ProductID');
      hProd.definedata('Item','Manufacturer');
      hProd.definedone();
   end;
   merge orders (in=ordered)
         lookup.Customers (keep=CustomerID FirstName MI LastName);
   by CustomerID;
   /* Clear the lookup fields when a key is not found, so values
      from a previous row do not linger in the PDV */
   if codes.find() ne 0 then call missing(Country_Name);
   if hProd.find() ne 0 then call missing(Item, Manufacturer);
   length Product $85;
   Product=catx(' by ',Item,Manufacturer);
   length Name $50;
   Name=catx('. ',catx(' ',FirstName,MI),LastName);
   if ordered;
   drop Item Manufacturer FirstName MI LastName;
run;
%let EndTime=%qsysfunc(time());
%let ElapsedTime=%qsysfunc(putn(%sysevalf(%superq(EndTime)-%superq(StartTime)),e8601tm.));
%put NOTE: &=ElapsedTime;
/******************************************************************************
Data validation
******************************************************************************/
title "DSMERGEHash_FINAL Info";
proc sql;
select memname, nlobs format=comma12., nvar
from dictionary.tables
where libname='WORK'
and memname ='DSMERGEHASH_FINAL'
;
quit;
title;