Manipulating Data in Base SAS® Part 4 – Lookup

8 Likes

There are several different ways to do lookups in SAS. In this post, I’ll look at techniques that can produce results similar to a left join of a big table with a small table, like might be done to enrich a transaction table with additional information. The techniques covered will include PROC SQL and FedSQL joins, DATA step and DS2 merges, hash objects, and custom formats. I'll execute code for each technique several times, using a small, medium, and large lookup table to enrich the data a very large existing table. I'll record the elapsed time required to produce the same result with the different methods. And finally, we'll look at combining all three lookups in a single program and assess the impact on overall program run time.

The Data

Table	Rows	Columns	Size (MB)	Comments
COUNTRYCODES	17	2	0.3	Look up country names
PRODUCTS	700	4	0.3	Look up item name and manufacturer
CUSTOMERS	1,099,519	12	149.9	Look up customer names
ORDERS	14,846,132	6	952.6	The big table to be enriched

Orders Sample

OrderID	OrderDate	CustomerID	Country	ProductID	Quantity
023011_000100000043_0002	2023.01.01	000100000043	US	358	1
023011_000100000058_0014	2023.01.01	000100000058	US	632	2
023011_000100000105_0015	2023.01.01	000100000105	US	627	3

CountryCodes Sample

Country	Country_Name
AD	Andorra
AE	United Arab Emirates
AF	Afghanistan

Products Sample

ProductID	Item	MadeIn	Manufacturer
1	Toncot Beefy-T Emb T-Shirt	US	A Team Sports
2	Trekking Tent	>GB	Prime Sports Ltd
3	Indoor Handbold Special Shoes	US	3Top Sports

Customers Sample

CustomerID	Gender	FirstName	MI	LastName	StreetAddress	City	State	ZipCode	Country	Phone	Birthday
000100000001	F	Katherine	M	Thomas	3005 Brookside Drive	Guin	AL	35563	US	205-412-9602	01/01/1927
000100000002	M	Anthony	E	Feldmann	4026 Maple Lane	Huntsville	AL	35802	US	256-522-5585	01/01/1927
000100000003	M	Sam	C	Tennyson	2715 Mulberry Avenue	Donaldson	AR	71941	US	501-384-1324	01/01/19

The Scenarios

We have a large data table (lookup.orders) that we want to enrich with information from another table.

Scenario 1

The lookup.orders table contains a two-letter country code in the Country column. We want to add a new column, CountryName, that contains the full name of the country. Table lookup.CountryCodes also contains a two-letter country code (Country) and the full country name (CountryName).

Scenario 2

The lookup.orders table contains a numeric ProductID column. We want to add a new column, Product, that contains name of the item ordered and the manufacturer’s name. Table lookup.Products also contains a numeric ProductID column, and columns containing the item name (Item) and manufacturer information (Manufacturer).

Scenario 3

The lookup.orders table contains a numeric CustomerID column. We want to add a new column, Name, that contains the customer’s full name. Table lookup.Customers also contains a numeric CustomerID column, and columns containing the customers fist (FirstName), middle initial (MI), and last(LastName) name information.

Final Scenario

I'll write code to add all three enrichments to lookup.orders in a single program and measure the overall elapsed time for completion.

The Techniques

Data Step

Merge Sample Code

proc sort data=lookup.orders out=orders;
   by country;
run;
data dsmerge_countries;
   merge orders (in=ordered)
   lookup.countrycodes;
   by country;
   if ordered;
run;

Hash Object Sample Code

data dshash_countries;
    if _n_=1 then do;
       /* Get the lookup table variables into the PDV */
       if 0 then set lookup.countrycodes;
       /* Initialize the hash */
       declare hash codes(dataset:'lookup.countrycodes');
       codes.definekey('Country');
       codes.definedata('Country_Name');
       codes.defineDone();
    end;
    set lookup.orders;
    rc=codes.find();
    drop rc;
run;

Custom Formats Sample Code

 proc format cntlin=char_formats; run;
 proc format cntlin=num_formats; run;
 data dsFmt_countries;
    set lookup.orders;
    length CountryName $85;
    CountryName=put(country,$country.);
 run;

PROC SQL Left Join

proc sql;
create table sql_countries as
select o.*, Country_Name
   from 
      lookup.orders as o
      left join
      lookup.countrycodes as cc
   on o.country=cc.country
;
quit;

PROC FedSQL Left Join

 proc FedSQL;
 create table sql_countries as
 select o.*, Country_Name
    from 
       lookup.orders as o
       left join
       lookup.countrycodes as cc
    on o.country=cc.country
    ;
    quit;

The Results

Technique vs. Lookup Table Size	17 Rows (SEC)	700 Rows (SEC)	1.1M Rows (SEC)	Get All 3 Values (MM:SS)
DATA Step Merge	12.01	19.45	19.85	01:02
DATA Step Hash Object	4.75	13.63	23.59	00:20
DATA Step Formats (3.27 sec to create)	3.85	4.47	7.38	00:20
PROC SQL LEFT JOIN	5.45	11.31	13.57	00:50
PROC FedSQL LEFT JOIN	10.07	14.50	13.05	00:24

Conclusions

If the formats are already available, then formats look like the best all-around option. If the formats need to be created, the additional overhead should be considered - and the DATA step hash object looks like an attractive choice. But remember, hash object performance seems to degrade as the size of the lookup table increases, and both hash objects and formats can be memory intensive. PROC SQL performs quite reliably, but with the largest data FedSQL seems to have the edge. So, as usual, there isn't any "one size fits all" solution. When performance is critical, there's no substitute for formal benchmarking. But knowing a bit about each of these techniques can guide you to a "good enough" choice for ad hoc coding.

Do you have a favorite lookup technique for your SAS programs? Did the information presented here give you cause to rethink your choices?

Until next time, may the SAS be with you!
Mark

Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:

Find more articles from SAS Global Enablement and Learning here.

ballardw · ‎08-08-2024

One place that I find custom INFORMAT for "look up" data when reading data is pretty powerful. Data steps reading external files with variables that should have known values but because of human behavior may not actually be the expected values use of the "other=_error_" immediately flags unexpected values with the invalid data messages in the log.

One example I use custom informats for validating location codes. The occasional typo or whatever may cause and erroneous value pops up quickly and I can contact the source for confirmation and correction of the source file. However every so often, maybe once every two or three years there is a new site created and the data analysis folks (me) aren't in on the implementation and find out from reading the data files. The invalid data messages also lets me contact the appropriate folks to get the metadata involved that the site code represents that we need for other processing to update the display Formats that use this code for reporting or processing.

SASJedi · ‎08-08-2024

@ballardw Oh, yeah!! Custom INFORMATs rock, and are underutilized by SAS programmers, IMHO. I knew how to make a custom FORMAT for years before discovering that it was just as easy to make a custom INFORMAT. Nice use case you cite there 👍

SASKiwi · ‎08-09-2024

Good to see that the 'old-fashioned' custom FORMAT does so well! I get the impression that this is often overlooked in favour of the newer hash objects. IMHO custom FORMAT lookups are easier to code (one DATA step PUT function) as well as being fast. I have examples of a DATA step doing 40 plus custom format lookups. I'd hate to do that with hash objects.

It is also worth pointing out you can do custom FORMAT lookups in PROC SQL, which is not something you can do with hash objects.

SASJedi · ‎08-12-2024

@SASKiwi In my mind, the real beauty of formats is that they can be deployed anywhere an expression is allowed - like a PROC's WHERE expression. For example, say I need to know how many t-shirts have been sold in each country by year. I could do a bunch of joins and a summary SQL query, or just this:

proc means data=lookup.orders sum maxdec=0;
	class country orderdate;
	where lowcase(put(productID,product.)) contains 't-shirt';
	var Quantity;
	format country $country. orderdate year4.;
run;

and - violá!

Analysis Variable : Quantity
Country	OrderDate	N Obs	Sum
Austria	2023	4996	15177
	2024	2519	7626
Australia	2023	4960	14908
	2024	2499	7474
United States	2023	315923	947540
	2024	157951	473777

Pav · ‎08-22-2024

Hi Mark,

In this article on lookup, I can propose a hybrid approach:

For big tables like customer and orders, the sort + datastep merge is the fasted technique and can stay below a 2Gb Memory limit.

In order to avoid multiple data step merge steps, you could include 2 hash tables to lookup country and product during the data step merge of orders and customer.

That way you get the best of 2 worlds: fast on disk lookups with data step merge for bigger tables and in memory hash tables for lookups with smaller tables.

I also pay attention to the length of the 2 variables product and name, since the function CATX defaults to a length of 200bytes, this could increase IO.

So I initialize the variables Product to $85 and Name to $50 with a length statement, just not to waste any diskspace when writing them out.

Total Elapsed Time for this on my computer is between 00:15 and 00:20 secs, same as FEDSQL and sometimes even faster.

With the option fullstimer, you can also monitor Memory usage, next to elapsed time and cpu time.

With Proc FEDSQL _method, you can see that it is using a HashJoin in the backend. Also impressive is that the FedSQL reports using limited amounts of RAM Memory (around 130Mb).

My Sort + Merge uses for sorting up to 1520Mb or RAM Memory (i have a configuration options of memsize of 8G, sortsize setting of 4G), the data step merge uses only 2Mb or Ram Memory.

The Hash Complete method only uses 103Mb of Memory and completes in 00:43secs.

Proc SQL _method shows that the lookups use 4 sorts in the backend, and can use up to 00:46secs of time when having up to 3150Mb of Memory at its disposal to perform the sorts and join merge.

In more constrained environments where less Memory is available or where disk IO is not very fast SSD disks, then the performance of this step can degrade to more than 03min.

/******************************************************************************

DATA step merge - Hash Hybrid Final

******************************************************************************/

/******************************************************************************

Preliminaries: Clear out the WORK library

******************************************************************************/

proc datasets library=work kill nolist nodetails;

run; quit;

%let StartTime=%qsysfunc(time());

proc sort data=lookup.orders out=orders;

by CustomerID;

run;

data dsmergehash_final;

if _n_=1 then do;

if 0 then do;

/* Get the lookup table variables into the PDV */

set lookup.countrycodes;

set lookup.Products(keep=ProductID Item Manufacturer);

/*set lookup.Customers(keep=CustomerID FirstName MI LastName);*/

end;

/* Initialize the CountryCodes hash */

declare hash codes(dataset:'lookup.countrycodes');

codes.definekey('Country');

codes.definedata('Country_Name');

codes.defineDone();

/* Initialize the Products hash */

declare hash hProd(dataset:'lookup.Products ');

hProd.definekey('ProductID');

hProd.definedata('Item','Manufacturer');

hProd.defineDone();

end;

merge orders (in=ordered)

lookup.Customers (keep=CustomerID FirstName MI LastName) ;

by CustomerID;

rc=codes.find();

rc=hProd.find();

length Product $85;

Product=catx(' by ',Item,Manufacturer);

length Name $50;

Name=catx('. ',catx(' ',FirstName,MI),LastName);

if ordered;

drop rc Item Manufacturer FirstName MI LastName;

run;

%let EndTime=%qsysfunc(time());

%let ElapsedTime=%qsysfunc(putn(%sysevalf(%superq(EndTime)-%superq(StartTime)),e8601tm.));

%put NOTE: &=ElapsedTime;

/******************************************************************************

Data validation

******************************************************************************/

title "DSMERGEHash_FINAL Info";

proc sql;

select memname, nlobs format=comma12., nvar

from dictionary.tables

where libname='WORK'

and memname ='DSMERGEHASH_FINAL'

;

quit;

title;

SASJedi · ‎08-22-2024

@Pav - an absolutely brilliant approach! I love the idea of combining techniques to get even faster performance. Thanks for sharing your idea with me (and the world) - this is one of the many reasons I love being a SAS programmer.

All the best,

Mark

yabwon

Hey Mark,

How about about the 53(+3) ways of doing tables look-up in SAS?

Here is the "Fifty Shades of SAS Programming, or 53 (+3) Syntax Snippets for a Table Look-up Task, or How to Lear..." article from WUSS 2024 conference, co-authored by me (@yabwon) and Quentin (@Quentin).

All the best

Bart

SASJedi

@yabwon Bart - that was an awesome presentation! IT was good seeing you at WUSS 👍