There are several different ways to do lookups in SAS. In this post, I’ll look at techniques that can produce results similar to a left join of a big table with a small table, like might be done to enrich a transaction table with additional information. The techniques covered will include PROC SQL and FedSQL joins, DATA step and DS2 merges, hash objects, and custom formats. I'll execute code for each technique several times, using a small, medium, and large lookup table to enrich the data a very large existing table. I'll record the elapsed time required to produce the same result with the different methods. And finally, we'll look at combining all three lookups in a single program and assess the impact on overall program run time.
Table | Rows | Columns | Size (MB) | Comments |
COUNTRYCODES | 17 | 2 | 0.3 | Look up country names |
PRODUCTS | 700 | 4 | 0.3 | Look up item name and manufacturer |
CUSTOMERS | 1,099,519 | 12 | 149.9 | Look up customer names |
ORDERS | 14,846,132 | 6 | 952.6 | The big table to be enriched |
OrderID | OrderDate | CustomerID | Country | ProductID | Quantity |
023011_000100000043_0002 | 2023.01.01 | 000100000043 | US | 358 | 1 |
023011_000100000058_0014 | 2023.01.01 | 000100000058 | US | 632 | 2 |
023011_000100000105_0015 | 2023.01.01 | 000100000105 | US | 627 | 3 |
Country | Country_Name |
AD | Andorra |
AE | United Arab Emirates |
AF | Afghanistan |
ProductID | Item | MadeIn | Manufacturer |
1 | Toncot Beefy-T Emb T-Shirt | US | A Team Sports |
2 | Trekking Tent | >GB | Prime Sports Ltd |
3 | Indoor Handbold Special Shoes | US | 3Top Sports |
CustomerID | Gender | FirstName | MI | LastName | StreetAddress | City | State | ZipCode | Country | Phone | Birthday |
000100000001 | F | Katherine | M | Thomas | 3005 Brookside Drive | Guin | AL | 35563 | US | 205-412-9602 | 01/01/1927 |
000100000002 | M | Anthony | E | Feldmann | 4026 Maple Lane | Huntsville | AL | 35802 | US | 256-522-5585 | 01/01/1927 |
000100000003 | M | Sam | C | Tennyson | 2715 Mulberry Avenue | Donaldson | AR | 71941 | US | 501-384-1324 | 01/01/19 |
We have a large data table (lookup.orders) that we want to enrich with information from another table.
The lookup.orders table contains a two-letter country code in the Country column. We want to add a new column, CountryName, that contains the full name of the country. Table lookup.CountryCodes also contains a two-letter country code (Country) and the full country name (CountryName).
The lookup.orders table contains a numeric ProductID column. We want to add a new column, Product, that contains name of the item ordered and the manufacturer’s name. Table lookup.Products also contains a numeric ProductID column, and columns containing the item name (Item) and manufacturer information (Manufacturer).
The lookup.orders table contains a numeric CustomerID column. We want to add a new column, Name, that contains the customer’s full name. Table lookup.Customers also contains a numeric CustomerID column, and columns containing the customers fist (FirstName), middle initial (MI), and last(LastName) name information.
I'll write code to add all three enrichments to lookup.orders in a single program and measure the overall elapsed time for completion.
proc sort data=lookup.orders out=orders;
by country;
run;
data dsmerge_countries;
merge orders (in=ordered)
lookup.countrycodes;
by country;
if ordered;
run;
data dshash_countries;
if _n_=1 then do;
/* Get the lookup table variables into the PDV */
if 0 then set lookup.countrycodes;
/* Initialize the hash */
declare hash codes(dataset:'lookup.countrycodes');
codes.definekey('Country');
codes.definedata('Country_Name');
codes.defineDone();
end;
set lookup.orders;
rc=codes.find();
drop rc;
run;
proc format cntlin=char_formats; run;
proc format cntlin=num_formats; run;
data dsFmt_countries;
set lookup.orders;
length CountryName $85;
CountryName=put(country,$country.);
run;
proc sql;
create table sql_countries as
select o.*, Country_Name
from
lookup.orders as o
left join
lookup.countrycodes as cc
on o.country=cc.country
;
quit;
proc FedSQL;
create table sql_countries as
select o.*, Country_Name
from
lookup.orders as o
left join
lookup.countrycodes as cc
on o.country=cc.country
;
quit;
Technique vs. Lookup Table Size | 17 Rows (SEC) | 700 Rows (SEC) | 1.1M Rows (SEC) | Get All 3 Values (MM:SS) |
DATA Step Merge | 12.01 | 19.45 | 19.85 | 01:02 |
DATA Step Hash Object | 4.75 | 13.63 | 23.59 | 00:20 |
DATA Step Formats (3.27 sec to create) | 3.85 | 4.47 | 7.38 | 00:20 |
PROC SQL LEFT JOIN | 5.45 | 11.31 | 13.57 | 00:50 |
PROC FedSQL LEFT JOIN | 10.07 | 14.50 | 13.05 | 00:24 |
If the formats are already available, then formats look like the best all-around option. If the formats need to be created, the additional overhead should be considered - and the DATA step hash object looks like an attractive choice. But remember, hash object performance seems to degrade as the size of the lookup table increases, and both hash objects and formats can be memory intensive. PROC SQL performs quite reliably, but with the largest data FedSQL seems to have the edge. So, as usual, there isn't any "one size fits all" solution. When performance is critical, there's no substitute for formal benchmarking. But knowing a bit about each of these techniques can guide you to a "good enough" choice for ad hoc coding.
Do you have a favorite lookup technique for your SAS programs? Did the information presented here give you cause to rethink your choices?
Until next time, may the SAS be with you!
Mark
Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Find more articles from SAS Global Enablement and Learning here.
One place that I find custom INFORMAT for "look up" data when reading data is pretty powerful. Data steps reading external files with variables that should have known values but because of human behavior may not actually be the expected values use of the "other=_error_" immediately flags unexpected values with the invalid data messages in the log.
One example I use custom informats for validating location codes. The occasional typo or whatever may cause and erroneous value pops up quickly and I can contact the source for confirmation and correction of the source file. However every so often, maybe once every two or three years there is a new site created and the data analysis folks (me) aren't in on the implementation and find out from reading the data files. The invalid data messages also lets me contact the appropriate folks to get the metadata involved that the site code represents that we need for other processing to update the display Formats that use this code for reporting or processing.
@ballardw Oh, yeah!! Custom INFORMATs rock, and are underutilized by SAS programmers, IMHO. I knew how to make a custom FORMAT for years before discovering that it was just as easy to make a custom INFORMAT. Nice use case you cite there 👍
Good to see that the 'old-fashioned' custom FORMAT does so well! I get the impression that this is often overlooked in favour of the newer hash objects. IMHO custom FORMAT lookups are easier to code (one DATA step PUT function) as well as being fast. I have examples of a DATA step doing 40 plus custom format lookups. I'd hate to do that with hash objects.
It is also worth pointing out you can do custom FORMAT lookups in PROC SQL, which is not something you can do with hash objects.
@SASKiwi In my mind, the real beauty of formats is that they can be deployed anywhere an expression is allowed - like a PROC's WHERE expression. For example, say I need to know how many t-shirts have been sold in each country by year. I could do a bunch of joins and a summary SQL query, or just this:
proc means data=lookup.orders sum maxdec=0;
class country orderdate;
where lowcase(put(productID,product.)) contains 't-shirt';
var Quantity;
format country $country. orderdate year4.;
run;
and - violá!
Analysis Variable : Quantity | |||
---|---|---|---|
Country | OrderDate | N Obs | Sum |
Austria | 2023 | 4996 | 15177 |
2024 | 2519 | 7626 | |
Australia | 2023 | 4960 | 14908 |
2024 | 2499 | 7474 | |
United States | 2023 | 315923 | 947540 |
2024 | 157951 | 473777 |
Hi Mark,
In this article on lookup, I can propose a hybrid approach:
For big tables like customer and orders, the sort + datastep merge is the fasted technique and can stay below a 2Gb Memory limit.
In order to avoid multiple data step merge steps, you could include 2 hash tables to lookup country and product during the data step merge of orders and customer.
That way you get the best of 2 worlds: fast on disk lookups with data step merge for bigger tables and in memory hash tables for lookups with smaller tables.
I also pay attention to the length of the 2 variables product and name, since the function CATX defaults to a length of 200bytes, this could increase IO.
So I initialize the variables Product to $85 and Name to $50 with a length statement, just not to waste any diskspace when writing them out.
Total Elapsed Time for this on my computer is between 00:15 and 00:20 secs, same as FEDSQL and sometimes even faster.
With the option fullstimer, you can also monitor Memory usage, next to elapsed time and cpu time.
With Proc FEDSQL _method, you can see that it is using a HashJoin in the backend. Also impressive is that the FedSQL reports using limited amounts of RAM Memory (around 130Mb).
My Sort + Merge uses for sorting up to 1520Mb or RAM Memory (i have a configuration options of memsize of 8G, sortsize setting of 4G), the data step merge uses only 2Mb or Ram Memory.
The Hash Complete method only uses 103Mb of Memory and completes in 00:43secs.
Proc SQL _method shows that the lookups use 4 sorts in the backend, and can use up to 00:46secs of time when having up to 3150Mb of Memory at its disposal to perform the sorts and join merge.
In more constrained environments where less Memory is available or where disk IO is not very fast SSD disks, then the performance of this step can degrade to more than 03min.
@Pav - an absolutely brilliant approach! I love the idea of combining techniques to get even faster performance. Thanks for sharing your idea with me (and the world) - this is one of the many reasons I love being a SAS programmer.
All the best,
Mark
Hey Mark,
How about about the 53(+3) ways of doing tables look-up in SAS?
Here is the "Fifty Shades of SAS Programming, or 53 (+3) Syntax Snippets for a Table Look-up Task, or How to Lear..." article from WUSS 2024 conference, co-authored by me (@yabwon) and Quentin (@Quentin).
All the best
Bart
@yabwon Bart - that was an awesome presentation! IT was good seeing you at WUSS 👍
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.