BookmarkSubscribeRSS Feed

Manipulating Data in Base SAS® Part 4 – Lookup

Started ‎08-08-2024 by
Modified ‎08-13-2024 by
Views 1,498

Jordan_Mark_01_02.jpg There are several different ways to do lookups in SAS. In this post, I’ll look at techniques that can produce results similar to a left join of a big table with a small table, like might be done to enrich a transaction table with additional information. The techniques covered will include PROC SQL and FedSQL joins, DATA step and DS2 merges, hash objects, and custom formats. I'll execute code for each technique several times, using a small, medium, and large lookup table to enrich the data a very large existing table. I'll record the elapsed time required to produce the same result with the different methods. And finally, we'll look at combining all three lookups in a single program and assess the impact on overall program run time.

 

 

 

 

The Data

 

Table Rows Columns Size (MB) Comments
COUNTRYCODES 17 2 0.3 Look up country names
PRODUCTS 700 4 0.3 Look up item name and manufacturer
CUSTOMERS 1,099,519 12 149.9 Look up customer names
ORDERS 14,846,132 6 952.6 The big table to be enriched

 

Orders Sample

 

OrderID OrderDate CustomerID Country ProductID Quantity
023011_000100000043_0002 2023.01.01 000100000043 US 358 1
023011_000100000058_0014 2023.01.01 000100000058 US 632 2
023011_000100000105_0015 2023.01.01 000100000105 US 627 3

 

CountryCodes Sample

 

Country Country_Name
AD Andorra
AE United Arab Emirates
AF Afghanistan

 

Products Sample

 

ProductID Item MadeIn Manufacturer
1 Toncot Beefy-T Emb T-Shirt US A Team Sports
2 Trekking Tent >GB Prime Sports Ltd
3 Indoor Handbold Special Shoes US 3Top Sports

 

Customers Sample

 

CustomerID Gender FirstName MI LastName StreetAddress City State ZipCode Country Phone Birthday
000100000001 F Katherine M Thomas 3005 Brookside Drive Guin AL 35563 US 205-412-9602 01/01/1927
000100000002 M Anthony E Feldmann 4026 Maple Lane Huntsville AL 35802 US 256-522-5585 01/01/1927
000100000003 M Sam C Tennyson 2715 Mulberry Avenue Donaldson AR 71941 US 501-384-1324 01/01/19

 

 

The Scenarios

 

We have a large data table (lookup.orders) that we want to enrich with information from another table.

 

Scenario 1

 

The lookup.orders table contains a two-letter country code in the Country column. We want to add a new column, CountryName, that contains the full name of the country. Table lookup.CountryCodes also contains a two-letter country code (Country) and the full country name (CountryName).

 

Scenario 2

 

The lookup.orders table contains a numeric ProductID column. We want to add a new column, Product, that contains name of the item ordered and the manufacturer’s name. Table lookup.Products also contains a numeric ProductID column, and columns containing the item name (Item) and manufacturer information (Manufacturer).

 

Scenario 3

 

The lookup.orders table contains a numeric CustomerID column. We want to add a new column, Name, that contains the customer’s full name. Table lookup.Customers also contains a numeric CustomerID column, and columns containing the customers fist (FirstName), middle initial (MI), and last(LastName) name information.

 

Final Scenario

 

I'll write code to add all three enrichments to lookup.orders in a single program and measure the overall elapsed time for completion.

 

 

The Techniques

 

Data Step

 

  • Merge Sample Code

 

proc sort data=lookup.orders out=orders;
   by country;
run;
data dsmerge_countries;
   merge orders (in=ordered)
   lookup.countrycodes;
   by country;
   if ordered;
run;

 

  • Hash Object Sample Code

 

data dshash_countries;
    if _n_=1 then do;
       /* Get the lookup table variables into the PDV */
       if 0 then set lookup.countrycodes;
       /* Initialize the hash */
       declare hash codes(dataset:'lookup.countrycodes');
       codes.definekey('Country');
       codes.definedata('Country_Name');
       codes.defineDone();
    end;
    set lookup.orders;
    rc=codes.find();
    drop rc;
run;

 

  • Custom Formats Sample Code

 

 proc format cntlin=char_formats; run;
 proc format cntlin=num_formats; run;
 data dsFmt_countries;
    set lookup.orders;
    length CountryName $85;
    CountryName=put(country,$country.);
 run;

 

PROC SQL Left Join

 

proc sql;
create table sql_countries as
select o.*, Country_Name
   from 
      lookup.orders as o
      left join
      lookup.countrycodes as cc
   on o.country=cc.country
;
quit;

 

PROC FedSQL Left Join  

 

 proc FedSQL;
 create table sql_countries as
 select o.*, Country_Name
    from 
       lookup.orders as o
       left join
       lookup.countrycodes as cc
    on o.country=cc.country
    ;
    quit;

 

 

The Results 

 

Technique vs. Lookup Table Size 17 Rows (SEC) 700 Rows (SEC) 1.1M Rows (SEC) Get All 3 Values (MM:SS)
DATA Step Merge 12.01 19.45 19.85 01:02
DATA Step Hash Object 4.75 13.63 23.59 00:20
DATA Step Formats (3.27 sec to create) 3.85 4.47 7.38 00:20
PROC SQL LEFT JOIN 5.45 11.31 13.57 00:50
PROC FedSQL LEFT JOIN 10.07 14.50 13.05 00:24

 

 

Conclusions 

 

If the formats are already available, then formats look like the best all-around option. If the formats need to be created, the additional overhead should be considered - and the DATA step hash object looks like an attractive choice. But remember, hash object performance seems to degrade as the size of the lookup table increases, and both hash objects and formats can be memory intensive. PROC SQL performs quite reliably, but with the largest data FedSQL seems to have the edge. So, as usual, there isn't any "one size fits all" solution. When performance is critical, there's no substitute for formal benchmarking. But knowing a bit about each of these techniques can guide you to a "good enough" choice for ad hoc coding.

 

Do you have a favorite lookup technique for your SAS programs? Did the information presented here give you cause to rethink your choices?

Jordan_Mark_01_02.jpg
Until next time, may the SAS be with you!
Mark
 


Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

 

Links to prior posts in this series:

 

 

 

Find more articles from SAS Global Enablement and Learning here.

Comments

One place that I find custom INFORMAT for "look up" data when reading data is pretty powerful. Data steps reading external files with variables that should have known values but because of human behavior may not actually be the expected values use of the "other=_error_" immediately flags unexpected values with the invalid data messages in the log.

 

One example I use custom informats for validating location codes. The occasional typo or whatever may cause and erroneous value pops up quickly and I can contact the source for confirmation and correction of the source file. However every so often, maybe once every two or three years there is a new site created and the data analysis folks (me) aren't in on the implementation and find out from reading the data files. The invalid data messages also lets me contact the appropriate folks to get the metadata involved that the site code represents that we need for other processing to update the display Formats that use this code for reporting or processing.

@ballardw Oh, yeah!! Custom INFORMATs rock, and are underutilized by SAS programmers, IMHO.  I knew how to make a custom FORMAT for years before discovering that it was just as easy to make a custom INFORMAT.  Nice use case you cite there 👍

Good to see that the 'old-fashioned' custom FORMAT does so well! I get the impression that this is often overlooked in favour of the newer hash objects. IMHO custom FORMAT lookups are easier to code (one DATA step PUT function) as well as being fast. I have examples of a DATA step doing 40 plus custom format lookups. I'd hate to do that with hash objects.

 

It is also worth pointing out you can do custom FORMAT lookups in PROC SQL, which is not something you can do with hash objects.

@SASKiwi In my mind, the real beauty of formats is that they can be deployed anywhere an expression is allowed - like a PROC's WHERE expression. For example, say I need to know how many t-shirts have been sold in each country by year. I could do a bunch of joins and a summary SQL query, or just this:

proc means data=lookup.orders sum maxdec=0;
	class country orderdate;
	where lowcase(put(productID,product.)) contains 't-shirt';
	var Quantity;
	format country $country. orderdate year4.;
run;

 and  - violá!

 

Analysis Variable : Quantity
Country OrderDate N Obs Sum
Austria 2023 4996 15177
  2024 2519 7626
Australia 2023 4960 14908
  2024 2499 7474
United States 2023 315923 947540
  2024 157951 473777

 

Pav

Hi Mark, 

In this article on lookup, I can propose a hybrid approach: 

For big tables like customer and orders, the sort + datastep merge is the fasted technique and can stay below a 2Gb Memory limit.

In order to avoid multiple data step merge steps, you could  include 2 hash tables to lookup country and product during the data step merge of orders and customer. 

That way you get the best of 2 worlds: fast on disk lookups with data step merge for bigger tables and in memory hash tables for lookups with smaller tables. 

I also pay attention to the length of the 2 variables product and name, since the function CATX defaults to a length of 200bytes, this could increase IO. 

So I initialize the variables Product to $85 and Name to $50 with a length statement, just not to waste any diskspace when writing them out. 

Total Elapsed Time for this on my computer is between 00:15 and 00:20 secs, same as FEDSQL and sometimes even faster. 

With the option fullstimer, you can also monitor Memory usage, next to elapsed time and cpu time. 

With Proc FEDSQL _method, you can see that it is using a HashJoin in the backend. Also impressive is that the FedSQL reports using limited amounts of RAM Memory (around 130Mb).  

My Sort + Merge uses for sorting up to 1520Mb or RAM Memory (i have a configuration options of memsize of 8G, sortsize setting of 4G), the data step merge uses only 2Mb or Ram Memory. 

The Hash Complete method only uses 103Mb of Memory and completes in 00:43secs. 

Proc SQL _method shows that the lookups use 4 sorts in the backend, and can use up to 00:46secs of time when having up to 3150Mb of Memory at its disposal to perform the sorts and join merge.

In more constrained environments where less Memory is available or where disk IO is not very fast SSD disks, then the performance of this step can degrade to more than 03min. 

/****************************************************************************** 
 DATA step merge - Hash Hybrid Final
******************************************************************************/
/****************************************************************************** 
 Preliminaries: Clear out the WORK library
******************************************************************************/
proc datasets library=work kill nolist nodetails;
run; quit;
 
%let StartTime=%qsysfunc(time());
 
 
proc sort data=lookup.orders out=orders;
by CustomerID;
run;
 
data dsmergehash_final;
if _n_=1 then do;
   if 0 then do;
   /* Get the lookup table variables into the PDV */
set lookup.countrycodes;
    set lookup.Products(keep=ProductID Item Manufacturer);
     /*set lookup.Customers(keep=CustomerID FirstName MI LastName);*/
end;
/* Initialize the CountryCodes hash */
declare hash codes(dataset:'lookup.countrycodes');
codes.definekey('Country');
codes.definedata('Country_Name');
codes.defineDone();
/* Initialize the Products hash */
declare hash hProd(dataset:'lookup.Products ');
hProd.definekey('ProductID');
hProd.definedata('Item','Manufacturer');
hProd.defineDone();
end;
merge orders (in=ordered)
lookup.Customers (keep=CustomerID FirstName MI LastName) ;
by CustomerID;
rc=codes.find();
rc=hProd.find();
length Product $85; 
Product=catx(' by ',Item,Manufacturer);
length Name $50;
Name=catx('. ',catx(' ',FirstName,MI),LastName);
if ordered;
drop rc Item Manufacturer FirstName MI LastName;
run;
 
%let EndTime=%qsysfunc(time());
%let ElapsedTime=%qsysfunc(putn(%sysevalf(%superq(EndTime)-%superq(StartTime)),e8601tm.));
%put NOTE: &=ElapsedTime;
/****************************************************************************** 
Data validation
******************************************************************************/
title "DSMERGEHash_FINAL Info";
proc sql;
select memname, nlobs format=comma12., nvar 
from dictionary.tables
where libname='WORK'
  and memname ='DSMERGEHASH_FINAL'
;
quit;
title;

@Pav - an absolutely brilliant approach! I love the idea of combining techniques to get even faster performance. Thanks for sharing your idea with me (and the world) - this is one of the many reasons I love being a SAS programmer.

All the best,

Mark

 

Hey Mark,

 

How about about the 53(+3) ways of doing tables look-up in SAS?

Here is the "Fifty Shades of SAS Programming, or 53 (+3) Syntax Snippets for a Table Look-up Task, or How to Lear..." article  from WUSS 2024 conference, co-authored by me (@yabwon) and Quentin (@Quentin).

 

All the best

Bart

@yabwon Bart - that was an awesome presentation! IT was good seeing you at WUSS 👍 

Version history
Last update:
‎08-13-2024 04:23 PM
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags