Huge data is everywhere nowadays. To make sense of it, we must often aggregate data in meaningful ways. In this post, I’ll cover several methods for producing aggregate data sets, including PROC FREQ, PROC SUMMARY, PROC TABLULATE, SQL, DATA step, and DS2. I'll experiment with using a hash object to summarize in a DATA step without having to pre-sort the data, a technique I discovered while reading a SESUG 2024 paper submission by David V. Cole of the Department of Population Medicine, Harvard Pilgrim Health Care Institute. The paper is titled "This Works on So Many Levels: Building Summary Tables in SAS® with Hash Programming". SESUG 2024 begins Sunday, 9/22/2024 in Bethesda, MD. If you are in the area, I hope to see you there!
The Desired Aggregations
Here are the aggregations we want for this data:
How many customers do we have in each country?
What is the sales volume (total number of items sold) for each country?
For each country, how many items purchased were produced domestically vs. imported?
The Aggregation Techniques
Base SAS provides several methods of performing data aggregation. Each has its own pros and cons. We’ll look at the ease of coding and overall performance of each technique in our scenario above. Our techniques will include:
PROC FREQ / SUMMARY
PROC SQL with GROUP BY
DATA step with FIRST./LAST. processing
DS2 data program with SQL query on SET and FIRST./LAST. processing
PROC FREQ/SUMMARY Aggregation
/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use PROC SORT to deduplicate. */
proc sort data=lookup.orders (keep=CustomerID)
out=orders(rename=(customerID=Country)) nodupkey;
by customerID;
run;
/* The customer_country_name format displays the customer's country of residence */
proc freq data=orders order=freq;
table Country / noprint nopercent
out=country_customers_freq(drop=percent );
format Country customer_country_name.;
run;
proc summary data=orders N NONOBS;
class Country;
format Country customer_country_name.;
var Country;
output out=country_customers_summary(rename=(_FREQ_=Count _STAT_=Statistic)
where=(Statistic='N' and Country is not null)
drop=_TYPE_);
run;
/******************************************************************************
FREQ real time 0.80 seconds
SUMMARY real time 0.20 seconds <- The winner for speed in this example
******************************************************************************/
/* 2. Which country had the highest sales volume (number of items sold?) */
/* No need to deduplicate. */
proc freq data=lookup.orders (keep=Country) order=freq;
table Country / noprint nopercent out=country_items_freq(drop=percent);
format Country $Country_name.;
run;
/* Can't use PROC SUMMARY to aggregate non-numeric values, so must use CustomerID */
proc summary data=lookup.orders(keep=CustomerID rename=(CustomerID=Country)) N NONOBS;
class Country;
format Country customer_country_name.;
var Country;
output out=country_items_summary(rename=(_FREQ_=Count _STAT_=Statistic)
where=(Statistic='N' and Country is not null)
drop=_TYPE_);
run;
/******************************************************************************
FREQ real time 0.70 seconds <- the winner when counting character values
SUMMARY real time 1.71 seconds
******************************************************************************/
/*3. For each country, how many items purchased were produced domestically vs. imported */
proc freq data=lookup.orders (rename=(CustomerID=PointOfSale ProductID=Source));
format PointOfSale customer_country_name. Source product_country_name.;
table PointOfSale * Source/norow nocol nopercent
out=country_imports_freq(drop=percent);
run;
/* Can't use PROC SUMMARY for cross-tabulations. PROC TABLULATE is good at that,
but doesn't have an OUTPUT statement. We can use ODS OUTPUT to capture the
results for further processing */
ods output table=country_imports_tab(drop=_: rename=(N=Count));
proc tabulate data=lookup.orders (rename=(CustomerID=PointOfSale ProductID=Source))
format=comma12.;
class PointOfSale Source;
table PointOfSale, Source *N=' ';
format PointOfSale customer_country_name. Source product_country_name.;
run;
ods output close;
/******************************************************************************
FREQ real time 4.22 seconds
TABULATE real time 2.27 seconds <- Fastest results for numeric cross-tab
******************************************************************************/
PROC SQL Aggregation
/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use SELECT DISTINCT to de-duplicate. */
proc sql;
create table country_customer_sql as
select put(CustomerID, customer_country_name.) as Country
,count(*) as Items
from
(select distinct CustomerID
from lookup.orders)
group by put(CustomerID, customer_country_name.)
;
quit;
/******************************************************************************
real time 5.39 seconds
******************************************************************************/
/* 2. Which country had the highest sales volume (number of items sold?) */
proc sql;
create table country_items_sql as
select put(CustomerID,customer_country.) as Country
,count(*) as Items
from lookup.orders
group by 1
order by 2 desc
;
quit;
/******************************************************************************
real time 2.64 seconds
******************************************************************************/
/*3. For each country, how many items purchased were produced domestically vs. imported */
proc sql;
create table country_imports_sql as
select put(o.country,$country_name.) as Country
,sum(p.MadeIn=o.Country) format=comma16. as Domestic
,sum(p.MadeIn ne o.Country) format=comma16. as Imported
from lookup.Orders as o
inner join
lookup.Products as p
on p.ProductID=o.ProductID
group by 1
;
quit;
/******************************************************************************
real time 12.1 seconds
******************************************************************************/
DATA Step Aggregation
/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use PROC SORT to deduplicate. */
proc sort data=lookup.orders (keep=CustomerID Country)
out=orders (drop=CustomerID rename=(Country=Ctry)) nodupkey;
by Country customerid;
run;
data country_customer_DS;
length Country $50;
format Count comma16.;
keep Country Count;
set orders;
by ctry;
if first.ctry then do;
Count=0;
end;
Count+1;
if last.ctry then do;
Country=put(Ctry,$Country_name.);
output;
end;
run;
/******************************************************************************
real time 2.14 seconds (sort + DATA step)
******************************************************************************/
/* 2. Which country had the highest sales volume (number of items sold?) */
proc sort data=lookup.orders (keep=CustomerID Country)
out=orders (Drop=CustomerID);
by Country CustomerID;
run;
data country_items_DS;
Length Country $50;
format Items comma16.;
keep Country Items;
set orders;
by Country;
if first.country then do;
Items=0;
end;
Items+1;
if last.country then do;
Country=put(country,$Country_name.);
output;
end;
run;
/******************************************************************************
real time 3.26 seconds (sort+DATA step)
******************************************************************************/
/******************************************************************************
Alternatively, you can summarize without pre-sorting using a hash object
Hat tip to David V. Cole of the Department of Population Medicine,
Harvard Pilgrim Health Care Institute and his excellent SESUG2024 paper "This
Works on So Many Levels: Building Summary Tables in SAS® with Hash Programming"
******************************************************************************/
data _null_;
*Set up PDV variables;
length Country $50 Count 8;
format Count comma16.;
/* Set up the hash.*/
dcl hash hOrd(ordered:'Y');
/* Hash key is the BY variable */
hOrd.definekey( 'Country');
/* Set the BY variable and cournter variable as data */
hOrd.definedata('Country','Count');
hOrd.definedone();
* Use a DO UNTIL loop to process the input dataset;
do until(eof);
/* Read a row from theinput table */
set lookup.orders end=eof;
call missing(count);
/* Format the BY variable for pretty output */
Country=put(country,$Country_name.);
/* Look up BY variable value*/
rc=hOrd.find();
/* Increment count */
count=sum(count,1);
/* Replace the hash key value to update running total */
rc=hOrd.replace();
end;
/* All done - write the hash contents to the output table */
hOrd.output(dataset:'country_items_DS_hash');
run;
/******************************************************************************
real time 4.45 seconds
******************************************************************************/
/*3. For each country, how many items purchased were produced domestically vs. imported */
proc sort data=lookup.orders (keep=Country ProductID)
out=orders;
by Country;
run;
data country_imports_DS;
length Country $50;
format Domestic Imported comma16.;
retain Domestic Imported;
keep Country Domestic Imported;
set orders;
by Country;
if first.country then do;
Domestic=0;
Imported=0;
end;
if country=put(ProductID,Product_country.) then Domestic+1;
else Imported+1;
if last.country then do;
Country=put(country,$Country_name.);
output;
end;
run;
/******************************************************************************
real time 3.84 seconds (sort+DATA step)
******************************************************************************/
DS2 Data Program Aggregation
/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use SELECT DISTINCT to de-duplicate. */
proc ds2;
data country_customer_ds2 /overwrite=yes;
dcl char(50) Country ;
dcl double "Count" having format comma16.;
retain Country;
keep Country "Count";
method run();
set{select distinct CustomerID, country as ctry from lookup.orders order by country};
by ctry;
if first.ctry then do;
Country=put(Ctry,$Country_name.);
"Count"=0;
end;
"Count"+1;
if last.ctry then do;
output;
end;
end;
enddata;
run;
quit;
/******************************************************************************
real time 2.87 seconds
******************************************************************************/
/* 2. Which country had the highest sales volume (number of items sold?) */
proc ds2;
data country_items_ds2 /overwrite=yes;
dcl char(50) Country ;
dcl double "Items" having format comma16.;
retain Country;
keep Country "Items";
method run();
set lookup.orders (keep=(Country));
by Country;
if first.country then do;
"Items"=0;
end;
"Items"+1;
if last.country then do;
Country=put(country,$Country_name.);
output;
end;
end;
enddata;
run;
quit;
/******************************************************************************
real time 8.74 seconds
******************************************************************************/
/*3. For each country, how many items purchased were produced domestically vs. imported */
proc ds2;
data country_imports_ds2 /overwrite=yes;
dcl char(50) Country ;
dcl double Domestic Imported having format comma16.;
retain Country Domestic Imported;
keep Country Domestic Imported;
method run();
set lookup.orders (keep=(Country ProductID));
by Country;
if first.country then do;
Domestic=0;
Imported=0;
end;
if country=put(ProductID,Product_country.) then Domestic+1;
else Imported+1;
if last.country then do;
Country=put(country,$Country_name.);
output;
end;
end;
enddata;
run;
quit;
/******************************************************************************
real time 12.3 seconds
******************************************************************************/
Conclusions
There are multiple ways to aggregate data in base SAS. Depending on your data, system architecture, and the complexity of the aggregation, run times can vary significantly.
PROC FREQ
PROC SUMMARY
PROC TABLULATE
SQL
DATA Step
Hash Object
DS2
Aggregation 1
2.14
2.06
3.24
2.14
2.87
Aggregation 2
0.7
1.71
4.69
3.26
4.45
8.74
Aggregation 3
4.22
2.27
11.94
3.84
12.4
For my own coding, if the aggregation is simple, I tend to use SQL for the versatility and simplicity of coding, and performance is generally pretty good. If performance is my main concern, I’ll consider using the PROCS – FREQ, SUMMARY, or TABLULATE. Whenever the aggregation gets complex, or where I need to use my own aggregation logic, I use the DATA step. In any case, it pays to know about several different techniques. A little experimentation can determine which provides the best result for your aggregation.
What's your favorite "go to" for aggregation in SAS? Did you see anything new in this post that you would like to try?
Until next time, may the SAS be with you!
Mark
References:
SESUG Paper 81-2024 - This Works on So Many Levels: Building Summary Tables in SAS® with Hash Programming - David V. Cole, Harvard Pilgrim Health Care Institute
Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Part 1 – Append
Part 2 – Sort
Part 3 – Deduplicate
Part 4 – Lookup
Find more articles from SAS Global Enablement and Learning here.
... View more