BookmarkSubscribeRSS Feed

Manipulating Data in Base SAS® Part 5 – Aggregate

Started 2 weeks ago by
Modified 2 weeks ago by
Views 510

Jordan_Mark_01_02.jpg Huge data is everywhere nowadays. To make sense of it, we must often aggregate data in meaningful ways. In this post, I’ll cover several methods for producing aggregate data sets, including PROC FREQ, PROC SUMMARY, PROC TABLULATE, SQL, DATA step, and DS2. I'll experiment with using a hash object to summarize in a DATA step without having to pre-sort the data, a technique I discovered while reading a SESUG 2024 paper submission by David V. Cole of the Department of Population Medicine, Harvard Pilgrim Health Care Institute. The paper is titled "This Works on So Many Levels: Building Summary Tables in SAS® with Hash Programming". SESUG 2024 begins Sunday, 9/22/2024 in Bethesda, MD. If you are in the area, I hope to see you there!

 

 

The Desired Aggregations

Here are the aggregations we want for this data:

 

  1. How many customers do we have in each country?
  2. What is the sales volume (total number of items sold) for each country?
  3. For each country, how many items purchased were produced domestically vs. imported?

 

The Aggregation Techniques

 

Base SAS provides several methods of performing data aggregation. Each has its own pros and cons. We’ll look at the ease of coding and overall performance of each technique in our scenario above. Our techniques will include:

 

  • PROC FREQ / SUMMARY
  • PROC SQL with GROUP BY
  • DATA step with FIRST./LAST. processing
  • DS2 data program with SQL query on SET and FIRST./LAST. processing

 

PROC FREQ/SUMMARY Aggregation

 

 /* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use PROC SORT to deduplicate. */
proc sort data=lookup.orders (keep=CustomerID) 
           out=orders(rename=(customerID=Country)) nodupkey;
	by customerID;
run;

/* The customer_country_name format displays the customer's country of residence */
proc freq data=orders order=freq;
   table Country / noprint nopercent 
                      out=country_customers_freq(drop=percent );
	format Country customer_country_name.;
run;
proc summary data=orders N NONOBS;
	class Country;
	format Country customer_country_name.;
	var Country;
	output out=country_customers_summary(rename=(_FREQ_=Count _STAT_=Statistic)
                                     where=(Statistic='N' and Country is not null) 
                                     drop=_TYPE_);
run;

/******************************************************************************
FREQ    real time 0.80 seconds
SUMMARY real time 0.20 seconds <- The winner for speed in this example
******************************************************************************/

/* 2. Which country had the highest sales volume (number of items sold?) */
/* No need to deduplicate. */
proc freq data=lookup.orders (keep=Country) order=freq;
   table Country / noprint nopercent out=country_items_freq(drop=percent);
	format Country $Country_name.;
run;

/* Can't use PROC SUMMARY to aggregate non-numeric values, so must use CustomerID */
proc summary data=lookup.orders(keep=CustomerID rename=(CustomerID=Country))  N NONOBS;
	class Country;
	format Country customer_country_name.;
	var Country;
	output out=country_items_summary(rename=(_FREQ_=Count _STAT_=Statistic)
                                    where=(Statistic='N' and Country is not null) 
                                    drop=_TYPE_);
run;

/******************************************************************************
FREQ    real time 0.70 seconds <- the winner when counting character values
SUMMARY real time 1.71 seconds
******************************************************************************/

/*3.	For each country, how many items purchased were produced domestically vs. imported */
proc freq data=lookup.orders (rename=(CustomerID=PointOfSale ProductID=Source));
	format PointOfSale customer_country_name. Source product_country_name.;
	table  PointOfSale * Source/norow nocol nopercent 
          out=country_imports_freq(drop=percent);
run;

/* Can't use PROC SUMMARY for cross-tabulations. PROC TABLULATE is good at that, 
   but doesn't have an OUTPUT statement. We can use ODS OUTPUT to capture the
   results for further processing */

ods output table=country_imports_tab(drop=_: rename=(N=Count));
proc tabulate data=lookup.orders (rename=(CustomerID=PointOfSale ProductID=Source))
              format=comma12.;
   class PointOfSale Source;
   table PointOfSale, Source *N=' ';
	format PointOfSale customer_country_name. Source product_country_name.;
run;
ods output close;
/******************************************************************************
FREQ     real time 4.22 seconds
TABULATE real time 2.27 seconds <- Fastest results for numeric cross-tab
******************************************************************************/ 

 

PROC SQL Aggregation

 

/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use SELECT DISTINCT to de-duplicate. */
proc sql;
create table country_customer_sql as
select put(CustomerID, customer_country_name.) as Country 
      ,count(*) as Items
	from 
	(select distinct CustomerID
     from lookup.orders)
	group by put(CustomerID, customer_country_name.)
;
quit;
/******************************************************************************
 real time 5.39 seconds
******************************************************************************/

/* 2. Which country had the highest sales volume (number of items sold?) */
proc sql;
create table country_items_sql as
select put(CustomerID,customer_country.) as Country
      ,count(*) as Items
	from lookup.orders
	group by 1
	order by 2 desc
;
quit;
/******************************************************************************
 real time 2.64 seconds
******************************************************************************/

/*3.	For each country, how many items purchased were produced domestically vs. imported */
proc sql;
create table country_imports_sql as
select put(o.country,$country_name.) as Country
      ,sum(p.MadeIn=o.Country) format=comma16. as Domestic
      ,sum(p.MadeIn ne o.Country) format=comma16. as Imported
	from lookup.Orders as o
		inner join 
		  lookup.Products as p
		on p.ProductID=o.ProductID
	group by 1
;
quit;

/******************************************************************************
 real time 12.1 seconds
******************************************************************************/

 

DATA Step Aggregation

 

/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use PROC SORT to deduplicate. */
proc sort data=lookup.orders (keep=CustomerID Country)
          out=orders (drop=CustomerID rename=(Country=Ctry)) nodupkey;
	by Country customerid;
run;

data country_customer_DS;
   length Country $50;
	format Count comma16.; 
	keep Country Count;
	set orders;
	by ctry;
	if first.ctry then do;
		Count=0;
	end;
	Count+1;
	if last.ctry then do;
		Country=put(Ctry,$Country_name.);
		output;
	end;
run;
/******************************************************************************
 real time 2.14 seconds (sort + DATA step)
******************************************************************************/

/* 2. Which country had the highest sales volume (number of items sold?) */
proc sort data=lookup.orders (keep=CustomerID Country)
          out=orders (Drop=CustomerID);
	by Country CustomerID;
run;

data country_items_DS;
   Length Country $50;
   format Items comma16.; 
	keep Country Items;
	set orders;
	by Country;
	if first.country then do;
		Items=0;
	end;
	Items+1;
	if last.country then do;
		Country=put(country,$Country_name.);
		output;
	end;
run;
/******************************************************************************
 real time 3.26 seconds (sort+DATA step)
******************************************************************************/

/******************************************************************************
  Alternatively, you can summarize without pre-sorting using a hash object
 
  Hat tip to David V. Cole of the Department of Population Medicine, 
  Harvard Pilgrim Health Care Institute and his excellent SESUG2024 paper "This 
  Works on So Many Levels: Building Summary Tables in SAS® with Hash Programming"
******************************************************************************/
data _null_;
	*Set up PDV variables;
	length Country $50 Count 8;
	format Count comma16.;
	/* Set up the hash.*/	
	dcl hash hOrd(ordered:'Y');
	/* Hash key is the BY variable */	
	hOrd.definekey( 'Country');
	/* Set the BY variable and cournter variable as data */	
	hOrd.definedata('Country','Count');
	hOrd.definedone();
	* Use a DO UNTIL loop to process the input dataset;
	do until(eof);
		/* Read a row from theinput table */
		set lookup.orders end=eof;
		call missing(count);
		/* Format the BY variable for pretty output */
		Country=put(country,$Country_name.);
		/* Look up BY variable value*/
		rc=hOrd.find();
		/* Increment count */
		count=sum(count,1);
		/* Replace the hash key value to update running total */
		rc=hOrd.replace();
	end;
	/* All done - write the hash contents to the output table */
	hOrd.output(dataset:'country_items_DS_hash');
run;
/******************************************************************************
 real time 4.45 seconds
******************************************************************************/

/*3.	For each country, how many items purchased were produced domestically vs. imported */
proc sort data=lookup.orders (keep=Country ProductID)
          out=orders;
	by Country;
run;

data country_imports_DS;
   length Country $50;
   format Domestic Imported comma16.; 
	retain Domestic Imported;
	keep Country Domestic Imported;
	set orders;
	by Country;
	if first.country then do;
			Domestic=0;
			Imported=0;
	end;
	if country=put(ProductID,Product_country.) then Domestic+1;
		else Imported+1;
	if last.country then do;
		Country=put(country,$Country_name.);
		output;
	end;
run;
/******************************************************************************
 real time 3.84 seconds (sort+DATA step)
******************************************************************************/

 

DS2 Data Program Aggregation

 

/* 1. How many customers live in each country? */
/* There are many records for each CustomerID. Use SELECT DISTINCT to de-duplicate. */
proc ds2;
data country_customer_ds2 /overwrite=yes;
   dcl char(50) Country ;
	dcl double "Count" having format comma16.; 
	retain Country;
	keep Country "Count";
	method run();
		set{select distinct CustomerID, country as ctry from lookup.orders order by country};
		by ctry;
		if first.ctry then do;
			Country=put(Ctry,$Country_name.);
			"Count"=0;
		end;
		"Count"+1;
		if last.ctry then do;
			output;
		end;
	end;
enddata;
run;
quit;
/******************************************************************************
 real time 2.87 seconds
******************************************************************************/

/* 2. Which country had the highest sales volume (number of items sold?) */
proc ds2;
data country_items_ds2 /overwrite=yes;
   dcl char(50) Country ;
	dcl double "Items" having format comma16.; 
	retain Country;
	keep Country "Items";
	method run();
		set lookup.orders (keep=(Country));
		by Country;
		if first.country then do;
			"Items"=0;
		end;
		"Items"+1;
		if last.country then do;
			Country=put(country,$Country_name.);
			output;
		end;
	end;
enddata;
run;
quit;
/******************************************************************************
 real time 8.74 seconds
******************************************************************************/

/*3.	For each country, how many items purchased were produced domestically vs. imported */
proc ds2;
data country_imports_ds2 /overwrite=yes;
   dcl char(50) Country ;
	dcl double Domestic Imported having format comma16.; 
	retain Country Domestic Imported;
	keep Country Domestic Imported;
	method run();
		set lookup.orders (keep=(Country ProductID));
		by Country;
		if first.country then do;
			Domestic=0;
			Imported=0;
		end;
		if country=put(ProductID,Product_country.) then Domestic+1;
			else Imported+1;
		if last.country then do;
			Country=put(country,$Country_name.);
			output;
		end;
	end;
enddata;
run;
quit;

/******************************************************************************
 real time 12.3 seconds
******************************************************************************/

 

Conclusions

 

There are multiple ways to aggregate data in base SAS. Depending on your data, system architecture, and the complexity of the aggregation, run times can vary significantly.

 

  PROC FREQ PROC SUMMARY PROC TABLULATE SQL DATA Step Hash Object DS2
Aggregation 1 2.14 2.06   3.24 2.14   2.87
Aggregation 2 0.7 1.71   4.69 3.26 4.45 8.74
Aggregation 3 4.22   2.27 11.94 3.84   12.4

 

For my own coding, if the aggregation is simple, I tend to use SQL for the versatility and simplicity of coding, and performance is generally pretty good. If performance is my main concern, I’ll consider using the PROCS – FREQ, SUMMARY, or TABLULATE. Whenever the aggregation gets complex, or where I need to use my own aggregation logic, I use the DATA step. In any case, it pays to know about several different techniques. A little experimentation can determine which provides the best result for your aggregation.

 

What's your favorite "go to" for aggregation in SAS? Did you see anything new in this post that you would like to try?

 

Jordan_Mark_01_02.jpg
Until next time, may the SAS be with you!

Mark

 

 


References:

  1. SESUG Paper 81-2024 - This Works on So Many Levels: Building Summary Tables in SAS® with Hash Programming - David V. Cole, Harvard Pilgrim Health Care Institute
  2. Grab the ZIP file containing the code and data for this series from my GitHub at  https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:

 

 

 

Find more articles from SAS Global Enablement and Learning here.

Version history
Last update:
2 weeks ago
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags