This is Part 4 in a series of articles about common data manipulation tasks:
Finally we’ll focus on aggregating data.
Again, SAS and CAS provide multiple ways to achieve a task and aggregating data is no exception to the rule.
When a user needs to aggregate data, he/she might first think about doing that using SQL. In CAS, that means FedSQL:
proc fedsql sessref=mysession _method ;
create table dm.bigprdsale_fed
{options replication=0 replace=true} as
select country, product, prodtype, sum(actual) as actual, sum(predict) as predict
from dm.bigprdsale
group by country, product, prodtype ;
quit ;
That works. But one can also use the aggregate CAS action:
proc cas ;
aggregation.aggregate result=r status=s /
table={
name="bigprdsale",
groupBy={"country","product","prodtype"},
vars={"actual","predict"}
},
varSpecs={
{name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
{name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
}
casout={name="bigprdsale_aggregate", replace=True, replication=0} ;
quit ;
Or the summary CAS action:
proc cas ;
simple.summary result=r status=s /
inputs={"actual","predict"},
subSet={"SUM"},
table={
name="bigprdsale",
groupBy={"country","product","prodtype"}
},
casout={name="bigprdsale_summary", replace=True, replication=0} ;
quit ;
The summary CAS action creates one record per measure. So, it has to be transposed to mimic the result of the FedSQL aggregation or the aggregate CAS action:
proc cas ;
transpose.transpose /
table={
name='BIGPRDSALE_SUMMARY',
caslib='DM',
groupBy={"COUNTRY","PRODUCT","PRODTYPE"}
},
id={'_Column_'},
casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
transpose={'_Sum_'} ;
quit ;
How fast do they run?
In my case, I observed varying run times (43 million rows, 15 by-groups):
Technique | Time to run |
FedSQL | 36.41 seconds |
Aggregate CAS action | 8.16 seconds |
Summary+Transpose CAS action | 2.08 seconds |
So, the summary+transpose combination seems to be a very efficient way of aggregating data.
I’ve seen more significant differences when computed values come into play for by-groups. Let’s have a look at the following sample example where an aggregation is performed on a variable that is computed on the fly using the techniques introduced above:
/* fedsql aggregation */
proc fedsql sessref=mysession _method ;
create table dm.bigprdsale_fed
{options replication=0 replace=true} as
select country, product, substr(prodtype,1,3) as onTheFly, sum(actual) as actual, sum(predict) as predict
from dm.bigprdsale
group by country, product, onTheFly ;
quit ;
/* aggregate CAS action */
proc cas;
aggregation.aggregate result=r status=s /
table={
name="bigprdsale",
groupBy={"country","product","onTheFly"},
vars={"actual","predict"},
computedVars={{name="onTheFly"}},
computedVarsProgram="onTheFly=substr(prodtype,1,3);"
},
varSpecs={
{name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
{name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
}
casout={name="bigprdsale_aggregate", replace=True, replication=0};
quit ;
/* summary CAS action aggregation + transpose to mimic same result set */
proc cas;
simple.summary result=r status=s /
inputs={"actual","predict"},
subSet={"SUM"},
table={
name="bigprdsale",
groupBy={"country","product","onTheFly"},
computedVars={{name="onTheFly"}},
computedVarsProgram="onTheFly=substr(prodtype,1,3);"
},
casout={name="bigprdsale_summary", replace=True, replication=0} ;
transpose.transpose /
table={
name='BIGPRDSALE_SUMMARY',
caslib='DM',
groupBy={"COUNTRY","PRODUCT","ONTHEFLY"}
},
id={'_Column_'},
casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
transpose={'_Sum_'} ;
quit ;
In FedSQL, the SELECT statement allows the user to define new computed columns on the fly. In CAS actions, we use the very helpful computedVars/computedVarsProgram options to deal with on-demand computed variables (see Steve Foerster’s article about that particular topic).
Here are the new run times when one of the by variable is computed on-the-fly:
Technique | Time to run |
FedSQL | 1:51.69 (111.69 seconds) |
Aggregate CAS action | 9.97 seconds |
Summary+Transpose CAS action | 3.60 seconds |
Why do we see a much bigger difference between FedSQL and CAS actions? It’s because today in FedSQL a calculated value in the GROUP BY results in single-threaded execution on a single CAS worker.
Again, what I observed in my specific case could have been totally different in another environment with different data size and distribution (for example, if you have millions of by-groups). Testing is obviously necessary.
In the last two articles (de-duplication and aggregation), I didn’t show FedSQL on CAS under its best shape. But don’t get me wrong, these two use cases are very specific. FedSQL can do much more than that and is a natural target for all legacy SQL processing that a customer might have in his SAS 9 environment. FedSQL on CAS is essential and can handle very complex queries efficiently.
Thanks for reading.
Hi all,
Just for your information that the above comparison is not true in every case. In one of our customer environments the case was very different. Most likely because our data volumes caused us to run out of memory.
This article helped us a lot to look for other solutions when simple.summary did not work at all. It could not utilize the CAS work disk and terminated into an error.
Aggregate CAS action could utilize CAS work disk but it took 7 minutes to run.
And FedSQL made everything ready in one minute 😉
Our lesson learned was that you should test different options in your environment. There are many things which change the situation: data volume vs available memory, disk speed, CPU count and speed, etc.
Thanks for the tip to try and test different solutions.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.