This is Part 4 in a series of articles about common data manipulation tasks:
Part 1 focused on appending data
Part 2 focused on sorting data
Part 3 focused on de-duplicating data.
Finally we’ll focus on aggregating data.
Aggregating data in CAS
Again, SAS and CAS provide multiple ways to achieve a task and aggregating data is no exception to the rule.
When a user needs to aggregate data, he/she might first think about doing that using SQL. In CAS, that means FedSQL:
proc fedsql sessref=mysession _method ;
create table dm.bigprdsale_fed
{options replication=0 replace=true} as
select country, product, prodtype, sum(actual) as actual, sum(predict) as predict
from dm.bigprdsale
group by country, product, prodtype ;
quit ;
That works. But one can also use the aggregate CAS action:
proc cas ;
aggregation.aggregate result=r status=s /
table={
name="bigprdsale",
groupBy={"country","product","prodtype"},
vars={"actual","predict"}
},
varSpecs={
{name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
{name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
}
casout={name="bigprdsale_aggregate", replace=True, replication=0} ;
quit ;
Or the summary CAS action:
proc cas ;
simple.summary result=r status=s /
inputs={"actual","predict"},
subSet={"SUM"},
table={
name="bigprdsale",
groupBy={"country","product","prodtype"}
},
casout={name="bigprdsale_summary", replace=True, replication=0} ;
quit ;
The summary CAS action creates one record per measure. So, it has to be transposed to mimic the result of the FedSQL aggregation or the aggregate CAS action:
proc cas ;
transpose.transpose /
table={
name='BIGPRDSALE_SUMMARY',
caslib='DM',
groupBy={"COUNTRY","PRODUCT","PRODTYPE"}
},
id={'_Column_'},
casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
transpose={'_Sum_'} ;
quit ;
How fast do they run?
In my case, I observed varying run times (43 million rows, 15 by-groups):
Technique
Time to run
FedSQL
36.41 seconds
Aggregate CAS action
8.16 seconds
Summary+Transpose CAS action
2.08 seconds
So, the summary+transpose combination seems to be a very efficient way of aggregating data.
I’ve seen more significant differences when computed values come into play for by-groups. Let’s have a look at the following sample example where an aggregation is performed on a variable that is computed on the fly using the techniques introduced above:
/* fedsql aggregation */
proc fedsql sessref=mysession _method ;
create table dm.bigprdsale_fed
{options replication=0 replace=true} as
select country, product, substr(prodtype,1,3) as onTheFly, sum(actual) as actual, sum(predict) as predict
from dm.bigprdsale
group by country, product, onTheFly ;
quit ;
/* aggregate CAS action */
proc cas;
aggregation.aggregate result=r status=s /
table={
name="bigprdsale",
groupBy={"country","product","onTheFly"},
vars={"actual","predict"},
computedVars={{name="onTheFly"}},
computedVarsProgram="onTheFly=substr(prodtype,1,3);"
},
varSpecs={
{name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
{name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
}
casout={name="bigprdsale_aggregate", replace=True, replication=0};
quit ;
/* summary CAS action aggregation + transpose to mimic same result set */
proc cas;
simple.summary result=r status=s /
inputs={"actual","predict"},
subSet={"SUM"},
table={
name="bigprdsale",
groupBy={"country","product","onTheFly"},
computedVars={{name="onTheFly"}},
computedVarsProgram="onTheFly=substr(prodtype,1,3);"
},
casout={name="bigprdsale_summary", replace=True, replication=0} ;
transpose.transpose /
table={
name='BIGPRDSALE_SUMMARY',
caslib='DM',
groupBy={"COUNTRY","PRODUCT","ONTHEFLY"}
},
id={'_Column_'},
casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
transpose={'_Sum_'} ;
quit ;
In FedSQL, the SELECT statement allows the user to define new computed columns on the fly. In CAS actions, we use the very helpful computedVars/computedVarsProgram options to deal with on-demand computed variables (see Steve Foerster’s article about that particular topic).
Here are the new run times when one of the by variable is computed on-the-fly:
Technique
Time to run
FedSQL
1:51.69 (111.69 seconds)
Aggregate CAS action
9.97 seconds
Summary+Transpose CAS action
3.60 seconds
Why do we see a much bigger difference between FedSQL and CAS actions? It’s because today in FedSQL a calculated value in the GROUP BY results in single-threaded execution on a single CAS worker.
Again, what I observed in my specific case could have been totally different in another environment with different data size and distribution (for example, if you have millions of by-groups). Testing is obviously necessary.
Takeaways
There are multiple ways to aggregate data in CAS
Depending on your data and your CAS architecture, they might run in very different times
Best practice: experiment with the aggregate, summary CAS actions or FedSQL GROUP BY operations to find the right technique that fits your case
Mistakes to avoid:
Use computed by-groups in FedSQL GROUP BY in Viya 3.4 (single-threaded)
In the last two articles (de-duplication and aggregation), I didn’t show FedSQL on CAS under its best shape. But don’t get me wrong, these two use cases are very specific. FedSQL can do much more than that and is a natural target for all legacy SQL processing that a customer might have in his SAS 9 environment. FedSQL on CAS is essential and can handle very complex queries efficiently.
Thanks for reading.
... View more