CAS answers to 4 common data manipulation tasks – Part 4 – AGGREGATE

4 Likes

This is Part 4 in a series of articles about common data manipulation tasks:

Part 1 focused on appending data
Part 2 focused on sorting data
Part 3 focused on de-duplicating data.

Finally we’ll focus on aggregating data.

Aggregating data in CAS

Again, SAS and CAS provide multiple ways to achieve a task and aggregating data is no exception to the rule.

When a user needs to aggregate data, he/she might first think about doing that using SQL. In CAS, that means FedSQL:

proc fedsql sessref=mysession _method ;
   create table dm.bigprdsale_fed
   {options replication=0 replace=true} as
   select country, product, prodtype, sum(actual) as actual, sum(predict) as predict
   from dm.bigprdsale
   group by country, product, prodtype ;
quit ;

That works. But one can also use the aggregate CAS action:

proc cas ;
   aggregation.aggregate result=r status=s /
      table={
         name="bigprdsale",
         groupBy={"country","product","prodtype"},
         vars={"actual","predict"}
      },
      varSpecs={
         {name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
         {name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
      }
      casout={name="bigprdsale_aggregate", replace=True, replication=0} ;
quit ;

Or the summary CAS action:

proc cas ;
   simple.summary result=r status=s /
      inputs={"actual","predict"},
      subSet={"SUM"},
      table={
         name="bigprdsale",
         groupBy={"country","product","prodtype"}
      },
      casout={name="bigprdsale_summary", replace=True, replication=0} ;
quit ;

The summary CAS action creates one record per measure. So, it has to be transposed to mimic the result of the FedSQL aggregation or the aggregate CAS action:

proc cas ;
   transpose.transpose / 
      table={
         name='BIGPRDSALE_SUMMARY',
         caslib='DM',
         groupBy={"COUNTRY","PRODUCT","PRODTYPE"}
      },
      id={'_Column_'},
      casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
      transpose={'_Sum_'} ;
quit ;

How fast do they run?

In my case, I observed varying run times (43 million rows, 15 by-groups):

Technique	Time to run
FedSQL	36.41 seconds
Aggregate CAS action	8.16 seconds
Summary+Transpose CAS action	2.08 seconds

So, the summary+transpose combination seems to be a very efficient way of aggregating data.

I’ve seen more significant differences when computed values come into play for by-groups. Let’s have a look at the following sample example where an aggregation is performed on a variable that is computed on the fly using the techniques introduced above:

/* fedsql aggregation */
proc fedsql sessref=mysession _method ;
   create table dm.bigprdsale_fed
   {options replication=0 replace=true} as
   select country, product, substr(prodtype,1,3) as onTheFly, sum(actual) as actual, sum(predict) as predict
   from dm.bigprdsale
   group by country, product, onTheFly ;
quit ;

/* aggregate CAS action */
proc cas;
   aggregation.aggregate result=r status=s /
      table={
         name="bigprdsale",
         groupBy={"country","product","onTheFly"},
         vars={"actual","predict"},
         computedVars={{name="onTheFly"}},
         computedVarsProgram="onTheFly=substr(prodtype,1,3);"
      },
      varSpecs={
         {name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
         {name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
      }
      casout={name="bigprdsale_aggregate", replace=True, replication=0};
quit ;

/* summary CAS action aggregation + transpose to mimic same result set */
proc cas;
   simple.summary result=r status=s /
      inputs={"actual","predict"},
      subSet={"SUM"},
      table={
         name="bigprdsale",
         groupBy={"country","product","onTheFly"},
         computedVars={{name="onTheFly"}},
         computedVarsProgram="onTheFly=substr(prodtype,1,3);"
         },
      casout={name="bigprdsale_summary", replace=True, replication=0} ;
   transpose.transpose /
      table={
         name='BIGPRDSALE_SUMMARY',
         caslib='DM',
         groupBy={"COUNTRY","PRODUCT","ONTHEFLY"}
      },
      id={'_Column_'},
      casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true}, 
      transpose={'_Sum_'} ;
quit ;

In FedSQL, the SELECT statement allows the user to define new computed columns on the fly. In CAS actions, we use the very helpful computedVars/computedVarsProgram options to deal with on-demand computed variables (see Steve Foerster’s article about that particular topic).

Here are the new run times when one of the by variable is computed on-the-fly:

Technique	Time to run
FedSQL	1:51.69 (111.69 seconds)
Aggregate CAS action	9.97 seconds
Summary+Transpose CAS action	3.60 seconds

Why do we see a much bigger difference between FedSQL and CAS actions? It’s because today in FedSQL a calculated value in the GROUP BY results in single-threaded execution on a single CAS worker.

Again, what I observed in my specific case could have been totally different in another environment with different data size and distribution (for example, if you have millions of by-groups). Testing is obviously necessary.

Takeaways

There are multiple ways to aggregate data in CAS
Depending on your data and your CAS architecture, they might run in very different times
Best practice: experiment with the aggregate, summary CAS actions or FedSQL GROUP BY operations to find the right technique that fits your case
Mistakes to avoid:
- Use computed by-groups in FedSQL GROUP BY in Viya 3.4 (single-threaded)

In the last two articles (de-duplication and aggregation), I didn’t show FedSQL on CAS under its best shape. But don’t get me wrong, these two use cases are very specific. FedSQL can do much more than that and is a natural target for all legacy SQL processing that a customer might have in his SAS 9 environment. FedSQL on CAS is essential and can handle very complex queries efficiently.

Thanks for reading.

TapioKalmi · ‎01-28-2022

Hi all,

Just for your information that the above comparison is not true in every case. In one of our customer environments the case was very different. Most likely because our data volumes caused us to run out of memory.

This article helped us a lot to look for other solutions when simple.summary did not work at all. It could not utilize the CAS work disk and terminated into an error.

Aggregate CAS action could utilize CAS work disk but it took 7 minutes to run.

And FedSQL made everything ready in one minute 😉

Our lesson learned was that you should test different options in your environment. There are many things which change the situation: data volume vs available memory, disk speed, CPU count and speed, etc.

Thanks for the tip to try and test different solutions.