BookmarkSubscribeRSS Feed

CAS answers to 4 common data manipulation tasks – Part 4 – AGGREGATE

Started ‎07-18-2019 by
Modified ‎07-18-2019 by
Views 4,598

This is Part 4 in a series of articles about common data manipulation tasks:

Finally we’ll focus on aggregating data.

Aggregating data in CAS

Again, SAS and CAS provide multiple ways to achieve a task and aggregating data is no exception to the rule.

 

When a user needs to aggregate data, he/she might first think about doing that using SQL. In CAS, that means FedSQL:

 

proc fedsql sessref=mysession _method ;
   create table dm.bigprdsale_fed
   {options replication=0 replace=true} as
   select country, product, prodtype, sum(actual) as actual, sum(predict) as predict
   from dm.bigprdsale
   group by country, product, prodtype ;
quit ;

 

That works. But one can also use the aggregate CAS action:

 

proc cas ;
   aggregation.aggregate result=r status=s /
      table={
         name="bigprdsale",
         groupBy={"country","product","prodtype"},
         vars={"actual","predict"}
      },
      varSpecs={
         {name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
         {name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
      }
      casout={name="bigprdsale_aggregate", replace=True, replication=0} ;
quit ;

 

Or the summary CAS action:

 

proc cas ;
   simple.summary result=r status=s /
      inputs={"actual","predict"},
      subSet={"SUM"},
      table={
         name="bigprdsale",
         groupBy={"country","product","prodtype"}
      },
      casout={name="bigprdsale_summary", replace=True, replication=0} ;
quit ;

 

The summary CAS action creates one record per measure. So, it has to be transposed to mimic the result of the FedSQL aggregation or the aggregate CAS action:

 

proc cas ;
   transpose.transpose / 
      table={
         name='BIGPRDSALE_SUMMARY',
         caslib='DM',
         groupBy={"COUNTRY","PRODUCT","PRODTYPE"}
      },
      id={'_Column_'},
      casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
      transpose={'_Sum_'} ;
quit ;

 

How fast do they run?

 

In my case, I observed varying run times (43 million rows, 15 by-groups):

 

Technique Time to run
FedSQL 36.41 seconds
Aggregate CAS action 8.16 seconds
Summary+Transpose CAS action 2.08 seconds

 

So, the summary+transpose combination seems to be a very efficient way of aggregating data.

 

I’ve seen more significant differences when computed values come into play for by-groups. Let’s have a look at the following sample example where an aggregation is performed on a variable that is computed on the fly using the techniques introduced above:

 

/* fedsql aggregation */
proc fedsql sessref=mysession _method ;
   create table dm.bigprdsale_fed
   {options replication=0 replace=true} as
   select country, product, substr(prodtype,1,3) as onTheFly, sum(actual) as actual, sum(predict) as predict
   from dm.bigprdsale
   group by country, product, onTheFly ;
quit ;

/* aggregate CAS action */
proc cas;
   aggregation.aggregate result=r status=s /
      table={
         name="bigprdsale",
         groupBy={"country","product","onTheFly"},
         vars={"actual","predict"},
         computedVars={{name="onTheFly"}},
         computedVarsProgram="onTheFly=substr(prodtype,1,3);"
      },
      varSpecs={
         {name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
         {name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
      }
      casout={name="bigprdsale_aggregate", replace=True, replication=0};
quit ;

/* summary CAS action aggregation + transpose to mimic same result set */
proc cas;
   simple.summary result=r status=s /
      inputs={"actual","predict"},
      subSet={"SUM"},
      table={
         name="bigprdsale",
         groupBy={"country","product","onTheFly"},
         computedVars={{name="onTheFly"}},
         computedVarsProgram="onTheFly=substr(prodtype,1,3);"
         },
      casout={name="bigprdsale_summary", replace=True, replication=0} ;
   transpose.transpose /
      table={
         name='BIGPRDSALE_SUMMARY',
         caslib='DM',
         groupBy={"COUNTRY","PRODUCT","ONTHEFLY"}
      },
      id={'_Column_'},
      casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true}, 
      transpose={'_Sum_'} ;
quit ;

 

In FedSQL, the SELECT statement allows the user to define new computed columns on the fly. In CAS actions, we use the very helpful computedVars/computedVarsProgram options to deal with on-demand computed variables (see Steve Foerster’s article about that particular topic).

 

Here are the new run times when one of the by variable is computed on-the-fly:

 

Technique Time to run
FedSQL 1:51.69 (111.69 seconds)
Aggregate CAS action 9.97 seconds
Summary+Transpose CAS action 3.60 seconds

 

Why do we see a much bigger difference between FedSQL and CAS actions? It’s because today in FedSQL a calculated value in the GROUP BY results in single-threaded execution on a single CAS worker.

 

Again, what I observed in my specific case could have been totally different in another environment with different data size and distribution (for example, if you have millions of by-groups). Testing is obviously necessary.

Takeaways

  • There are multiple ways to aggregate data in CAS
  • Depending on your data and your CAS architecture, they might run in very different times
  • Best practice: experiment with the aggregate, summary CAS actions or FedSQL GROUP BY operations to find the right technique that fits your case
  • Mistakes to avoid:
    • Use computed by-groups in FedSQL GROUP BY in Viya 3.4 (single-threaded)

In the last two articles (de-duplication and aggregation), I didn’t show FedSQL on CAS under its best shape. But don’t get me wrong, these two use cases are very specific. FedSQL can do much more than that and is a natural target for all legacy SQL processing that a customer might have in his SAS 9 environment. FedSQL on CAS is essential and can handle very complex queries efficiently.

 

Thanks for reading.

Comments

Hi all,

 

Just for your information that the above comparison is not true in every case. In one of our customer environments the case was very different. Most likely because our data volumes caused us to run out of memory.

 

This article helped us a lot to look for other solutions when simple.summary did not work at all. It could not utilize the CAS work disk and terminated into an error.

 

Aggregate CAS action could utilize CAS work disk but it took 7 minutes to run.

 

And FedSQL made everything ready in one minute 😉

 

Our lesson learned was that you should test different options in your environment. There are many things which change the situation: data volume vs available memory, disk speed, CPU count and speed, etc.

 

Thanks for the tip to try and test different solutions.

Version history
Last update:
‎07-18-2019 10:20 AM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags