Aggregating your quantitative data is a simple way to summarize and present the story your data is telling. Do the values fall in a large range, or a small one? What's the average value, and how does it compare to the median?
With SAS Studio Flows, you can aggregate your data with multiple methods which are friendly to both point-and-click users and customization-loving users.
In this post, I’ll use the SASHELP.CARS data set to show you the aggregation capabilities of four steps: Summary Statistics, Query, Manage Columns, and Create Columns.
Summary Statistics step
The Summary Statistics step is designed for the SAS user that wants to summarize data with the click of a button. By default, this step will generate the mean, standard deviation, minimum, maximum, and count of non-missing values for selected numeric variables. You can enhance your results with options like grouping statistics by a qualitative variable, adding a weight variable, selecting additional statistics, and visualizing your results. Visit the documentation for a full description of step capabilities.
Through this post, I’ll be working with the SASHELP.CARS data set.
Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.
After connecting your source data to the Summary Statistics node, select one or more analysis variables. I’d like to generate statistics for MSRP and Invoice.
All statistical analysis options appear on the Options tab. By default, all basic statistics (except the number of missing values) are selected. The maximum decimal for each statistic will be the best fit and the divisor for standard deviation and variance is degrees of freedom. Options like additional statistics, percentiles, and plots are available under their respective headings.
At times, selecting one of these options will surface additional options. For example, selecting the option to generate a histogram will surface additional plot options.
After running the step, the summary statistics are displayed in a report with plots beneath them. You can choose to save results in a table instead on the node’s Output tab.
I can further augment my results by adding a classification variable, Make. This will calculate summary statistics for each unique vehicle make included in the source data. I’ll combine classification variables by n-way only, meaning that my results will include only the grouped statistics and not the statistics summarizing the entire table.
The results will be split into separate rows for each classification variable value.
Query step
The Query step has appeared in my last two posts on sorting and deduplicating data. Yet another capability of the Query step is aggregating data. You can generate statistics by calculating new columns with statistical functions. Visit the documentation for a full description of step capabilities.
After connecting source data to the node, the columns in your data will appear on the Options tab. Above that list is the option to calculate a new column.
Selecting this option will surface the Expression Builder. Here, you can manually type your calculation or build it by selecting data and functions from the left menu. Column attributes can be added at the bottom of the window. In this case, I’d like to calculate the mean of the MPG_City column.
As columns are created, they will appear in the Query node’s selected column list. I’ll create two more columns: one to calculate the mean of the MPG_Highway column, then one to calculate the mean of Avg_MPG_City and Avg_MPG_Highway.
When any column is selected in the column list, the option to Convert to Aggregate becomes available. Clicking this option allows you to select from a list of summary functions to apply to the selected column, which further simplifies the aggregation process.
Running this step selected yields the results below: three average MPG values.
What if I want to calculate these values for each distinct vehicle Make, like I did in the Summary Statistics step?
I’d simply add Make to my column list, then add Make on the Groups variable tab as well.
The results are now grouped by distinct Make values.
Manage Columns Step
The Manage Columns step is a simpler step with more focused capabilities. Manage Columns is used for tasks like selecting columns for output, changing column attributes, reordering columns, and calculating new columns. Visit the documentation for a full description of step capabilities.
Much like the Query step, the Manage Columns step has an option to create a new column which will surface the same Expression Builder interface shown previously.
With this, I can create output identical to my initial Query step results.
However, I can't go further here and calculate statistics for the distinct values of a grouping variable. You can only calculate full column aggregations with Manage Columns.
Calculate Columns Step
The Calculate Columns step is used to create new columns based on existing data. Though it sounds like the previous two methods, there’s a major difference. While Query and Manage Columns calculate statistics down a column, Calculate Columns can only calculate statistics across a row. Visit the documentation for a full description of step capabilities.
The Calculate Columns step enables you to transform existing columns or create new ones. Transformation options include type conversion, casing, value extraction, and formatting.
Again, the option to create an entirely new column surfaces the Expression Builder. This time, I’ll create a new column named Profit which subtracts the Invoice from the MSRP for each row.
Running this step shows that a Profit value was calculated for each row.
Considerations & Summary
When choosing which step to use for aggregating data, several factors come into consideration. While the Query and Manage Columns steps are available with SAS Studio Basic, the Summary Statistics and Calculate Columns steps require a SAS Studio Analyst license (or higher). Find information on SAS Studio licensing here.
The Summary Statistics step makes it easy to generate desired statistics and create attractive reports. The Query and Manage Column steps allow you to have maximum control the statistics you calculate and how they're calculate, though the Query step has the additional capability of grouping results. The Calculate Columns step offers a unique output from the other steps, as it enables aggregating across a row instead of down a column.
In this post, I’ve discussed four methods for aggregating data in SAS Studio Flows. Want to know more about data manipulation in SAS Studio? Check out my previous blogs in this series on appending, sorting, and deduplicating data.
Find more articles from SAS Global Enablement and Learning here.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.