SAS Studio Tasks have been phased out from recent versions of SAS Viya and have been replaced with SAS Studio Flows. As someone who used SAS Studio Tasks frequently, it took me some time to switch over to flows. Fortunately, flows turned out to be easy to use. In this post, I’ll describe how to use SAS Studio Flows as a point-and-click replacement for the previous SAS Studio Tasks. I will show follow-along instructions for building a flow to clean data and apply machine learning models, using data available in Viya for Learners. This post is geared toward people working in analytics who are new to flows.
What is a flow?
A SAS Studio Flow is a sequence of operations on data. Flows allow the user to accomplish a variety of things, such as summarizing and cleaning data and fitting statistical and machine learning models, all through a point-and-click interface. If you have previously worked with SAS Studio Tasks, flows are their replacement. They accomplish most of the same things that tasks did, and they add a visual aspect that can help users think through their data manipulation and analyses.
A flow is built by adding steps to a flow canvas. Steps are found in the Steps menu in the Navigation pane on the left side of SAS Studio. Double clicking on a step causes it to appear as a node in the flow canvas area of an open flow tab. Nodes can be connected, indicating that the output from one node is the input to the next connected node.
In the picture below, the flow has 2 nodes: a table node labeled PVA_DONORS connected to a List Data node. Once the nodes are connected, they can be dragged anywhere around the canvas, and they will remain connected. If your flow starts to look messy, you can click the Arrange Nodes button to have them automatically arranged visually on the canvas. Options for each node can be set in the Node details area through the point-and-click interface. These details can be minimized and maximized using the buttons on the upper right side of the details area.
Shortcomings of flows
Let’s get my one complaint about flows out of the way. Unlike the previous SAS Studio Tasks, the code generated from flows is difficult to edit. For example, the old logistic regression task (in SAS Viya 2024.09) produced about 15 lines of code. It was straightforward to edit the program and add the options I commonly use (e.g., the ASSOCIATION option in the PROC LOGSELECT statement, which calculates the concordance index). The corresponding logistic regression step in a SAS Studio flow produces over 1000 lines of code. For me, it would be easier to write the PROC LOGSELECT code from scratch than to edit such a large program. Fortunately, when I need an option that isn’t included in the Steps, it is straightforward to use a SAS Program node and code it manually.
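As a rough illustration of the hand-coded alternative, here is a minimal sketch of what I might paste into a SAS Program node. It assumes the PVA_CLEAN table and the FLOWDEMO library that are created later in this post, and the choice of predictors is only for illustration, not a recommended model.

```sas
/* Sketch only: logistic regression coded by hand in a SAS Program node. */
/* Assumes FLOWDEMO.PVA_CLEAN exists (it is built later in this post).   */
/* The predictors below are an illustrative subset, not a tuned model.   */
proc logselect data=flowdemo.pva_clean association;
    class DemGender DemHomeOwner;
    model Response = DemGender DemHomeOwner IM_RecodedMedIncome;
run;
```

The ASSOCIATION option on the PROC LOGSELECT statement is the one mentioned above; everything else is left at its default.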
What I like about flows
Flows have a visual aspect that didn’t exist for the SAS Studio Tasks. Tasks tended to handle one specific operation at a time, whereas flows help me visualize the whole process of data analysis. I haven’t been using them for very long, but I’m finding the visual aspect helps me think about my analytics problem.
I’ll typically go through a sequence of steps including data exploration, modification, feature selection, modeling, and scoring. When these processes are visually linked, it’s easier to quickly absorb what I’ve done and what I have left to do from the picture than from a long program.
A beginner-friendly flow demonstration
In this next section, I’ll walk through building a flow in SAS Viya for Learners (VFL). Follow along with these instructions and you’ll get more comfortable using flows. We’ll be using the PVA_DONORS data set, which is available in VFL. It’s currently located in Files/Courses/VST1, but this could change in the future.
The PVA_DONORS data set contains data on potential donors that a charity will ask for donations. The analytic goal is to build a predictive model using the target variable Response to determine how likely each potential donor is to donate. Before modeling, one predictor (DemMedIncome) needs to be recoded because the value zero was used as a placeholder for missing values. We will recode those zeros to missing, then impute the missing values prior to modeling.
Set up libraries and upload data into memory
Start off by creating a blank flow canvas by clicking the New menu, then choosing Flow. You can also create a blank flow by clicking the plus sign (+) next to a SAS Studio tab. Now navigate to the Steps menu and, under Develop, double click on the SAS Program step to add a node to the canvas. Click on the SAS Program node and paste this code into the code tab in the details area below the canvas. Click the Run entire flow button. This will start a CAS session, create a library, and load the PVA_DONORS data into memory.
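/* Start a CAS session and point the FLOWDEMO libref at the casuser caslib */
cas;
libname FLOWDEMO cas caslib=casuser;

/* Look up the home directory and write it to the log */
%let homedir=%sysget(HOME);
%put &homedir;

/* Load PVA_DONORS into memory, replacing any existing copy */
proc casutil;
    load file="&homedir/Courses/VST1/pva_donors.sas7bdat"
        outcaslib="casuser" casout="PVA_DONORS" replace;
run;
quit;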
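If you want to confirm that the load worked before moving on, a quick check along these lines could be added to the same SAS Program node. This is an optional sketch and not part of the original setup code; it assumes the CAS session and the FLOWDEMO libref created above.

```sas
/* Optional check: list the in-memory tables in the casuser caslib. */
proc casutil;
    list tables incaslib="casuser";
quit;

/* Show column names and types through the FLOWDEMO libref. */
proc contents data=flowdemo.pva_donors;
run;
```

PVA_DONORS should appear in the table listing, and the PROC CONTENTS output should include DemMedIncome and Response.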
Add PVA_DONORS to the flow and recode data
Now go to the Libraries pane and find the icon for PVA_DONORS in the FLOWDEMO library. You can click and drag this icon onto your flow canvas and it will appear as a new table node. Next, we’ll add a few more nodes to the canvas. Go back to Steps and expand the Transform Data group. Double click on the Recode Values step. The node added to your flow has a box on the left side. This is an input port, and we’ll use it to connect this node to our data table. Bring your cursor to the right of the PVA_DONORS node until your pointer turns into a hand. Now click and drag the arrow that emerges until it touches the input port of the Recode Values node. We’ll need an output port to save the recoded data to a new data table. Right-click on the Recode Values node and choose Add Output Port.
Go back to the Steps menu and double click on the Table step (found under Data). Connect the output port from Recode Values to the new Table node. Your flow might be looking messy, so click the Arrange Nodes button:
Your flow should look like this:
Let’s set the details for the Recode Values node. The variable we want to recode is called DemMedIncome and we want to set any $0 amounts to a missing value indicator (a single period). Under the details for Recode Values, in the Data tab, make sure Recode is set to Numeric variable, then set Variable to recode to DemMedIncome. Move to the Values tab and enter 0 for the Old value and a single period (.) for the New value. On the Output tab, put RecodedMedIncome as the variable name and click the radio button labeled “Write to another data set”.
Now click on the unnamed Table node and navigate to the Table properties in the details area. Fill in the FLOWDEMO library and use RECODED_PVA for the table name. In the flow diagram, right-click on the RECODED_PVA node and choose Run to node. Under the Submitted Code and Results tab, click on Output Data to see the new column.
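For readers who think in code, the recode that the Recode Values node performs can be sketched as a short DATA step. This is my own hand-written equivalent under the table and variable names used above, not the code that the step generates.

```sas
/* Sketch: hand-coded equivalent of the Recode Values settings above. */
data flowdemo.recoded_pva;
    set flowdemo.pva_donors;
    /* Zero was a placeholder for "unknown", so map it to missing (.) */
    if DemMedIncome = 0 then RecodedMedIncome = .;
    else RecodedMedIncome = DemMedIncome;
run;

/* NMISS for RecodedMedIncome counts the former zeros, plus any values */
/* that were already missing in DemMedIncome.                          */
proc means data=flowdemo.recoded_pva n nmiss min max;
    var DemMedIncome RecodedMedIncome;
run;
```

Writing the result to a new column keeps the original DemMedIncome intact, which mirrors the choice of a separate variable name on the Output tab.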
Impute missing values
Next, we’ll impute the missing values. In the Steps menu, under Prepare Data, double click on the Imputation step, then do the same for a Table step. Give the Imputation node an output port. Connect the RECODED_PVA node to the input port of the Imputation node, then connect the output port to the Table node. In the details for the Table node, use the FLOWDEMO library and call the table PVA_CLEAN.
Go to the details for the Imputation node. Scroll down in the first tab to the “Replace missing values with a random number” field. Add RecodedMedIncome to this field. This imputes a uniformly distributed number between the minimum and maximum values. I like using this option when I have a lot of missing data, as in RecodedMedIncome, which has ~25% missing values. This approach avoids adding a big spike at a single value, as happens with mean or median imputation. For more information on imputation, please see Imputation: what to use and when?
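A hand-coded version of this random imputation might look like the sketch below. I am assuming PROC VARIMPUTE with CTECH=RANDOM here; I have not confirmed that the Imputation step uses this procedure internally, and the seed value is arbitrary.

```sas
/* Sketch: random imputation between the variable's min and max.       */
/* Assumption: PROC VARIMPUTE stands in for whatever the step runs.    */
/* The seed is an arbitrary value chosen for this example.             */
proc varimpute data=flowdemo.recoded_pva seed=12345;
    input RecodedMedIncome / ctech=random;
    /* Carry all original columns along with the imputed column */
    output out=flowdemo.pva_clean copyvars=(_all_);
run;
```

PROC VARIMPUTE writes the imputed column with an IM_ prefix, so the result here would be IM_RecodedMedIncome.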
Next, click on the Output tab and make sure that Save Imputed Data is checked and the All variables radio button is selected. In the flow diagram, right-click on PVA_CLEAN and choose Run to node. The Results tab shows the imputed values were saved as a column named IM_RecodedMedIncome:
And the flow appears as follows:
Add machine learning nodes to the flow
Next, we’ll add some machine learning models to the flow. We’ll add nodes for a Forest and Gradient Boosting. Note that not all model types are currently available as a step. Any model that is not available can be coded manually in a SAS Program node.
The nodes for these can be added by double clicking the corresponding steps listed under the Machine Learning heading. Connect PVA_CLEAN to the input ports of both nodes. We’ll set the details for the Forest node first, then match them for the Gradient Boosting node.
For predictive models, it’s important to honestly assess performance through data splitting. So, in the details under the Data tab, check the box for Include Validation data. Set the sample proportion to 0.33 to use about two-thirds of the data for training and one-third for validation. Use a nominal target and choose the variable Response.
For Interval inputs, select all variables by checking the first box (by the Name header), then deselect DemMedIncome, Donation_Amt, and RecodedMedIncome. For Nominal inputs, select StatusCat96NK, DemCluster, DemGender, and DemHomeOwner. On the Options tab, check both plots: Misclassification by number of trees and Variable Importance chart. Leave the rest of the settings at their default values.
For the Gradient Boosting model, choose the same variables and plots, and leave other settings at their defaults. Use the Arrange Nodes button if necessary. Right-click on the first node in the flow (PVA_DONORS), and choose Run from node, to run the flow. Look at the results from the two models.
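If you ever need to code these models yourself in a SAS Program node, the settings above translate roughly into the sketch below. This is not the code the steps generate. It lists only the inputs named in this post (the flow uses more interval inputs), it takes the 100 trees from the results reported below, and the seed values are arbitrary.

```sas
/* Sketch: hand-coded forest with a 33% validation partition.         */
/* Only the inputs named in this post are listed; seeds are arbitrary. */
proc forest data=flowdemo.pva_clean ntrees=100 seed=12345;
    partition fraction(validate=0.33 seed=12345);
    target Response / level=nominal;
    input StatusCat96NK DemCluster DemGender DemHomeOwner / level=nominal;
    input IM_RecodedMedIncome / level=interval;
run;

/* Same data, partition, and inputs for gradient boosting. */
proc gradboost data=flowdemo.pva_clean ntrees=100 seed=12345;
    partition fraction(validate=0.33 seed=12345);
    target Response / level=nominal;
    input StatusCat96NK DemCluster DemGender DemHomeOwner / level=nominal;
    input IM_RecodedMedIncome / level=interval;
run;
```

Using the same partition seed in both procedures is intended to give both models the same training and validation split, which makes their misclassification rates easier to compare.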
Model results
The Forest model had a validation misclassification of 0.430 using 100 trees. The lowest misclassification was 0.427 when the Forest had about 25 trees. This information is displayed in tabular form in the report, but I’ll just show the graphs here:
Gradient boosting had a similar misclassification of 0.434 using 100 trees and had its lowest misclassification of about 0.418 when there were 15 trees in the model.
Your results will differ from these. Why? The distributed computing environment itself adds variability to the results. Probably a bigger source of variability is randomization built into both machine learning models. Gradient boosting involves taking random subsets of the observations to build each tree in the ensemble. Forests also take random subsets of the rows and additionally use a random subset of the predictors to build each tree. Randomly partitioning the data into training and validation and imputing random values also contribute to variability in the results.
Add swimlanes and annotations to the flow
On the right side of the canvas is the Submission order button (below Properties) which we’ll use next. Press the button and click Enable submission order. Find the row with the SAS Program node and add the Swimlane Name “setup”. For the other row with the longer flow, add the Swimlane Name “main flow”. Use the arrows above the Swimlane Name field to make sure setup has Order 1 (meaning it will execute first) and main flow has Order 2 (it will execute second).
What’s the point of swimlanes? The next time this flow is run, the SAS Program node that starts the CAS session, creates the FLOWDEMO library, and uploads the data will run first, before the main flow. Swimlanes are helpful when parts of the flow need to be run in order because of operation dependencies.
The last thing we’ll do is to add annotations. I like to include lots of comments in my SAS programs, and this is the closest thing I can do for a flow. Click on the SAS Program node and under details, on the notes tab, add a note such as “start cas, create library, upload data”. This note gets anchored underneath the corresponding node.
Next, use the Add button in the top middle of the canvas to Add Notes to the canvas. Write in “This is a brief flow demo.” for the note. The note is free-floating and can be dragged and resized as needed. Here’s what my finished flow looks like:
There are a few other things I’ve learned while getting used to flows:
Hopefully this tutorial has helped people who are new to flows to get comfortable using them. I expect a lot more steps to become available in future releases of SAS Viya.
Additional information
Here’s the link to my imputation post that was mentioned earlier:
Imputation: what to use and when?
For information on adding snippets to SAS Studio flows, see this post by Mary Kathryn Queen:
SAS Studio Flows: Adding Code Snippets