In a recent article in the Free Data Friday series, I demonstrated a simple approach to identifying and handling outliers in “classic” SAS using SAS OnDemand for Academics. There we used SAS Procedures (SGPlot and SQL) to deal with some unrealistically large numbers. In this article I will be using SAS Viya for Learners to show how we can do the same thing with the SAS Cloud Analytic Services Language (CASL) during the Data Preparation phase of a project by creating a Custom Data Transformation.
SAS Viya for Learners is a free, cloud-based software offering which allows educators and students to explore the entire analytics life cycle from data preparation through analysis and vizualization - see here for more information and getting started.
We will be using the Bank Customers data set which can be found by selecting Prepare Data from the SAS Viya for Learners Main Menu as shown below:
We need to use a Plan so choose New Plan:
This takes us to a screen where we can choose which data set to use - choose the Bank Customers Data Set from the left hand panel and click OK:
This returns us to the Prepare Data Screen with the Bank Customers file loaded ready to add Transforms to get our data ready for processing.
There are a large number of pre-built transformations available from the Transforms menu, however, none of them facilitate outlier handling so we will need to create our own. To do this we can double-click the Code option under Custom Transforms. This adds a Custom Code transform to the plan where we can choose Data Step or CASL from the drop-down menu. We need to choose CASL from the drop down menu and some boilerplate code will automatic be added to the code window.
This text can now be replaced with the code to carry out our outlier detection and handling. This can be seen below.
The important points to note are as follows:
Line 2 - Outlier handling is part of the Data Preprocess Action Set.
Line 3 - The input library and table names are both held in variables (_dp_InputCaslib and _dp_InputTable). This allows you to change the input table name without changing the code.
Line 4 - The variable hh_income is the one to be analysed for outliers.
Line 5 - There are a number of different outlier detection methods available. For the sake of simplicity I have chosen the Udflimits method. This allows me to specify minimum and maximum values outside of which values will be treated as outliers.
Line 6 - Any observations with values found to be outliers will be removed from the file.
Line 9 - The minimum and maximum values described at Line 4 are specified.
Line 10 - Outlier information can be written to a separate file.
Line11 - As with the input library and table name the output library ad table names are held in variables.
Finally when we run the plan, the transform will be exexcuted and observations outside the specified minimum and maximum ranges will be removed from the output file.
This is only one example of what you can do with Custom Transforms in both data step and CASL code. The next time you need to use a process which isn't covered by one of the pre-built transforms why not try creating one of your own?
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.