This article shows you how to use the Transformations Node in SAS Model Studio. Similar to the Transform Variables node in SAS Enterprise Miner, the Transformations Node helps you transform data to prepare it for modeling. The Transformations Node can also create interaction variables for you.
There are many reasons to transform or rescale data. For more details on that see my article on Changing the Scale: Transforming Data. In this post, I focus on transforming the data to meet model assumptions so that we are assured of getting model results that are as accurate as possible.
SAS Model Studio’s Transformations Node can accomplish quite a few different transformations. Transformations are deterministic mathematical functions that can help stabilize variances, remove nonlinearity, and correct non-normal distributions to improve the accuracy of your model results.
To illustrate this, I have a sample of right-skewed data below, showing the original Example Input Variable data along with the results of common transformations.
To see how these look graphed, see the image below.
Notice that the Log (natural log) and Log 10 transformations work well on this right-skewed data. The square root works somewhat. Notice that the square and inverse square root don’t seem to help here. The square, in particular, is commonly used to address left-skewed data, not right-skewed data.
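To make the comparison concrete, here is a minimal Python sketch that computes the moment skewness of a small right-skewed sample before and after a few of these transformations. The sample values are made up for illustration (they are not the data shown above), and the helper name moment_skewness is my own.

```python
import math

def moment_skewness(values):
    """Moment skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed sample: many small values and a few large ones.
x = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 45, 100]

transforms = {
    "original": lambda v: v,
    "log": math.log,
    "log10": math.log10,
    "square root": math.sqrt,
    "square": lambda v: v ** 2,
}

# Skewness of the sample under each transformation.
skew = {name: moment_skewness([f(v) for v in x]) for name, f in transforms.items()}
for name, s in skew.items():
    print(f"{name:>12}: skewness = {s:.2f}")
```

The log and log10 rows report the same skewness, because the two functions differ only by a constant factor and skewness does not change when a variable is rescaled.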
We can illustrate how transformations work with a single variable, because we can see this visually in two dimensions. Transformations also work with multiple input variables in higher dimensions, but we can no longer visualize it on a flat screen. In that case we could use residual plots to help us determine if our transformations were improving our model.
I started here with an automatically generated pipeline using the Advanced template for the class target in SAS Model Studio.
I have used the HMEQ data set, a home equity data set used to demonstrate SAS software. It has a target variable, BAD, which is a binary (1, 0) variable that indicates whether a person defaulted on their loan (1) or did not default (0). The data set has both categorical and interval input variables. The categorical variables include JOB and REASON. Some of the interval variables are debt-to-income ratio (DEBTINC) and home value (VALUE).
I have already imputed missing values (see my previous article The Usefulness of the Data Exploration Node in Model Studio). Our next step is to add the Transformations node following the Imputation node. We right-click the Imputation node and select Add child node, Data Mining Preprocessing, Transformations. This is shown in the image below.
The Transformations node is a Data Mining Preprocessing node. By default, it replaces the current input variable (in our case an imputed input variable that replaced the original input variable) with the selected transformation function of that variable.
The Transformations node allows the transformation of both interval and class variables.
For interval variables, simple transformations, standardizing transformations, and binning transformations are available as shown below:
Note that in the Transformations node, 1 is added to the input variable before log transforming it to avoid taking the log of 0.
For each transformation, a new variable is created, as was done with imputation. Recall that the names of imputed variables start with IMP_. With transformations, the prefix depends on the transformation that was used. For example, variables that were log transformed begin with LOG_, variables that use the square transformation begin with SQR_, and so on. The original variables are not deleted from the data set, but by default they are set to Rejected, so they are not used in the model.
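The offset and naming behavior described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not SAS code: the function apply_transformation, the dictionaries, and the sample values are all hypothetical, and only the two prefixes mentioned above are included.

```python
import math

# Prefixes named in the text; other transformations have their own prefixes.
PREFIXES = {"log": "LOG_", "square": "SQR_"}

FUNCTIONS = {
    "log": lambda v: math.log(v + 1),  # 1 is added so that an input of 0 is valid
    "square": lambda v: v ** 2,
}

def apply_transformation(columns, roles, name, method):
    """Add a prefixed, transformed copy of `name` and set the original to Rejected."""
    new_name = PREFIXES[method] + name
    columns[new_name] = [FUNCTIONS[method](v) for v in columns[name]]
    roles[new_name] = "Input"
    roles[name] = "Rejected"  # kept in the data, but not used for modeling
    return new_name

# Made-up values for an imputed debt-to-income variable, including a 0.
columns = {"IMP_DEBTINC": [0.0, 12.5, 34.8, 41.0]}
roles = {"IMP_DEBTINC": "Input"}

new_name = apply_transformation(columns, roles, "IMP_DEBTINC", "log")
print(new_name, roles)
```

Note that the original column is still present after the call; only its role changes.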
Be aware that any variables in your dataset with SAS formats of date, time, or datetime will be neither transformed nor rejected. They will simply be ignored.
When you select a Transformations node, its Options pane appears on the right.
Let's start by using a Best transformation.
The Best transformation selects a transformation from our list of simple transformations based on a ranking criterion.
For a binary target such as ours, the available ranking criteria are:
For more details about the Best transformation, see the article on Best transformation – a new feature in SAS Model Studio 8.3.
Looking at our Results, the Output box, we see that the square root was used on 5 variables and the inverse square on 3 variables.
Each of these transformations was selected based on minimizing the moment skewness for that variable. A table of the various transformations and the Moment Skewness of each is provided.
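The idea of ranking candidate transformations by moment skewness can be sketched as follows. This is a simplified illustration under my own assumptions, not the actual SAS implementation: the candidate list is a subset, the sample data is made up, and I assume the winner is the candidate whose skewness is closest to zero.

```python
import math

def moment_skewness(values):
    """Moment skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical subset of candidate transformations (inputs assumed positive).
CANDIDATES = {
    "none": lambda v: v,
    "log": lambda v: math.log(v + 1),
    "square root": math.sqrt,
    "square": lambda v: v ** 2,
    "inverse square": lambda v: 1 / v ** 2,
}

def best_transformation(values):
    """Return the candidate with the smallest absolute skewness, plus the full table."""
    table = {
        name: moment_skewness([f(v) for v in values])
        for name, f in CANDIDATES.items()
    }
    best = min(table, key=lambda name: abs(table[name]))
    return best, table

# Made-up right-skewed sample.
x = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 45, 100]
best, table = best_transformation(x)

for name, s in table.items():
    print(f"{name:>15}: {s:.2f}")
print("selected:", best)
```

Running the same function on each interval input separately would give a per-variable choice, which is why different variables can end up with different transformations.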
Let's say you don't want to use the same transformation for all of your variables and you don't want to use Best to let the software decide. Maybe you're a micromanager like me and you want to set each variable to the transformation that you choose. You can do this easily either by going to the SAS Model Studio Data Tab OR by using the Manage Variables Node.
Now let’s transform our imputed debt-to-income ratio variable (IMP_DEBTINC) using a log transformation, instead of the square root transformation that the Best option chose. Let’s do it with the Manage Variables Node. Note that you must run the Manage Variables Node before you can open it. Select the Manage Variables Node and select the Manage Variables button.
Select IMP_DEBTINC and change the New transform from Default to Log.
Save, close and run the node. Add a Transformations node below the Manage Variables Node and run it. In our results we see that only the debt-to-income ratio variable was transformed.
Let’s go back and change the Default interval inputs method in the Transformations node from Default to Best, run the node again, and look at our Results.
Notice that we now have many transformations. Notice also that LOG, rather than SQRT, is now used for IMP_DEBTINC: the Manage Variables Node assigned LOG, and that assignment takes precedence over the transformation that Best would have selected in the Transformations node. If you want to ignore the assignments made by the Manage Variables Node, select Ignore methods in metadata under the Transformations node options.
If both the Data Tab and the Manage Variables Node assign a transformation to a variable, the Data Tab will take precedence.
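The precedence between a per-variable assignment in the metadata and the node's default method, as described above, can be summarized in a small Python sketch. The function resolve_method and its arguments are hypothetical names of mine, and IMP_VALUE is used only as an example of a variable with no explicit assignment.

```python
def resolve_method(variable, metadata_methods, node_default, ignore_metadata=False):
    """Return the transformation method that would be applied to `variable`.

    A per-variable assignment in the metadata wins over the node default,
    unless the node is told to ignore methods in metadata.
    """
    assigned = metadata_methods.get(variable, "Default")
    if not ignore_metadata and assigned != "Default":
        return assigned
    return node_default

# Assignment made upstream for one variable only.
metadata_methods = {"IMP_DEBTINC": "Log"}

print(resolve_method("IMP_DEBTINC", metadata_methods, "Best"))
print(resolve_method("IMP_VALUE", metadata_methods, "Best"))
print(resolve_method("IMP_DEBTINC", metadata_methods, "Best", ignore_metadata=True))
```

The first call returns the explicit assignment, while the other two fall back to the node default.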
For class variables, the Transformations node supports binning of rare levels and a number of encoding methods as shown below:
Let’s take a peek at our class inputs options.
If we select Bin rare nominal levels, we can choose the rare cutoff value percentage. By default it is 0.5 percent.
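As a rough illustration of what binning rare levels does, the Python sketch below groups every level whose share of the rows falls below the cutoff percentage into a single combined level. The label _OTHER_, the strict "less than" comparison, and the level counts are assumptions of mine for the example, not documented SAS behavior.

```python
from collections import Counter

def bin_rare_levels(values, cutoff_percent=0.5, rare_label="_OTHER_"):
    """Replace levels that make up less than `cutoff_percent` of rows with one label."""
    counts = Counter(values)
    n = len(values)
    rare = {level for level, c in counts.items() if 100.0 * c / n < cutoff_percent}
    return [rare_label if v in rare else v for v in values]

# Made-up class input with 1,000 rows: "Sales" covers only 0.3 percent of them.
job = ["Office"] * 600 + ["Mgr"] * 397 + ["Sales"] * 3

binned = bin_rare_levels(job)
print(Counter(binned))
```

With the 0.5 percent cutoff, only the level that covers 0.3 percent of the rows is moved into the combined level; the two frequent levels are unchanged.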
If we choose weight of evidence encoding, each level of a class input variable is assigned the weight of evidence of the target event for that level. This encoding is available only for binary and nominal targets. The default WOE adjustment value is 0.5, and you can change this in the Options pane.
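For readers who have not met weight of evidence before, here is a minimal Python sketch of one common formulation for a binary target. The sign convention (non-events over events) and the way the adjustment value is added to each count are assumptions on my part; they may differ from what the Transformations node computes internally. The data is made up.

```python
import math
from collections import defaultdict

def weight_of_evidence(levels, target, adjustment=0.5):
    """WOE per level: ln(share of non-events in level / share of events in level).

    `adjustment` is added to each count so that a level with zero events
    or zero non-events does not produce log(0) or a division by zero.
    """
    events = defaultdict(int)
    nonevents = defaultdict(int)
    for level, y in zip(levels, target):
        if y == 1:
            events[level] += 1
        else:
            nonevents[level] += 1
    total_events = sum(events.values())
    total_nonevents = sum(nonevents.values())
    woe = {}
    for level in set(levels):
        p_event = (events[level] + adjustment) / total_events
        p_nonevent = (nonevents[level] + adjustment) / total_nonevents
        woe[level] = math.log(p_nonevent / p_event)
    return woe

# Made-up data: level A has 2 events in 10 rows, level B has 5 events in 10 rows.
levels = ["A"] * 10 + ["B"] * 10
target = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5

woe = weight_of_evidence(levels, target)
for level in sorted(woe):
    print(level, round(woe[level], 3))
```

Under this convention, a level with relatively few events gets a positive value and a level with relatively many events gets a negative one, so the encoded variable is a single numeric column ordered by risk.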
To help evaluate your transformations, the Transformations node can create summary statistics for each variable after the transformation is applied. Summary statistics require two additional passes through the data.
You can also use the Explore Variables Node to see histograms and other information about your new transformed variables. See my previous article on The Usefulness of the Data Exploration Node in Model Studio for more on that.