From Categories to Numbers: Transforming Categorical Variables Using the CATTRANSFORM Procedure

Machine learning models work primarily with numerical data. They identify patterns, relationships, and trends using mathematical computations. However, in real‑world datasets, not all information is stored in numeric form.

Variables such as gender, with values “Male” and “Female”, or customer type, with values “Basic”, “Silver”, and “Premium”, are examples of categorical data. While these categories are easy for humans to understand, machine learning algorithms cannot process them directly, which creates a gap between real-world data representation and model requirements. To use such data in machine learning models, the categories must first be converted into a numerical format that the algorithms can interpret effectively.

Transforming categorical variables into numerical representations is a crucial preprocessing step in building effective machine learning models. In SAS Viya, this transformation is efficiently handled using the CATTRANSFORM procedure.

Why Transform Categorical Variables?

Categorical variables must be transformed because machine learning models cannot directly interpret text labels. If categories are assigned arbitrary numeric values, the model may assume relationships or an ordering that does not exist, leading to misleading patterns and reduced accuracy. Additionally, high-cardinality variables increase complexity, slow down training, and can lead to overfitting. Transforming categorical data into appropriate numerical representations ensures that models process the data correctly and perform effectively.

For example, consider a categorical variable such as Customer Tier with values 'Basic', 'Silver', and 'Premium'. Arbitrary numeric encoding produces the following values:

Customer Tier	Encoded Value
Basic	1
Silver	2
Premium	3

When categorical variables are assigned arbitrary numeric values, the model interprets these numbers in ways that may not reflect reality. For example, if 'Basic', 'Silver', and 'Premium' are encoded as 1, 2, and 3, the model assumes an inherent ordering in which Premium is greater than Silver, and Silver is greater than Basic. It may also treat the numeric distance between categories as meaningful. As a result, the model might incorrectly learn that 'Premium' is three times as important as 'Basic', even though no such relationship exists. This artificial ordering and implied magnitude can bias the model, leading to incorrect relationships, poor predictions, and unreliable results.

Note: Label encoding, that is, assigning integer codes as in the table above, works well when the categorical variable has a natural ordinal relationship, such as “Low”, “Medium”, and “High”, or different levels of education.
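As a minimal sketch of ordinal label encoding, the following DATA step assigns integer codes that respect the natural order of the levels. The table work.risk and the variables Risk_Level and Risk_Code are hypothetical names used only for illustration.

```sas
/* Hypothetical input data: one ordinal variable, Risk_Level */
data work.risk;
   length Risk_Level $6;
   input Risk_Level $;
   datalines;
Low
High
Medium
;
run;

/* Map each level to an integer that preserves the Low < Medium < High order */
data work.risk_encoded;
   set work.risk;
   select (Risk_Level);
      when ('Low')    Risk_Code = 1;
      when ('Medium') Risk_Code = 2;
      when ('High')   Risk_Code = 3;
      otherwise       Risk_Code = .;   /* unexpected or missing level */
   end;
run;
```

Because the codes follow a real ordering, the numeric distance between them carries at least a rank meaning, which is not the case for nominal variables such as Customer Tier.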

In contrast, using the one‑hot encoding method transforms the categories into the following representation:

Customer Tier	D_Basic	D_Silver	D_Premium
Basic	1	0	0
Silver	0	1	0
Premium	0	0	1

The one-hot encoding approach eliminates the false ordinality problem. Because no numerical hierarchy is imposed, the model does not assume that one category is greater or more important than another. This allows the model to interpret the categorical information based on the data rather than on imposed numeric values. As a result, one-hot encoding provides a representation that is well suited for most machine learning algorithms, enabling them to learn meaningful patterns without distortion.
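To make the representation concrete, here is a small DATA step sketch that builds the three indicator columns by hand. The table names work.tiers and work.tiers_onehot and the variable name Customer_Tier are assumptions for illustration; the indicator names follow the table above.

```sas
/* Hypothetical input data with one nominal variable */
data work.tiers;
   length Customer_Tier $7;
   input Customer_Tier $;
   datalines;
Basic
Silver
Premium
;
run;

/* A comparison in SAS evaluates to 1 (true) or 0 (false),
   so each assignment creates one indicator column */
data work.tiers_onehot;
   set work.tiers;
   D_Basic   = (Customer_Tier = 'Basic');
   D_Silver  = (Customer_Tier = 'Silver');
   D_Premium = (Customer_Tier = 'Premium');
run;
```

Each row has exactly one indicator set to 1, so no category is numerically larger than another.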

What Is the CATTRANSFORM Procedure?

PROC CATTRANSFORM in SAS Viya is a data preprocessing tool used to transform categorical variables into formats suitable for machine learning models. It provides flexible methods for handling categorical data, including both supervised and unsupervised binning techniques to reduce high cardinality and improve model performance. The procedure can group similar or rare levels, apply tree‑based and weight‑of‑evidence methods, and perform one‑hot encoding to convert categories into numerical representations. It also handles missing values by placing them into separate bins, ensuring no loss of information. Overall, PROC CATTRANSFORM helps simplify categorical data, improves computational efficiency, and supports more accurate and reliable modeling.

Transformation Methods Supported in the CATTRANSFORM Procedure

Choosing the right transformation method in PROC CATTRANSFORM depends on both the structure of the data and the modeling objective. The following methods are available:

Unsupervised Grouping (GROUPRARE): This method is most useful when the primary concern is reducing complexity, especially in datasets with high-cardinality categorical variables. Because it does not rely on the target variable, it is well suited to the early stages of data preparation or to cases where the goal is to simplify categories by combining rare levels without introducing modeling bias.
Supervised Grouping: Supervised grouping should be preferred when the relationship between the predictors and the target variable is important. Techniques such as tree-based binning and weight-of-evidence grouping use the target variable to combine categories that behave similarly with respect to the outcome. This makes them particularly valuable in predictive modeling scenarios, where improving model performance is a priority. The supervised techniques are:

1. Tree‑Based Binning
  1. Classification tree -> for categorical targets
  2. Regression tree -> for continuous targets
2. Weight of Evidence (WOE) -> requires a categorical target and a defined event level

Tree-based binning is commonly used in predictive modeling tasks such as risk modeling and churn prediction, where capturing meaningful relationships with the target variable is important. Weight of evidence (WOE) is widely used in risk modeling and credit scoring to quantify the strength of association between predictors and the outcome.

One-Hot Encoding: This method is most appropriate when the objective is to convert categorical variables into a purely numerical format for algorithms that require numeric inputs. It works well when the number of categories is relatively small and there is no inherent ordering among them. However, it may not be the best choice for high-cardinality variables, because it can significantly increase the number of features and the computational cost.

Note: In SAS environments, where procedures can internally handle categorical variables using the CLASS statement, one‑hot encoding is often used when preparing data for external modeling workflows or when consistent preprocessing is needed across different systems.
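As a rough sketch of how the unsupervised methods might be requested, the following steps swap the METHOD statement while keeping the rest of the procedure call minimal. This assumes the PVA table from the demonstration that follows, omits all method-specific options (so whatever defaults the procedure applies are used), and writes to hypothetical output table names; check the procedure documentation before relying on it.

```sas
/* Unsupervised grouping of rare levels: no TARGET statement is used here */
proc cattransform data=public.pva;
   input demcluster;
   method grouprare;                    /* method-specific options omitted */
   output out=public.pva_grouprare;     /* hypothetical output table name */
run;

/* One-hot encoding of the lower-cardinality input */
proc cattransform data=public.pva;
   input statuscat96nk;
   method onehot;                       /* method-specific options omitted */
   output out=public.pva_onehot;        /* hypothetical output table name */
run;
```

GROUPRARE is applied to the 54-level variable and ONEHOT to the 6-level variable, reflecting the guidance above that one-hot encoding suits variables with relatively few categories.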

Transforming Categorical Variables Using PROC CATTRANSFORM

This demonstration illustrates how PROC CATTRANSFORM can be used to group the levels of nominal variables using classification tree-based binning. The example uses a data set (PVA) from a charitable organization that wants to better target individuals for donation requests. By focusing on individuals who are more likely to donate, the organization can reduce spending on solicitation and increase the funds available for charitable work. The data set includes a variety of information about its customers, such as demographic details, past donation amounts, and donation frequency. The challenge is to build a machine learning model that accurately predicts the likelihood of donation (Target_B).

Among the input variables, two categorical predictors require special attention: DemCluster (Demographic Cluster), which has 54 levels, and StatusCat96NK, which has 6 levels. To make these variables more suitable for modeling, they are binned to reduce their complexity and improve their effectiveness as predictors.

Launch SAS Studio and submit the following program to start a CAS session and assign libraries.

cas;                   /* start a CAS session */
caslib _all_ assign;   /* assign librefs for all available caslibs */

The following statements use PROC CATTRANSFORM to perform supervised binning by using the classification tree algorithm.

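/* Supervised binning of two nominal inputs with a classification tree */
proc cattransform data=public.pva evaluationstats;
   input demcluster statuscat96nk;              /* nominal inputs to bin */
   target target_b / level=nominal event='1';   /* nominal target; event level is '1' */
   method tree;                                 /* tree-based binning */
   output out=public.score;                     /* table with the transformed variables */
   savestate rstore=public.PVAmodel;            /* analytic store for later scoring */
run;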

The EVALUATIONSTATS option in the PROC CATTRANSFORM statement requests evaluation statistics about the transformed variables. The INPUT statement specifies one or more variables as input for binning or encoding; all input variables are assumed to be nominal. In this example, the variables DemCluster and StatusCat96NK are selected for transformation. The TARGET statement names the target variable (Target_B) to use in the supervised tree-based and weight-of-evidence binning methods. The target variable can also be used to compute evaluation statistics. It can be nominal or interval, and it cannot be listed in the INPUT statement. The METHOD statement specifies the binning or encoding method and any associated options. You must specify one of the following methods: GROUPRARE, ONEHOT, TREE, or WOE. The OUTPUT statement creates an output data table that contains the transformed variables. The RSTORE= option in the SAVESTATE statement creates an analytic store for the model and saves it as a binary object in a data table named PVAmodel. This analytic store is used later in the ASTORE procedure to score new data.

Now, let's submit the code above and examine the output.

The output first shows that two categorical variables were transformed using a tree-based method that groups categories based on their predictive relationship with the target, and that Weight of Evidence (WOE) values were computed. WOE is defined here using the ratio of non-events to events, which determines how the categorical levels are converted into numerical values that reflect their relationship with the target outcome. To ensure numerical stability, especially where certain categories have very few or no events, a smoothing adjustment value of 0.5 is applied. This prevents extreme or undefined values in the WOE calculation.

Finally, the IV (Information Value) factor of 1 indicates that the standard scaling has been used for evaluating the predictive strength of the transformed variables. Overall, this output conveys that the categorical variables have been intelligently grouped using a tree-based method and then transformed into meaningful numerical values using WOE, making them more suitable for use in predictive modeling.
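One common way to write a smoothed WOE that matches this description (ratio of non-events to events, adjustment of 0.5) is shown below. The symbols are notation introduced here, not taken from the procedure output: N_i and E_i are the non-event and event counts in bin i, N and E are the corresponding totals, and x is the adjustment value. The exact form that the procedure uses should be confirmed in the SAS documentation.

```latex
\mathrm{WOE}_i \;=\; \ln\!\left( \frac{(N_i + x)/N}{(E_i + x)/E} \right),
\qquad x = 0.5
```

Without the adjustment, a bin with E_i = 0 would require division by zero; adding x to both counts keeps the ratio finite.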


The variable transformation output provides a comprehensive view of how the categorical variables have been processed and how useful they are for predicting the target variable. In this case, both DemCluster and StatusCat96NK have been successfully transformed into new variables, each with a reduced set of categories or bins. The number of observations used in the analysis is 106,546, and no missing values are present, indicating that the transformation process has retained all available data without any loss due to missingness. The categorical variables have been grouped into five and two bins respectively, which simplifies their structure while preserving relevant information.

Looking at the predictive strength of these variables, the Information Value (IV) indicates that both variables have moderate predictive power. This suggests that while they are useful for explaining the target variable, they are not among the strongest predictors in the model. The Weight of Evidence (WOE) values provide additional insight into the direction of the relationship. A positive WOE for DemCluster indicates a higher concentration of non-events, whereas the negative WOE for StatusCat96NK suggests a stronger association with events. This directional interpretation helps in understanding how each variable contributes to the prediction.

The statistical tests further confirm the usefulness of these variables. Both the Chi-square and likelihood ratio (G²) statistics are very large, with extremely small p-values, indicating a strong and statistically significant association between each transformed variable and the target. This means the transformations have effectively captured meaningful relationships in the data.

In terms of model performance improvement, the residual sum of squares (or deviance reduction) shows that both variables contribute to reducing prediction error, with StatusCat96NK showing a slightly higher improvement than DemCluster. This is further supported by the relative variable importance values, where StatusCat96NK is identified as the most important variable, and DemCluster contributes at about 86% of that importance.

Overall, this output indicates that the transformation process has successfully simplified the categorical variables while retaining their predictive value. Both variables are statistically significant and contribute meaningfully to the model, with StatusCat96NK emerging as the more important predictor.

The table titled “Bin Details” provides a detailed view of how the categorical variables DemCluster and StatusCat96NK have been grouped into bins and how each bin contributes to predicting the target variable. It shows how the original categories have been combined and how those groupings behave with respect to events and non‑events.

For the variable DemCluster, the transformation has resulted in five bins. Each bin represents a grouping of several original category levels, as indicated by the “N Levels” column. For example, the first bin combines 4 levels and contains 6,193 observations, while the second bin combines 22 levels and contains a much larger portion of the data with 50,413 observations. This shows that the grouping process has consolidated multiple categories into fewer, more meaningful segments.

The “Event Count” and “Non‑Event Count” columns provide insight into how many observations in each bin correspond to the target outcome (event) and non‑event. By comparing these counts, we can understand how strongly each bin is associated with the target.

The “Weight of Evidence” (WOE) column summarizes this relationship numerically. Negative WOE values, such as in bins 1 and 2 for DemCluster, indicate that these groups contain a relatively higher proportion of events, while positive values, such as in bins 4 and 5, suggest a higher proportion of non-events. This transformation converts categorical information into a form that reflects predictive behavior, which is useful for modeling.

The “Information Value” (IV) column measures the contribution of each bin to the overall predictive strength of the variable. Higher IV values indicate bins that are more useful in distinguishing between events and non-events. For DemCluster, bin 5 has the highest IV among its groups, suggesting that it provides the most discrimination within that variable.

Looking at the second variable, StatusCat96NK, the transformation results in only two bins. This suggests that the original categories were grouped into a simpler structure. The first bin contains 28,512 observations with a negative WOE, indicating a higher proportion of events, while the second bin contains 78,034 observations with a positive WOE, indicating a higher proportion of non-events. The Information Value for the first bin is relatively higher compared to the second, showing that it contributes more to distinguishing the target outcome.

Overall, this output illustrates how PROC CATTRANSFORM has grouped categorical levels into bins, calculated event and non‑event distributions, and transformed them into meaningful numerical representations using WOE. These transformations help simplify categorical variables while preserving their predictive power, making them more suitable for use in statistical and machine learning models.
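The per-bin calculation can be sketched by hand to see where the signs come from. The counts below are made up for illustration and are not the PVA results. The sketch assumes the common definitions WOE = log(non-event share / event share) and IV contribution = (non-event share - event share) x WOE, with a 0.5 adjustment added to each bin count; the procedure's exact formulas may differ.

```sas
/* Hypothetical bin counts, for illustration only */
data work.bins;
   input bin events nonevents;
   datalines;
1 30 70
2 20 180
;
run;

/* Totals across all bins, stored in macro variables */
proc sql noprint;
   select sum(events), sum(nonevents)
      into :tot_e, :tot_n
      from work.bins;
quit;

/* Smoothed shares, WOE, and each bin's IV contribution */
data work.woe;
   set work.bins;
   p_event    = (events    + 0.5) / &tot_e;
   p_nonevent = (nonevents + 0.5) / &tot_n;
   woe        = log(p_nonevent / p_event);     /* LOG is the natural logarithm */
   iv_part    = (p_nonevent - p_event) * woe;  /* nonnegative by construction */
run;

proc print data=work.woe;
run;
```

In this toy data, bin 1 holds a larger share of the events than of the non-events, so its WOE is negative, while bin 2 is the reverse and gets a positive WOE. This matches the reading of the Bin Details table above.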

In the final step, the analytic store (astore) created earlier is used to apply the learned transformations to a new data set.

proc astore;
   score data=public.scorePVA      /* new data to score */
         rstore=public.PVAmodel    /* analytic store saved by PROC CATTRANSFORM */
         out=public.pvaout;        /* scored output table */
quit;

The PROC ASTORE statement invokes the procedure and does not require any options. The SCORE statement scores the data using the previously saved analytic store. In the SCORE statement, DATA=public.scorePVA names the input data table, RSTORE=public.PVAmodel specifies the table that contains the analytic store used for scoring, and OUT=public.pvaout specifies the output data table.
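Before scoring, it can be helpful to inspect what the analytic store expects as input and what it produces. A minimal sketch, using the DESCRIBE statement of PROC ASTORE against the same store:

```sas
/* Print a description of the analytic store, including its input and output variables */
proc astore;
   describe rstore=public.PVAmodel;
quit;
```

This is an optional check. It is useful for confirming that the table to be scored contains the input variables that the store requires.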

Successful execution of the code generates a scored output table named PVAout. The PRINT procedure can then be used to display a sample of rows, such as the first 10 observations, from this scored table.

proc print data=public.pvaout (obs=10);
run;

The scored table shows the transformed (binned or encoded) versions of the original categorical variables after applying the PROC CATTRANSFORM procedure. Each observation has been mapped to its corresponding transformed category, making the data ready for modeling or scoring.

Conclusion

Effective transformation of categorical variables is not just a preprocessing step; it is a critical factor in building robust and high‑performing machine learning models. PROC CATTRANSFORM in SAS Viya provides a powerful and flexible framework for transforming and encoding categorical variables.
