
Reducing Dimensionality of Nominal Variables in SAS Viya


The analytical challenges we face today are less about whether we can solve a problem and more about being selective in our choice of mechanism. When approaching a solution, ease of use, convenience and efficiency trump complexity, technical jargon and brute force, as shown by the increasing adoption of such tools in many contexts. Here, I describe and introduce a specific tool, a SAS Studio custom step, which helps us tackle a problem known as the curse of dimensionality for nominal variables.

 

 

Introduction

 

Artificial Intelligence (AI) and Machine Learning (ML) pipelines require rich features to unlock insights, and those features comprise both numeric and categorical variables. Nominal variables are a subtype of categorical variables that represent membership in groups without any inherent order, ranking or numerical value. When used directly in machine learning, nominal variables pose computational, applicability and informational challenges. Let's consider each in turn.

 

Firstly, too many features cause problems such as data sparsity, computational strain and overfitting. Nominal variables contain multiple levels that are typically encoded into what are commonly known as dummy variables, where each dummy variable indicates a binary state of 0 or 1 for one level. This expands the feature space and creates computational challenges, especially when you have multiple, granular nominal variables.
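To make that expansion concrete, here is a minimal sketch (the variable names and cardinalities are hypothetical) of how dummy encoding inflates the column count:

```python
# Hypothetical example: three nominal variables and their cardinalities.
cardinalities = {"state": 50, "occupation": 500, "product_code": 2000}

# Dummy encoding emits one 0/1 indicator column per level, so the
# encoded feature space is the sum of the cardinalities.
n_dummy_features = sum(cardinalities.values())
print(n_dummy_features)  # 3 original variables become 2550 binary features
```

Three nominal variables have become thousands of sparse binary columns before any modeling has started.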

 

Also, many ML algorithms and implementations operate only on numerical data and require categorical data to be encoded before it can be passed to them. The presence of nominal variables thus shortens the list of applicable ML techniques and constrains the opportunity to fit the most effective model.

 

Finally, nominal variables tend to hide redundant information. Levels of different nominal variables might overlap and describe very similar aspects, such as retirees and senior citizens. Some nominal variables may be hierarchical in nature, such as issues and sub-issues, which reduces the informational value of the top level of the hierarchy. This redundancy leads to overfitting in downstream ML tasks and to additional variable reduction effort to mitigate it.

 

 

Approaches and methods

 

The curse of dimensionality is universal and affects both interval and nominal data. Dimensionality Reduction (DR) techniques for interval data, a common example being Principal Component Analysis (PCA), predate those for nominal data and tend to be more widely known, but techniques for nominal dimensionality reduction have gained prominence thanks to improved implementations. Here are some popular unsupervised techniques for reducing the dimensionality of nominal variables.

 

  • Multiple Correspondence Analysis (MCA)
  • Logistic Principal Component Analysis (LPCA)
  • Categorical Principal Component Analysis (CATPCA)
  • Embeddings trained through deep learning methods

 

We focus on the first two, which are available as options in a SAS procedure called Nominal Dimensionality Reduction (proc NOMINALDR). You'll find that MCA and CATPCA are similar in the sense that both are extensions of Principal Component Analysis: MCA is designed exclusively for nominal variables while CATPCA can handle mixed data types, and the two methods differ in how nominal variables are processed prior to decomposition. Logistic PCA (LPCA) can likewise be viewed as an extension of principal component analysis for binary (and sometimes categorical) data; it models the data through probabilities based on a Bernoulli distribution instead of minimizing the squared error of raw values under a Gaussian assumption. The fourth technique, deep learning embeddings, is a broad area with diverse implementations particularly suited to high-cardinality nominal variables.
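To illustrate the MCA idea (a numpy sketch of the underlying mathematics, not the NOMINALDR implementation), the following code dummy-encodes two hypothetical nominal variables into an indicator matrix, normalizes it into standardized residuals, and extracts reduced dimensions via SVD:

```python
import numpy as np

# Minimal MCA sketch (illustrative only): correspondence analysis
# applied to the dummy-encoded indicator matrix of nominal variables.
rng = np.random.default_rng(0)

# Hypothetical data: 100 observations, two nominal variables with
# 3 and 4 levels, dummy-encoded into an indicator matrix Z.
def one_hot(codes, n_levels):
    return np.eye(n_levels)[codes]

Z = np.hstack([one_hot(rng.integers(0, 3, 100), 3),
               one_hot(rng.integers(0, 4, 100), 4)])

P = Z / Z.sum()                      # correspondence matrix
r = P.sum(axis=1, keepdims=True)     # row masses
c = P.sum(axis=0, keepdims=True)     # column masses

# Standardized residuals: MCA normalizes the indicator matrix
# before the SVD step, unlike PCA on the raw dummy matrix.
S = (P - r @ c) / np.sqrt(r @ c)
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Observation coordinates on the first two reduced dimensions.
scores = (U / np.sqrt(r)) * sing
reduced = scores[:, :2]              # 7 dummy columns -> 2 dimensions
print(reduced.shape)
```

The key difference from plain PCA is the normalization by row and column masses before the decomposition; the reduced columns can then feed downstream models in place of the original dummy variables.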

 

 

Driving Adoption

 

A technical paper, available under References, describes the methodology behind MCA and LPCA in more detail. Programmatic interfaces alone, such as SAS procedures, may not suffice to drive broader analytics adoption. Visual tools enable a wider range of users to apply analytical methods without coding expertise, and for this purpose we make available a SAS Studio Custom Step. The Nominal Dimensionality Reduction SAS Studio Custom Step is a low-code component which wraps a call to the NOMINALDR procedure in a simple user interface, enabling analytics practitioners, even those without SAS programming knowledge, to execute this step in their machine learning pipelines.

 

  • The location of the custom step (also available under References) is here.
  • A copy of the SAS program which is called by the custom step can be accessed here.

 

Once it is available, I shall also share another location where the step forms part of the larger SAS Studio Custom Steps GitHub repository.

 

Watch the following video for a quick walkthrough of the custom step. Using its clearly defined input and output contracts, you can execute the step standalone or as part of a SAS Studio Flow which allows you to design a pipeline of multiple data transformation and feature engineering tasks, including nominal dimension reduction.  You’ll also appreciate the About tab and the guided description of parameters in the custom step interface, which can be a useful educational aid.

 

 

 

 

To sum up,

 

Successfully executing your analytics strategy involves understanding and paying attention to factors beyond the basic capability that you seek. An accurate and robust machine learning model depends on rich features, which in turn need to be crafted after dealing with the curse of dimensionality. While analytical techniques and procedures for dimension reduction have evolved, thoughtful, guided and convenient exposure to those methods through low-code interfaces encourages adoption by practitioners across a wider spectrum of skills and backgrounds.

 

 

References

 

  1. Nominal Variables Dimension Reduction Using SAS®, Technical Paper, SAS Institute, Yonggui Yan, Dec 2025, https://support.sas.com/content/dam/SAS/support/en/technical-papers/nominal-variables-dimension-redu...
  2. Nominal Dimensionality Reduction Custom Step, GitHub repository, https://github.com/SundareshSankaran/nominal-dimension-reduction
  3. An introduction to the curse of dimensionality, https://en.wikipedia.org/wiki/Curse_of_dimensionality
  4. Multiple Correspondence Analysis overview, https://en.wikipedia.org/wiki/Multiple_correspondence_analysis
  5. An introduction to Logistic Principal Component Analysis, https://cran.r-project.org/web/packages/logisticPCA/vignettes/logisticPCA.html
  6. Categorical Principal Component Analysis overview, https://link.springer.com/rwe/10.1007/978-3-319-69909-7_104643-1
  7. Proc NOMINALDR, SAS Procedure, Documentation, https://go.documentation.sas.com/doc/en/pgmsascdc/default/casml/casml_nominaldr_toc.htm
  8. Explaining Custom Steps using SAS Studio Assets, Gemma Robson, SAS Communities, https://communities.sas.com/t5/SAS-Communities-Library/Explaining-Custom-Steps-using-SAS-Studio-Asse... 

 

 

Comments

Cool new procedure!

Nowhere does it state that this is about unsupervised dimensionality reduction. Any target variable (labels, for example) does not play a role. There is only an input space (only independent variables).

You are essentially constructing a new coordinate system, but for nominal variables (rather than interval inputs) ... without making use of a dependent variable (target variable / output variable).

What about using PROC PLS to perform a principal components analysis on either binary, categorical or continuous input (or any mix thereof)? If you make the x-variables identical to the y-variables in PROC PLS, you get PCA, and dimension reduction is then possible.

 

How would this compare to any of the above?

@sbxkoenk , thank you, and yes, indeed, these methods are all about the Xs (the explanatory variables) and the target explained variable does not play a role.  For supervised dimensionality reduction, available methods include Linear Discriminant Analysis,  supervised UMAP and the delightfully named Linear Optimal Low-rank Projection (whose acronym, LOL seems appropriate coming off this festive season).  

I'll go ahead and point out that the methods in the article fall under the category of unsupervised dimensionality reduction.

@PaigeMiller ,  thank you for the reference to Proc PLS (Partial Least Squares).  I agree  it's worthwhile to examine the question of how far we can get with PLS. Let me expand exposure to this discussion and we'll get back.

Regarding supervised methods, PLS and variations such as PLS-DA (discriminant analysis), hazard analysis PLS and logistic PLS work as supervised methods (but those are not available in SAS as far as I know). 

@PaigeMiller 

Regarding the question "What about using PROC PLS to perform a principal components analysis on either binary, categorical or continuous input (or any mix thereof)? If you make the x-variables identical to the y-variables in PROC PLS, you get PCA, and dimension reduction is then possible.":

  • PROC PLS is designed for partial least squares regression, which seeks linear combinations of predictor variables that explain as much variation as possible in the response variables. In PROC PLS, the response (Y) variables must be numeric. Therefore, when predictors are binary or categorical, they can be specified in the CLASS statement, but they cannot be used as response variables in the MODEL statement (otherwise an error is returned). As a result, you cannot make the X-variables identical to the Y-variables when the inputs are binary or categorical variables specified in the CLASS statement.
  • If the response variable is numeric, PROC PLS can be used with the METHOD=PCR option to perform principal components regression, which applies PCA to the predictor (X) variables. In this case, categorical predictors specified in the CLASS statement are internally encoded as dummy variables (each level is represented as a binary indicator of whether an observation belongs to that level; other encoding methods are available, see 'GLM Parameterization of Classification Variables and Effects'). PCA is then applied to these encoded X-variables, and the reduced variables from the PCA are used to predict the numeric response variables through linear models. PROC PLS thus performs both dimension reduction and linear regression training.
  • In contrast, PROC NOMINALDR (offering both MCA and LPCA methods) is specifically designed for reducing the dimensionality of nominal variables. Although nominal variables are internally encoded as dummy variables, PCA is not performed directly on the raw dummy encoding. Instead, MCA applies appropriate normalization and standardization before PCA, while LPCA applies a logit-based transformation prior to PCA and is iterative. Both methods produce reduced-dimension variables for downstream analysis. These reduced variables can then be used in subsequent regression or classification analyses for either numeric or nominal target variables, using appropriate procedures.
  • The reduced variables produced by PROC NOMINALDR can be used as predictors in PROC PLS when the target variable is numeric. In this case, PROC PLS models the linear relationship between the reduced predictors and the numeric response variable.

 

 

Binary or categorical variables can be replaced by dummy variables as responses in PROC PLS. And then you use the same dummy variables on both sides of the PLS equation. So it can be done. Whether or not it is a good idea, I can't say.


In PROC PLS, automatic dummy encoding of binary and categorical variables applies only to the predictor (X) variables through the CLASS statement. Response (Y) variables must be numeric, so I believe categorical variables must be manually encoded prior to using PROC PLS if one would like to use them as both X and Y variables.

Given this, if the goal is purely dimension reduction rather than regression, one could directly perform PCA on the dummy-encoded predictor variables without using PROC PLS.
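As a sketch of that suggestion (illustrative numpy, not PROC PLS or PROC NOMINALDR output; the data are hypothetical), one can manually dummy-encode a nominal variable and run PCA on the result directly:

```python
import numpy as np

# PCA directly on manually dummy-encoded predictors, as suggested
# above for the pure dimension-reduction use case.
rng = np.random.default_rng(1)

codes = rng.integers(0, 5, size=200)     # one nominal variable, 5 levels
X = np.eye(5)[codes]                     # manual dummy encoding

Xc = X - X.mean(axis=0)                  # center the dummy columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # variance explained per component

scores = Xc @ Vt[:2].T                   # keep the first 2 components
print(scores.shape)
```

Note that the dummy columns of a k-level variable sum to one in every row, so the centered matrix has rank at most k-1 and the last principal component carries no variance; this is one reason methods like MCA normalize before decomposing rather than applying PCA to the raw dummy matrix.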

When using PROC NOMINALDR, nominal or categorical variables do not need to be manually encoded as dummy variables. The procedure handles the internal encoding automatically.

I can think of two reasons why you still might want to use PROC PLS instead of PROC NOMINALDR.

 

  1. You don't have access to SAS Viya, so you cannot run PROC NOMINALDR.
  2. PROC PLS has a built-in feature which lets you replace missing values using the EM algorithm.