Three new Variable Clustering features in SAS Model Studio 8.3


What’s new in Variable Clustering in SAS Model Studio 8.3? We have added three features that greatly extend the functionality of the Variable Clustering node, a Data Mining Preprocessing node that can be used for dimension reduction. In this article, I discuss these new features:

Export cluster component
Export class level indicators
Extract optimal cluster configuration

Export cluster component

In the previous release, Model Studio 8.2, one variable was selected from each cluster and kept as an input for succeeding nodes (the other variables in the cluster were rejected). In 8.3, the new Cluster representation property lets you instead export the first principal component of each cluster (property value “Cluster component”). The default value, “Best variable”, provides the same behavior as 8.2.

With the “Cluster component” option, the first principal component is extracted from all variables in a cluster and output as new variable _CLUSn (e.g. _CLUS1, _CLUS2, _CLUS3, etc.), and the original cluster variables are rejected. The total number of generated component variables corresponds to the number of identified clusters.
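As a rough illustration of what a cluster component is, the following Python sketch computes a first principal component for two clustered variables. It is not the node's score code: it assumes the variables are standardized and that the component comes from their correlation matrix, it uses power iteration purely for brevity, and the data values and function names are made up for the example.

```python
import math

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std for x in column]

def first_component(columns, iterations=200):
    """Return (weights, scores) for the first principal component.

    columns: list of equal-length numeric lists, one per clustered variable.
    Power iteration on the correlation matrix is an illustrative choice,
    not necessarily what Model Studio does internally.
    """
    z = [standardize(c) for c in columns]
    n, p = len(z[0]), len(z)
    # Correlation matrix of the standardized columns.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
            for i in range(p)]
    w = [1.0] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # One score per observation: the weighted sum of standardized values.
    scores = [sum(w[j] * z[j][k] for j in range(p)) for k in range(n)]
    return w, scores

# Made-up values standing in for two correlated inputs such as MORTDUE and VALUE.
mortdue = [25860.0, 70053.0, 13500.0, 97800.0, 30548.0]
value = [39025.0, 68400.0, 16700.0, 112000.0, 40320.0]
weights, clus1 = first_component([mortdue, value])
```

Here `clus1` plays the role of the new `_CLUS1` column: one value per row, replacing the two original variables as the input seen by succeeding nodes.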

Let’s use Home Equity data (HMEQ) as an example. This table is in the Sampsio library that SAS provides, accessible through SAS® Studio. It contains credit line information for mortgage applicants, such as debt-to-income ratio, requested loan amount, number of credit lines, etc. When running Variable Clustering against this data, one variable cluster is identified in the result. The number of variables that are clustered depends upon how the clustering properties are specified, but for this example, two variables are clustered: MORTDUE and VALUE. When choosing to export the cluster component, component variable _CLUS1 is generated containing the first principal component for MORTDUE and VALUE. Additionally, MORTDUE and VALUE are flagged as REJECTED in the output metadata. Here is a capture from the output variable metadata in the variable clustering results:

Here is the calculation for _CLUS1, as captured in the node score code of the variable clustering results:

Here is a small sample of the variable clustering output data, which includes component column _CLUS1 (Cluster Component 1):

Note that the output data still contains the clustered columns MORTDUE and VALUE. These are flagged as REJECTED in metadata, so succeeding nodes in your pipeline will not see those two variables as input.

Export class level indicators

The variable clustering routine clusters numeric variables, not character variables, so how are character variables, or class variables in general, handled? In a process often called one-hot encoding, each class variable is broken out into individual class level indicators (also called dummy variables), each containing a value of 0 or 1. Each indicator is treated separately by the clustering routine. This part of the process is the same as in 8.2.

In 8.3, we have added the Export class level indicators property, which is enabled when the Include class variables property is selected and the Cluster representation property is set to “Best variable” (below). When you choose to export the class level indicators, the node generates the variables corresponding to the class level indicators, flags those variables as inputs to succeeding nodes, and flags the original class variables as rejected.

Note: Class level indicators are always exported when you select “Cluster component” for the Cluster representation property and choose to include class variables.
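The one-hot encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the `<variable>_<level>` naming mirrors generated names mentioned in this article (such as JOB_Office), and giving missing values all-zero indicators is an assumption rather than documented node behavior.

```python
def class_level_indicators(name, values):
    """Expand one class variable into 0/1 indicator columns, one per level.

    Missing values (None) get 0 in every indicator; that handling is an
    assumption made for this sketch.
    """
    levels = sorted({v for v in values if v is not None})
    return {f"{name}_{level}": [1 if v == level else 0 for v in values]
            for level in levels}

# Made-up sample of the JOB class variable.
job = ["Office", "ProfExe", "Other", "Office", "Mgr"]
indicators = class_level_indicators("JOB", job)

for column, flags in indicators.items():
    print(column, flags)
```

Each indicator column is numeric, which is what allows the clustering routine to treat the levels of a class variable as separate candidates.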

Let’s revisit the HMEQ example. Here (see above), we choose to represent clusters with the best variable, we include class variables, and we export the class level indicators. This data has five class input variables: DELINQ, DEROG, JOB, NINQ, and REASON. The following is the class level information provided by the clustering routine (PROC GVARCLUS) in the results:

Below is a partial capture of the Cluster Summary results for the clustering process. Notice that each class variable level is identified and treated as a separate entity.

The class variable clustering information shown here is the same as that in 8.2. However, since we are now generating the corresponding indicator variables, we have added a Class Variable Mapping table to show the levels for each class variable and the corresponding variables that are generated:

The clustering process results in four clusters: CLUS1, CLUS2, CLUS3, and CLUS4. The clustered variables table (below) references the generated variables, rather than the class level indicators as in 8.2. Notice that the clusters don’t include any interval variables; this is a consequence of the data. Interval variables can and do get clustered with class variables (or rather, their indicators), but since the individual class levels of a variable are highly correlated, they typically cluster with each other in earlier steps than they do with interval variables.

To select each cluster’s best variable, a principal component analysis is run (using PROC PCA) to extract the first component’s loading factors for the variables in the cluster. The variable with the highest absolute loading factor is selected. For ties, the variable is chosen based upon name ordering. The other variables are rejected.
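The selection rule just described (highest absolute loading, ties broken by name) can be written as a short Python sketch. The loadings below are hypothetical values, not results from the HMEQ run, and treating "name ordering" as ascending alphabetical order is an assumption.

```python
def best_variable(loadings):
    """Pick the cluster representative: largest |loading|, ties broken by name."""
    return min(loadings, key=lambda name: (-abs(loadings[name]), name))

def split_cluster(loadings):
    """Return (kept variable, sorted list of rejected variables) for one cluster."""
    keep = best_variable(loadings)
    rejected = sorted(name for name in loadings if name != keep)
    return keep, rejected

# Hypothetical first-component loadings for one cluster of JOB indicators.
cluster = {"JOB_Other": -0.71, "JOB_ProfExe": 0.52, "JOB_Office": 0.47}
keep, rejected = split_cluster(cluster)
print("keep:", keep, "reject:", rejected)
```

Using the absolute value matters because the sign of a principal component loading is arbitrary: a variable with a strongly negative loading represents the cluster just as well as one with a strongly positive loading.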

In this example, DELINQ_1 is rejected, JOB_ProfExe and JOB_Office are rejected, NINQ_1 and NINQ_2 are rejected, etc. The partial capture of the output variable metadata result table (below) shows the output metadata for some of the variables. Notice that the original class variable DELINQ is rejected, along with its indicator DELINQ_1. All the other DELINQ indicator variables remain as inputs, since each is either used as a cluster representative or not included in a cluster.

A small sample of the output data for this HMEQ example shows the class variable JOB and its indicator variables. Note that, although they exist in the output data, JOB, JOB_ProfExe, and JOB_Office are rejected in metadata, so succeeding nodes will not use these variables as inputs.

A final note: The process of selecting the best variable when NOT exporting class level indicators is different from what I have described here. In that case, a class variable can only be rejected if all its levels are included in a cluster. This is also the behavior in 8.2 when including class variables in the clustering process.

Extract optimal cluster configuration

In the 8.2 release, once the variable clustering process is kicked off, it continues until a stopping criterion threshold is reached or exceeded. The final cluster configuration when the process stops is the resulting cluster configuration for the node. Now, say that cluster processing stops at the seventh step, but the optimal cluster configuration (based on the penalized log-likelihood) occurs at the fourth step. We have added a new property, Optimal cluster selection method, which enables you to use the optimal cluster configuration at step four. This property has values of “None” (default) or “Penalized log-likelihood”. “None” provides the same behavior as that described above in 8.2. “Penalized log-likelihood” provides the automatic selection of the optimal cluster configuration.

In the HMEQ example above, when running with penalized log-likelihood cluster selection, the clustering stops at the sixth step, but the optimal configuration is achieved at step two:

The optimal configuration at step two is the clustering of MORTDUE and VALUE. If you were to set the property to “None”, the cluster configuration at step six would be used. The cluster at this step includes four additional variables: CLAGE, CLNO, LOAN, and YOJ.

The optimal selection criterion, “Penalized log-likelihood”, calculates the penalized log-likelihood value at each step; the step with the minimum value is flagged as that with the optimal cluster configuration.
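The selection itself amounts to finding the step with the smallest criterion value. A minimal Python sketch, using made-up penalized log-likelihood values (the real values come from the node results) and a hypothetical function name:

```python
def optimal_step(penalized_ll):
    """Return the 1-based step whose penalized log-likelihood is smallest.

    If several steps share the minimum, the earliest one is returned;
    that tie rule is an assumption of this sketch.
    """
    best_index = min(range(len(penalized_ll)), key=lambda i: penalized_ll[i])
    return best_index + 1

# Made-up criterion values for a run that stops at the sixth step.
steps = [412.6, 398.1, 401.7, 405.3, 409.9, 415.2]
print("optimal step:", optimal_step(steps))
```

With the property set to “None”, the equivalent choice would simply be the last step, `len(steps)`.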

The resulting clustered variables table (below) shows that the clustering configuration at step two is used:

The resulting output variable metadata (below) shows that MORTDUE is rejected, since it was not selected as the cluster’s best variable.

Summary

In this article, I have given an overview of three new Variable Clustering features in Model Studio 8.3, and I have provided examples to illustrate how they are used. Here are the main points:

Along with providing the option to select the best variable for a cluster, property Cluster representation is now available to extract the first principal component and export it as a new variable (_CLUS1, _CLUS2, etc.).
Property Export class level indicators is now available to export the class level indicators used in the clustering process (when including class variables). The new class level indicator variables, if clustered, are analyzed along with the other variables in their clusters to determine the cluster representatives. The original class variables are flagged as rejected, being replaced by the indicator variables in succeeding nodes.
Property Optimal cluster selection method has been added. It analyzes the cluster configuration at each step of the clustering process and determines the step that contains the optimal cluster configuration (based on the penalized log-likelihood). That cluster configuration is the one that’s used for variable selection.

abidi · 07-21-2019

Excellent explanations. These new additions are remarkable and will make life easier for data scientists.