Scoring Clustering Models in SAS Viya


This article shows how to generate score code for unsupervised learning nodes in Model Studio, such as the Clustering node, and how to use that code to score data in SAS Studio. Currently, in a Data Mining and Machine Learning project in Model Studio, you can deploy score code only for a predictive model (that is, a branch of the pipeline that includes a Supervised Learning node). But you might want the score code from the Clustering node or the Anomaly Detection node, both of which use unsupervised learning methods (methods that do not involve a target variable). After you run the Clustering node and settle on the clusters, you might want to apply the same "rules" to a different data set. For example, if the population is so large that running the cluster analysis on all of it is computationally infeasible, you can score all the observations and assign them directly to preliminary clusters that were built on a subset of the data. Or you might want to deploy your cluster analysis on an altogether different data set. In all these cases, no clustering iterations are performed to determine cluster membership, which greatly reduces the computing resources and time required.

The narrative that follows assumes that you have already created a pipeline in Model Studio. Pipelines are structured flows of analytic actions. These analytic actions are represented as individual nodes in a pipeline. (Learn more about Building Models with SAS Model Studio | SAS Viya Quick Start Tutorial.)

Scoring Data Using Clustering Models

You can score your data either in Model Studio or outside of it. The choice might depend on the size of your scoring table, your scoring environment, and whether you prefer GUI-based scoring or writing your own program.

Scoring data in Model Studio

To score data in Model Studio, you don’t necessarily need the score code. Just connect a Score Data node to the Clustering node as shown below:


The Score Data node is a Miscellaneous node that enables you to score a data table with the score code that was generated by the predecessor nodes in the pipeline. The scored table can be saved or promoted to a CAS library. This is a straightforward approach and doesn’t require any prior knowledge of coding.

Scoring data outside of Model Studio

If, for any reason, you wish to score data outside of Model Studio, say in SAS Studio, you first need the score code. Currently, in Viya 4, the Clustering node results do not include the score code. To obtain score code from the Clustering node, you have the following options:

Option 1:

Use a SAS Code node to simulate a column of target predictions and move the SAS Code node to the Supervised Learning group. The steps that follow guide you through extracting the score code and then scoring the data in SAS Studio.

Suppose you have already created a pipeline to segment your data. Right-click the Clustering node and select Add child node > Miscellaneous > SAS Code. Your pipeline should look like the one below:

Now in the newly added SAS Code node, you can include code to simulate a column of target predictions.

Click the Open Code Editor button in the SAS Code node.
Ensure that your cursor is in the Scoring code window and write the code to simulate a column of target predictions.

Note: If you do not have a true target in your data, you can either create a pseudo target or use another variable in your data set that is not used as an input to the Clustering node. The PVA data set used in this example already contains a target variable (Target_B).

In the upper right corner of the window, click the Save icon to save the SAS code and then click the Close button to close the Editor window.
Right-click the SAS Code node and select Move > Supervised Learning.
Notice that two things happen. First, the SAS Code node changes from yellow (the color of Miscellaneous nodes) to purple (the color of Supervised Learning nodes). Second, a Model Comparison node is automatically added to the pipeline and connected to the SAS Code node.
Click the Run Pipeline button.
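
The scoring code entered in the SAS Code node can be very small. The following is a minimal sketch, not the exact code used in this example. It assumes a binary target named Target_B with levels 1 and 0 (as in the PVA data) and assumes that Model Studio looks for posterior probability columns named P_<target><level>. The constant values are placeholders rather than real predictions, and your release might expect additional columns.

```sas
/* Hypothetical placeholder predictions for the binary target Target_B.  */
/* The values are arbitrary: only the cluster assignment is of interest, */
/* and these columns exist only so that the node can be treated as a     */
/* supervised model.                                                     */
P_Target_B1 = 0.5;
P_Target_B0 = 0.5;
```

Because the predictions are constants, the fit statistics reported by the Model Comparison node carry no meaning here.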

You now can deploy your score code in various ways (register, publish, download from the Pipeline Comparison tab), just as you would for a supervised model.

Right-click the SAS Code node and select Download Score Code.

When you download the score code from the SAS Code node, the resulting zip file is saved on the client computer, in a location that depends on the browser. For example, Google Chrome saves the zip file to the system download folder on the client machine. The zip file contains the EP score code SAS file (dmcas_epscorecode.sas in this example), which is referenced when you score a data set in SAS Studio.

Again, right-click the SAS Code node and select Results.

Note that the score code is accessible in the Path EP score code window. This window displays the SAS code that was created by the node if analytic stores are generated in the pipeline. The score code can be used outside the Model Studio environment to score new data. The xxxxxxx.setKey call in the method init block contains a string that identifies an analytic store. In this case, the astore file '_B85BU2NJNVFZH8XF74QX4G6O5' can be located in the Models library of your CAS server. The string will be different in your case.

In the Models library, it is saved as '_B85BU2NJNVFZH8XF74QX4G6O5_AST.sashdat'.

Note: The astore file is automatically saved in the Models library only when you choose to download the score code.

This analytic store binary table is combined with data in PROC ASTORE to perform scoring in SAS Studio.

Launch SAS Studio and submit the following program to start a CAS session and assign libraries.

%let homedir=%sysget(HOME); %put &homedir;
cas;
caslib _all_ assign;
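
Before loading the analytic store, it can help to confirm that the file is actually present. The following is a minimal sketch, assuming that your session has read access to the Models caslib. It lists the source files in that caslib so that you can look for the name ending in _AST.sashdat.

```sas
/* List the files stored in the Models caslib and look for the */
/* astore file whose name ends in _AST.sashdat.                */
proc casutil;
   list files incaslib="Models";
quit;
```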

Next, load your analytic store table from the Models library by running the following code:

proc casutil;
   load casdata="_B85BU2NJNVFZH8XF74QX4G6O5_AST.sashdat"
        incaslib="Models" casout="cluster_astore" outcaslib="casuser";
quit;
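
As an optional check, the DESCRIBE statement of PROC ASTORE reports what the loaded analytic store contains, such as its input and output variables. This sketch assumes the table name cluster_astore used in the load step above.

```sas
/* Inspect the analytic store that was loaded into casuser. */
proc astore;
   describe rstore=casuser.cluster_astore;
run;
```

Comparing the reported input variables with the columns of your scoring table is a quick way to catch a mismatch before scoring.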

Upload the EP score code file (saved, for example, in the download folder on the client machine) to a location that is accessible from SAS Studio. In my case, this was a folder on the CAS server. I selected the folder (visible from the Explorer menu in SAS Studio), right-clicked it, and selected Upload files.
Run PROC ASTORE with a SCORE statement in which RSTORE= points to the astore table that you loaded into the casuser library and EPCODE= points to the location of the uploaded EP score code file.

proc astore;
score data=casuser.pva
rstore=casuser.cluster_astore
epcode= '/greenmonthly-export/ssemonthly/homes/a.b@sas.com/dmcas_epscorecode.sas'
out=casuser.cluster_scored;
run;

A snapshot of the output table is shown below. It contains the _CLUSTER_ID_ column, which holds the cluster membership of each record. Also note the IMP_DemAge column, which holds the imputed values of the DemAge variable. It is this data preprocessing step (imputation in this example) that the EP score code file carries out when new data is scored.
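
To see how the scored records are distributed across clusters, a simple frequency table is often enough. The following is a sketch that assumes the output table name casuser.cluster_scored and the _CLUSTER_ID_ column described above.

```sas
/* Count the scored records in each cluster, including records */
/* for which no cluster ID was assigned.                       */
proc freq data=casuser.cluster_scored;
   tables _CLUSTER_ID_ / missing;
run;
```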

Option 2:

To obtain the score code of a Clustering node in Viya 4, you can connect a Score Data node to the Clustering node and run your pipeline. Your completed pipeline should resemble the following:

The steps that follow show how to extract the score code from the Score Data node and perform scoring in SAS Studio.

Right-click the Score Data node and select Results. Note that the score code is accessible in the Path EP score code window. To use the Path EP Score Code file for scoring, you need to download it.
Click the Download icon at the top of the window to download the file. The Path EP Score Code file is downloaded as a text file to the browser's download folder (Google Chrome in this case). The astore file is stored in the project caslib. In this case, the file '_B85BU2NJNVFZH8XF74QX4G6O5_AST' can be located in the project caslib "Analytics_Project_6c4541e7-b4d1-412c-ad79-ffa8617ab294". Unlike in Option 1, the astore file is not saved in the Models library. The caslib name looks unusual, and you might wonder where to find it. It is a bit tricky: to get the name of the project caslib, you need to look in the log file of the Score Data node.
Again, right-click the Score Data node and select Log to open the log file. Search for the text caslib and note the caslib name, as shown below:

Next, launch SAS Studio and submit code to start a CAS session and assign libraries.

%let homedir=%sysget(HOME); %put &homedir;
cas;
caslib _all_ assign;
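
As a possible alternative to searching the Score Data log, you can list the caslibs that are visible to your session and look for a name that begins with Analytics_Project_. This is only a sketch. It assumes that your account is permitted to see the project caslib, and if it is not, the log remains the reliable source.

```sas
/* Write the definitions of all visible caslibs to the SAS log. */
caslib _all_ list;
```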

Now use the CAS procedure to copy the astore from the project caslib to the CASUSER caslib, renaming it cluster_ast.

proc cas;
   /* Copy the astore from the project caslib to CASUSER as cluster_ast. */
   table.copyTable /
      table={name="_B85BU2NJNVFZH8XF74QX4G6O5_AST",
             caslib="Analytics_Project_6c4541e7-b4d1-412c-ad79-ffa8617ab294"}
      casout={name="cluster_ast", caslib="CASUSER", promote=true};
run;
quit;
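
To confirm that the copy succeeded, you can list the in-memory tables in your personal caslib. A minimal sketch, assuming the target name cluster_ast used in the copy step:

```sas
/* List the in-memory tables in casuser; cluster_ast should appear. */
proc casutil;
   list tables incaslib="casuser";
quit;
```

Because the copy promotes the table, rerunning the copy step in a later session can fail if a promoted table with the same name still exists.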

Upload the Path EP Score Code text file that you downloaded from the Score Data node results to a location that is accessible from SAS Studio. In my case, this was a folder on the CAS server. I selected the folder (visible from the Explorer menu in SAS Studio), right-clicked it, and selected Upload files.
Run PROC ASTORE with a SCORE statement in which RSTORE= points to the astore table that you copied to the casuser library and EPCODE= points to the location of the uploaded Path EP Score Code file.

proc astore;
score data=casuser.pva
rstore=casuser.cluster_ast
epcode= '/greenmonthly-export/ssemonthly/homes/a.b@sas.com/Path EP Score Code.txt'
out=casuser.cluster_scored1;
run;

A snapshot of the output table is produced below:

Of the two options discussed for scoring data in SAS Studio, I find the first to be more efficient and less error-prone. The second requires the name of the project caslib, which is tricky to find. There is also a chance that users might inadvertently delete files within the project’s caslib directory.


JThompson · ‎01-04-2024

Nice article, Manoj!