<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>RalphAbbey Tracker</title>
    <link>https://communities.sas.com/kntur85557/tracker</link>
    <description>RalphAbbey Tracker</description>
    <pubDate>Wed, 06 May 2026 15:08:12 GMT</pubDate>
    <dc:date>2026-05-06T15:08:12Z</dc:date>
    <item>
      <title>Re: Silhouette coefficient</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Silhouette-coefficient/m-p/876253#M42983</link>
      <description>&lt;P&gt;Rick Wicklin has recently written a wonderful 2-part blog post about this.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://blogs.sas.com/content/iml/2023/05/15/silhouette-statistic-cluster.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2023/05/15/silhouette-statistic-cluster.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://blogs.sas.com/content/iml/2023/05/17/compute-silhouette-sas.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2023/05/17/compute-silhouette-sas.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hopefully this can help.&lt;/P&gt;</description>
      <pubDate>Wed, 17 May 2023 15:11:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Silhouette-coefficient/m-p/876253#M42983</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2023-05-17T15:11:00Z</dc:date>
    </item>
    <item>
      <title>Re: savestate issue with decisiontree train action cas</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/savestate-issue-with-decisiontree-train-action-cas/m-p/858346#M10429</link>
      <description>&lt;P&gt;How are you reloading the model when you come back and try to score?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As you have saved the savestate table as a sashdat, you will need to load it back into the new session appropriately. Based on your example, here is something that might be close:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc cas;
   table.loadTable /
   path = "GB_YEARLY_KMS_V3.sashdat"
   caslib = "onair"
   casout = {name="GB_YEARLY_KMS_V3", caslib="onair", replace=true}
;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Then afterwards the astore call should run as it did in the previous session.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2023 21:13:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/savestate-issue-with-decisiontree-train-action-cas/m-p/858346#M10429</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2023-02-10T21:13:35Z</dc:date>
    </item>
    <item>
      <title>Re: PROC FOREST - Directly Converting outmodel= dataset to analytic store dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/PROC-FOREST-Directly-Converting-outmodel-dataset-to-analytic/m-p/807806#M9160</link>
      <description>&lt;P&gt;That's what I get for not fully checking the doc! Thanks for the correction!&lt;/P&gt;</description>
      <pubDate>Thu, 14 Apr 2022 13:21:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/PROC-FOREST-Directly-Converting-outmodel-dataset-to-analytic/m-p/807806#M9160</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2022-04-14T13:21:27Z</dc:date>
    </item>
    <item>
      <title>Re: PROC FOREST - Directly Converting outmodel= dataset to analytic store dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/PROC-FOREST-Directly-Converting-outmodel-dataset-to-analytic/m-p/807654#M9158</link>
      <description>&lt;P&gt;Hi! I think what you're looking for might be the "dtreeExportModel" action in the decisionTree action set. This can take models created using the outmodel option and convert them to the type that you'd see from the savestate statement.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm not 100% sure of your setup, but try something like this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc cas;
   action decisionTree.dtreeExportModel /
      table = {name="RF_Model_V1"},
      casout = {name="name_of_model_astore", replace=1},
      vote = "PROB";
run;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I put the "replace=1" there so that it will overwrite previous tables if they have the same name (use carefully). I also put vote="PROB", which is the default voting method for PROC FOREST. You will want to specify to ensure that the scoring method matches what you expect - probability voting amongst the trees in the forest as opposed to majority voting amongst the trees in the forest.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope this helps, and feel free to reach out if you have any additional questions!&lt;/P&gt;</description>
      <pubDate>Wed, 13 Apr 2022 17:55:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/PROC-FOREST-Directly-Converting-outmodel-dataset-to-analytic/m-p/807654#M9158</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2022-04-13T17:55:44Z</dc:date>
    </item>
    <item>
      <title>Re: Vertical Line for recommended cluster size in CCC plot in Enterprise Miner</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Vertical-Line-for-recommended-cluster-size-in-CCC-plot-in/m-p/637065#M8204</link>
      <description>&lt;P&gt;Hi Saturday,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Where was it that you previously saw the CCC plots with the thin blue line that recommends the number of clusters?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Apr 2020 20:45:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Vertical-Line-for-recommended-cluster-size-in-CCC-plot-in/m-p/637065#M8204</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2020-04-02T20:45:01Z</dc:date>
    </item>
    <item>
      <title>Re: Silhouette coefficient</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Silhouette-coefficient/m-p/632511#M35700</link>
      <description>&lt;P&gt;To my knowledge the Silhouette Coefficient isn't calculated as part of any SAS procedure or SAS Viya action. When I wrote the SAS Global Forum paper my goal was to discuss a lot of the methods that people use to evaluate clustering, but not necessarily limit myself to methods that are supported within the software.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There are some definite downsides to the Silhouette Coefficient: 1) you shouldn't use it to compare two different types of clustering - it's very much tied to centroid-based clustering methods; and 2) it is a time-consuming metric to calculate if you have large numbers of observations.&lt;/P&gt;
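&lt;P&gt;To make the computation concrete, here is a minimal brute-force sketch in Python/NumPy (illustrative only, not SAS; the function name is hypothetical). It makes the O(n^2) cost visible: every pair of observations needs a distance.&lt;/P&gt;

```python
import numpy as np

def silhouette_coefficient(X, labels):
    # Mean silhouette over all observations, computed brute force.
    # The full pairwise distance matrix is why this gets expensive
    # for large numbers of observations.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        if same.sum() == 1:
            scores.append(0.0)  # singleton cluster: s(i) is defined as 0
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # mean intra-cluster distance
        # mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

&lt;P&gt;Two tight, well-separated clusters score close to 1; overlapping clusters drift toward 0.&lt;/P&gt;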
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;That being said, I think that one of the better ways to implement it, if you are looking to do so in SAS, would be by using SAS IML. Because the Silhouette Coefficient looks across all pairs of observations, the matrix setup of IML would be an easier programming approach for this type of problem (as compared to the Data Step - which is still doable).&lt;/P&gt;
      <pubDate>Mon, 16 Mar 2020 18:02:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Silhouette-coefficient/m-p/632511#M35700</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2020-03-16T18:02:42Z</dc:date>
    </item>
    <item>
      <title>Re: LIME in SAS without Viya</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/LIME-in-SAS-without-Viya/m-p/600987#M8033</link>
      <description>&lt;P&gt;Unfortunately to my knowledge the only way would be to write a SAS macro that performs the steps of LIME yourself. This might be a little short on details, but cover these steps cover tabular LIME:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1) Run an analysis to obtain the variance for each input variable&lt;/P&gt;
&lt;P&gt;2) Choose an observation which you wish to run LIME on&lt;/P&gt;
&lt;P&gt;3) Create N samples, where the observations are sampled from a Gaussian (normal) distribution with mean equal to the variable values for the observation from 2) and variance equal to the variance from 1)&lt;/P&gt;
&lt;P&gt;4) Calculate weights for each observation using the RBF kernel between the observation in 2) and the generated observations in 3)&lt;/P&gt;
&lt;P&gt;5) Score this new data set using the model&lt;/P&gt;
&lt;P&gt;6) Run a weighted regression on the scored data set using the output of the model as the dependent variable and the inputs of the model as the independent variables&lt;/P&gt;
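&lt;P&gt;As a concrete (purely illustrative, non-SAS) sketch of steps 3) through 6), assuming a scoring function model_score stands in for the trained model:&lt;/P&gt;

```python
import numpy as np

def lime_explain(x0, variances, model_score, n_samples=500, width=1.0, seed=1):
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(variances, dtype=float))
    # 3) sample around the chosen observation x0
    X = rng.normal(loc=x0, scale=sd, size=(n_samples, len(x0)))
    # 4) RBF kernel weights between x0 and each generated observation
    d2 = ((X - x0) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * width ** 2))
    # 5) score the generated data with the model
    y = model_score(X)
    # 6) weighted regression: model output on model inputs
    A = np.column_stack([np.ones(n_samples), X])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # local slopes, one per input variable
```

&lt;P&gt;The returned slopes are the local explanation: for a model that is exactly linear, they recover its coefficients.&lt;/P&gt;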
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let me know if this helps.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 15:03:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/LIME-in-SAS-without-Viya/m-p/600987#M8033</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2019-11-01T15:03:12Z</dc:date>
    </item>
    <item>
      <title>Re: Explain autoencoder prediction with SHAP</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Explain-autoencoder-prediction-with-SHAP/m-p/591019#M7994</link>
      <description>&lt;P&gt;EduxEdux,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The linearExplainer action, which supports the Kernel SHAP method for estimating Shapley values, does require a "predicted target." However, in your case, you have multiple "predicted targets," as the paper about Shapley values on autoencoders suggests. While the autoencoder is trying to predict what each input variable is, the linearExplainer action wants to know the name of the output variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The only complication I can see is if the annTrain autoencoder has the same output variable names as input variable names. As long as they are different names, derived or otherwise, then you should be able to use the linearExplainer action.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For the "predictedTarget" parameter, instead use the output variable name corresponding to the input variable that you wish to examine. As the paper says, you first need to consider the top variables which are different between input and output (and thus you'd need different names in SAS anyway, otherwise you'd overwrite). When you go to use the linearExplainer action, use that output variable name as the "predictedTarget".&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hopefully this is clear, let me know if I can help further!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;-Ralph&lt;/P&gt;</description>
      <pubDate>Mon, 23 Sep 2019 18:20:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Explain-autoencoder-prediction-with-SHAP/m-p/591019#M7994</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2019-09-23T18:20:13Z</dc:date>
    </item>
    <item>
      <title>Re: k means clustering in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/k-means-clustering-in-SAS/m-p/512636#M7497</link>
      <description>&lt;P&gt;In regards to dimension reduction for the purpose of visualization, there isn't necessarily a correct or incorrect answer. You have identified two good techniques, but these techniques do something slightly differently. This will mean that your understanding of the plots that they produce need to be different.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Canonical Discriminant Analysis will use the cluster variable and create a projection that is based upon the cluster labels that you have assigned. What this means is that CDA will try to find the linear combination of inputs that has the highest correlation with the cluster label. You can think of this as the "best" (given the metric used in CDA) projection of the data for the purpose of seeing what linear combination best separates the cluster labels.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Principal Component Analysis will not consider the cluster labels. This could be more useful if you want to see how the clustering looks in a lower dimension without using the cluster information to bias your projection. The projection of the data is not dependent on how you cluster, but is instead the "best" with respect to the variance of the data, so you can see the data, and then see how the cluster labels are distributed across your projected space.&lt;/P&gt;
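&lt;P&gt;For reference, the PCA route is only a few lines outside SAS as well; a minimal sketch (illustrative, hypothetical function name) that projects the data to two dimensions without ever touching the cluster labels:&lt;/P&gt;

```python
import numpy as np

def pca_2d(X):
    # Project onto the top two principal components. Note that no
    # cluster labels enter the computation, unlike CDA.
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)  # ascending eigenvalues
    top2 = vecs[:, np.argsort(vals)[::-1][:2]]
    return Xc @ top2
```

&lt;P&gt;You would then scatter-plot the two columns and color the points by cluster label after the fact.&lt;/P&gt;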
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Ultimately the dimension reduction methods answer slightly different questions, and what you're trying to do with the dimension reduction and plotting should inform which route you go.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope this helped!&lt;/P&gt;</description>
      <pubDate>Tue, 13 Nov 2018 16:46:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/k-means-clustering-in-SAS/m-p/512636#M7497</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-11-13T16:46:46Z</dc:date>
    </item>
    <item>
      <title>Re: Provide a cost function for PROC HPSPLIT to let it deal better with unbalanced categorical data</title>
      <link>https://communities.sas.com/t5/SASware-Ballot-Ideas/Provide-a-cost-function-for-PROC-HPSPLIT-to-let-it-deal-better/idc-p/483388#M3350</link>
      <description>&lt;P&gt;Thank you for the suggestion on improvements to HPSPLIT. For the current time, I do want to suggest a possibility that might mitigate some of your concerns with more than 2 categories for your response (obviously, your original points are still valid).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You could try a 2-model process, where you first predict "normal" vs "not-normal." Then for the cases of "not-normal" you can try to predict "low" vs "very low." By grouping low and very low together for the first model, you are creating categories that are a little more balanced, and also only doing a binary prediction instead of one with more than 2 categories. For any observations which you have predicted as "not-normal," you can follow with a second model to try to predict what type of "not-normal."&lt;/P&gt;
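&lt;P&gt;The cascade can be sketched in a few lines of (hypothetical, non-SAS) Python, where model_a and model_b stand in for the two fitted models:&lt;/P&gt;

```python
def two_stage_predict(x, model_a, model_b):
    # Stage 1: "normal" vs "not-normal" (low and very low grouped together)
    if model_a(x) == "normal":
        return "normal"
    # Stage 2: only the not-normal cases reach the second model
    return model_b(x)  # "low" or "very low"
```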
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;While this is only a workaround, hopefully it can help!&lt;/P&gt;</description>
      <pubDate>Thu, 02 Aug 2018 13:53:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SASware-Ballot-Ideas/Provide-a-cost-function-for-PROC-HPSPLIT-to-let-it-deal-better/idc-p/483388#M3350</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-08-02T13:53:56Z</dc:date>
    </item>
    <item>
      <title>Re: How to choose the best k among many in SAS?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-choose-the-best-k-among-many-in-SAS/m-p/463798#M7050</link>
      <description>&lt;P&gt;&amp;nbsp;One of the difficulties in determining the correct number of clusters is that intra-cluster similarity often increases as you increase k. This is because if you split a cluster into two smaller clusters, those smaller clusters will have a higher intra-cluster similarity than the one cluster they were derived from.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It's possible that a plot of the number of clusters versus intra-cluster similarity will show a change in steepness once you've reached the natural number of clusters and have started splitting good clusters. You would expect the increase in intra-cluster similarity to be smaller when you split a good cluster than when you split a large bad cluster into two smaller clusters. However, this is just a heuristic, and can be difficult to determine by just looking at the plot.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The difficulty in determining the number of clusters is one of the large and still actively explored areas in clustering research. It's also why people really like methods such as dbscan, spectral clustering, or consensus clustering, which seek to determine the number of clusters during the clustering process.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;From what you've mentioned in this post, I'd recommend a few possibilities:&lt;/P&gt;
&lt;P&gt;1)&amp;nbsp;If you have certain business rules that can help you narrow your search for the number of clusters, try to limit the search space this way first.&lt;/P&gt;
&lt;P&gt;2) You can plot the intra-cluster similarity on the y-axis and the number of clusters on the x-axis. Look to see if the gains in intra-cluster similarity taper off as you increase the number of clusters. This is a heuristic, and not guaranteed to happen, but if it does, it can give you an easy-to-see answer.&lt;/P&gt;
&lt;P&gt;3) If you have a specific end use for the clusters, you can perform that analysis on each set of clusters, and pick the set that seems to give you the best results (this could overfit to your clusters though, and you might want a hold-out test set to help avoid overfitting)&lt;/P&gt;
&lt;P&gt;4) While SAS does not have dbscan or some of the other methods I mentioned, some of them you can replicate using other procs and data step code. This is a bit more technical (it requires a deep understanding of the underlying clustering algorithms), and by far the most time-consuming approach, but it could provide useful insights if you have the time for it.&lt;/P&gt;
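&lt;P&gt;To illustrate point 2), here is a small non-SAS sketch (the k-means implementation is deliberately minimal and purely illustrative) that computes the within-cluster sum of squares - the flip side of intra-cluster similarity - for a range of k:&lt;/P&gt;

```python
import numpy as np

def kmeans_wss(X, k, n_iter=50, seed=0):
    # Tiny Lloyd's-algorithm k-means returning the within-cluster
    # sum of squares (WSS) at convergence.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return float(((X - centers[labels]) ** 2).sum())

def elbow_curve(X, max_k=5):
    # WSS for k = 1 .. max_k; plot it and look for where the drop levels off
    return [kmeans_wss(X, k) for k in range(1, max_k + 1)]
```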
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hopefully this helps you get started.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 21 May 2018 15:44:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-choose-the-best-k-among-many-in-SAS/m-p/463798#M7050</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-05-21T15:44:37Z</dc:date>
    </item>
    <item>
      <title>Re: What are the characteristics of a good cluster?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/What-are-the-characteristics-of-a-good-cluster/m-p/460470#M6963</link>
      <description>&lt;P&gt;It seems that you actually have two questions here: 1) How do I compare two clustering results to determine which is optimal 2) How do I determine the number of clusters is optimal.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;While 1) can be related to 2) - for instance, if you want to compare a clustering result with 3 clusters vs a result with 20 clusters - I will mostly address these separately. I will have some specific details in my answer, but also more general points. I hope both help!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1) "How do I compare two clustering results to determine which is optimal"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As mentioned by Ksharp, the Analysis of Variance is a useful metric when evaluating clustering results. You can use PROC GLM in an Enterprise Miner code node to do this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Ultimately though, as clustering is an unsupervised task (i.e., there is no target variable used), I find that the meaning of "optimal" in the case of clustering can be problem-dependent (even if the data is the same).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The way I like to approach the&amp;nbsp;question is by first asking "what is the goal of clustering" for the context of the problem you're working on (what do you want the clusters to help you do?). For example, if it's a predictive modeling problem in which you want to develop models on each cluster separately, then the overall accuracy of your models across all the data will let you know how good the clustering is.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2) "How do I determine the number of clusters when using clustering"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;One way, if you have SAS Enterprise Miner 13.1 or later, is the HP Cluster node under the HPDM tab. This node has a metric called the "Aligned Box Criterion" which automatically seeks to find the number of clusters for you.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another method is called spectral clustering, which looks at the eigenvalues of a similarity matrix to try to determine the number of clusters. While this is not implemented in Enterprise Miner, SAS does have the procedures you would need to implement it yourself in a SAS Code node: data step code and a procedure to get the principal components, followed by kmeans.&lt;/P&gt;
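&lt;P&gt;The eigenvalue idea behind spectral clustering can be sketched outside SAS in a few lines (hypothetical, illustrative code): build a similarity matrix, form the normalized graph Laplacian, and pick the number of clusters at the largest gap among its smallest eigenvalues.&lt;/P&gt;

```python
import numpy as np

def eigengap_num_clusters(X, sigma=1.0, max_k=8):
    # RBF similarity matrix
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    # symmetric normalized graph Laplacian
    deg = S.sum(axis=1)
    L = np.eye(len(X)) - S / np.sqrt(np.outer(deg, deg))
    vals = np.sort(np.linalg.eigvalsh(L))[:max_k]
    # the index of the largest eigengap suggests the number of clusters
    return int(np.argmax(np.diff(vals))) + 1
```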
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;----&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, an idea that addresses both questions, though it is much more involved, is consensus clustering (which can be combined with the two previous ideas for determining the number of clusters). The goal behind consensus clustering is to ensemble multiple clustering results into one (including results with different numbers of clusters). The reasoning for ensembling is that if multiple clustering results overlap, then you can feel confident that the areas of overlap are "correct" / "optimal." Again, this is not implemented in Enterprise Miner, and is quite involved. That being said, it is possible to do using SAS data step code and the procedures / nodes in Enterprise Miner.&lt;/P&gt;
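&lt;P&gt;The core building block of consensus clustering is easy to state; a minimal, non-SAS sketch of the co-association matrix (how often each pair of observations lands in the same cluster across runs):&lt;/P&gt;

```python
import numpy as np

def coassociation_matrix(labelings):
    # Average over clustering runs of the indicator "same cluster?"
    # per pair; 1 - M can then be fed to a final clustering as a
    # distance matrix to ensemble the runs.
    labelings = [np.asarray(l) for l in labelings]
    n = len(labelings[0])
    M = np.zeros((n, n))
    for lab in labelings:
        M += (lab[:, None] == lab[None, :]).astype(float)
    return M / len(labelings)
```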
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hopefully some of this helps, either immediately, or by giving you things to think about.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 07 May 2018 15:20:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/What-are-the-characteristics-of-a-good-cluster/m-p/460470#M6963</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-05-07T15:20:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate the ROC Index Confidence Interval in Enterprise Miner 14.1</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-to-calculate-the-ROC-Index-Confidence-Interval-in-Enterprise/m-p/459350#M6951</link>
      <description>&lt;P&gt;Unfortunately to my knowledge, Enterprise Miner does not compute confidence intervals for the ROC index.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hadn't heard of confidence intervals on ROC curves before, so I decided to look it up. I found several different methods, so unfortunately I'm not sure which one you need for your work.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Ultimately, if you know the equations behind the confidence interval calculation, then you can use a SAS code node to do the computations for the ROC index and the confidence interval after the modeling node.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I hope this helps guide your work. Let me know if there's anything I can help with here.&lt;/P&gt;</description>
      <pubDate>Wed, 02 May 2018 15:01:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-to-calculate-the-ROC-Index-Confidence-Interval-in-Enterprise/m-p/459350#M6951</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-05-02T15:01:06Z</dc:date>
    </item>
    <item>
      <title>Re: How do i compare results of  k-means, agglomerative heirarchical  clustering &amp;  kohnen SOM ?</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/How-do-I-compare-results-of-k-means-agglomerative-hierarchical/m-p/459329#M6948</link>
      <description>&lt;P&gt;As mentioned previously, you can do an analysis of variance. I don't know if there is an Enterprise Miner node that does this either, but you can use the SAS code node to run the PROC GLM.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In general it can be hard to determine "best" clusters/segments. There are multiple measures, but some of them don't compare across types of clustering methods (centroid based vs hierarchical based). Ultimately it may be worth trying all the clustering methods, and then computing your analysis on each set. If your analysis is better on one set of segments/clusters than another, then that could be one way to determine "best."&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you're looking at the results of a final modeling, then you might need to consider a holdout set so that you're not biased in determining your best clusters. Ultimately approaching the definition of "best" in this way ties the definition of best clusters to the ultimate modeling results that you're looking for.&lt;/P&gt;</description>
      <pubDate>Wed, 02 May 2018 14:35:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/How-do-I-compare-results-of-k-means-agglomerative-hierarchical/m-p/459329#M6948</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-05-02T14:35:48Z</dc:date>
    </item>
    <item>
      <title>Re: Clustering in SAS Miner: Number of clusters determination, input data, and results interpretatio</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Clustering-in-SAS-Miner-Number-of-clusters-determination-input/m-p/459326#M6947</link>
      <description>&lt;P&gt;Yingjian's post has some very good points. I wanted to add a bit more, and also some about clustering in Enterprise Miner.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In Enterprise Miner there is the "Cluster" node that is under the Explore tab. This node uses PROC CLUSTER to compute the clustering. In this node there is a Cubic Clustering Criterion (CCC) that attempts to determine the number of clusters while performing the analysis. In general, there aren't many ways to accurately gain a good view of how many clusters there should be a priori, unless information is known about the data before hand.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Determining the best result (average, centroid, Ward) requires your definition of best. In centroid-based methods, many people will try to define best by looking at the total sum of distances from points to their respective centroids, but in non-centroid-based methods this is no longer a useful measure. Ultimately I think the results of your later analysis may be how you want to determine which of the clustering results was "best."&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, available in Enterprise Miner 12.3 and later, there is the "HP Cluster" node under the HPDM tab. This node uses PROC HPCLUS to run kmeans clustering. PROC HPCLUS does have the Aligned Box Criterion (ABC) that Yingjian mentioned for determining the number of clusters. If you have the chance to try the HP Cluster node, you may find it has some capabilities you would find useful.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, to address the "best" number of clusters question, Yingjian is correct in that it is very difficult to say what number is best. Even defining what best means in the clustering context can be difficult.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you are unsatisfied with the results of a single clustering, there is an approach that you can try called consensus clustering. This approach is to cluster the data multiple times and attempt to ensemble the results of all the clustering runs into one final clustering. Enterprise Miner has no node that will do this for you automatically, but you can do this using multiple PROC CLUSTER calls and some other data step code. This would require a SAS code node, but is an interesting approach if you're looking to do something more (it might require a bit of research to get started though).&lt;/P&gt;</description>
      <pubDate>Wed, 02 May 2018 14:28:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Clustering-in-SAS-Miner-Number-of-clusters-determination-input/m-p/459326#M6947</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2018-05-02T14:28:20Z</dc:date>
    </item>
    <item>
      <title>Re: Enterprise Miner Decision Tree results in only one node as opposed to an entire tree of results</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-Decision-Tree-results-in-only-one-node-as/m-p/420497#M6436</link>
      <description>&lt;P&gt;Have you also tried the HP Tree node? You lose some flexibility, such as the interactive decision tree, but it might also be something to explore.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Dec 2017 15:22:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-Decision-Tree-results-in-only-one-node-as/m-p/420497#M6436</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2017-12-12T15:22:11Z</dc:date>
    </item>
    <item>
      <title>Re: Proc HPSVM</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Proc-HPSVM/m-p/420495#M6435</link>
      <description>&lt;P&gt;I'm not 100% sure about how HPSVM calculates these numbers, but here is some information on the number of support vectors in general.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In a linear model, if the data is completely separable then the number of support vectors will equal the number of support vectors on the margin. Being completely separable means that you can draw a hyperplane that completely separates the two classes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In general the data is more mixed, and thus there is no hyperplane that will completely separate the two classes. This is why the penalty parameter exists - to modify the optimization (this is the proc option &lt;STRONG&gt;C&lt;/STRONG&gt; in HPSVM). In this case the support vectors are the vectors on the margin (both sides of the hyperplane) AND all the vectors in between the two margins. Thus there will be more support vectors than support vectors on the margin, because of the vectors in between.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Adjusting the penalty (proc option C) can adjust the number of support vectors in between the margins, but this also has an effect on the accuracy of the model.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think there are some extra nuances with non-linear models and the number of support vectors, but the general idea still holds.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hopefully this helps.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Dec 2017 15:15:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Proc-HPSVM/m-p/420495#M6435</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2017-12-12T15:15:23Z</dc:date>
    </item>
    <item>
      <title>Re: Random Forests / Decision Trees: Counting the number of nodes per categorical target level</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Decision-Trees-Counting-the-number-of-nodes-per/m-p/417430#M6391</link>
      <description>&lt;P&gt;If you're using HPSPLIT for your decision tree, then you can use the "NODES" option. This generates a table with the name NODETABLE. You can save this ODS table and use if further analysis.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc hpsplit data=X nodes;
   input ...;
   target ...;
   ods output nodetable=MyName;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This table (MyName in the example above) contains an ID string for each terminal (leaf) node. It also includes the path from the root node to the leaf node. In addition it includes the proportion of the events at each node in the path.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 15:36:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Random-Forests-Decision-Trees-Counting-the-number-of-nodes-per/m-p/417430#M6391</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2017-11-30T15:36:59Z</dc:date>
    </item>
    <item>
      <title>Re: PROC HPSPLIT: is this decision tree tool good for categorising respondents?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-HPSPLIT-is-this-decision-tree-tool-good-for-categorising/m-p/339013#M17858</link>
      <description>&lt;P&gt;I do agree with you that you shouldn't see a split as you've described for a variable with only the numbers 1-5.&amp;nbsp;It would be hard for me to say anything more diffinitive without knowing more about what your procedure call was, or seeing the tree diagram that you saw.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Mar 2017 21:23:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-HPSPLIT-is-this-decision-tree-tool-good-for-categorising/m-p/339013#M17858</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2017-03-07T21:23:29Z</dc:date>
    </item>
    <item>
      <title>Re: PROC HPSPLIT: is this decision tree tool good for categorising respondents?</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/PROC-HPSPLIT-is-this-decision-tree-tool-good-for-categorising/m-p/337033#M17775</link>
      <description>&lt;P&gt;The marked values are the values at which the split occurs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let us consider Node 0, the top of the tree. This represents the full set of observations. The two thick lines descending from Node 0 represent the split of the full set of observations into two smaller groups. That split is determined using the variable "Flav" and the value 1.572. That is to say, all observations with Flav &amp;lt; 1.572 go into Node 1, while all observations with Flav &amp;gt;= 1.572 go into Node 2.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Node 2 is split again, this time using the variable "Proline" and the value of Proline which is used to split the set of observations is 726.640. Node 3 represents all observations that have Flav &amp;gt;= 1.572 AND Proline &amp;lt; 726.640. Node 3 is composed of 54 observations (This is the number corresponding to "N" on the node), and the Node is 98.15% made up of level 2 (which according to the legend is Cultivar=2).&lt;/P&gt;
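&lt;P&gt;Restated as (hypothetical, illustrative) code, the routing described above is just nested comparisons; the sibling leaf of Node 3 is numbered 4 here purely for illustration:&lt;/P&gt;

```python
def assign_node(flav, proline):
    # Route an observation down the two splits from the plot
    if flav >= 1.572:
        if proline >= 726.640:
            return 4  # Flav >= 1.572 and Proline >= 726.640 (sibling leaf)
        return 3      # Flav >= 1.572 and Proline below 726.640
    return 1          # Flav below 1.572
```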
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Ultimately those highlighted numbers represent the value at which the split occurs for the variable (for continuous variables). If you have nominal variables, instead of &amp;lt; and &amp;gt;=, the different levels will be indicated on the different splits.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hopefully this helps!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;-Ralph&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 15:44:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/PROC-HPSPLIT-is-this-decision-tree-tool-good-for-categorising/m-p/337033#M17775</guid>
      <dc:creator>RalphAbbey</dc:creator>
      <dc:date>2017-03-01T15:44:02Z</dc:date>
    </item>
  </channel>
</rss>

