jedakyu
Calcite | Level 5

Hi there,

I'm using SAS Enterprise Miner to cluster customers on a dataset taken at a single extraction date. After finalizing the model, I wanted to score a new dataset against the existing clustering model (k-means in SAS-EM). I found that about 80% of the records are classified as having no cluster assigned. Only a few of them could be scored with a segment number attached.

I suspected the data transformation step, where I binned continuous variables into nominal variables within SAS-EM. But I have reviewed all of the model's variables: all the ranges are covered, and there are no missing values in any cell of the tables used for modeling and scoring.

So I wonder whether this is a limitation of the algorithm: does it treat 80% of the scoring data as outliers, so that SAS-EM cannot assign those records to any cluster from the training model?

Is there a setting that lets me force the scoring cutoff to assign every scoring record to the nearest possible cluster, so that all scoring records get a segment label?

Thanks,

3 REPLIES
AnnaBrown
Community Manager

Hi jedakyu,

I checked with a SAS Education Specialist on this and here is some insight.

 

     If the score data set has a nominal variable with a code that did not exist in the training data, then the observation associated with the “new” code will not be assigned to a segment. That is, it will be assigned to the “missing” segment.

     Binning should not cause this sort of nominal code mismatch. However, you need to attach the score data Input Data Source node directly to the Score node. If you attempt to bin the score data separately with the Transform Variables node, you are likely to get different bins, and there will almost certainly be a nominal code mismatch.


     In general, it is considered unwise to use nominal variables in K-Means clustering, and binning would not be recommended.
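The nominal code mismatch described above can be checked outside of Enterprise Miner before scoring. The following is a minimal Python sketch, not SAS-EM functionality: the function name `unseen_levels`, the variable `age_bin`, and the sample rows are all hypothetical, and it assumes both tables have been exported as lists of dictionaries.

```python
def unseen_levels(train_rows, score_rows, nominal_vars):
    """Return {variable: levels found in score_rows but never in train_rows}."""
    unseen = {}
    for var in nominal_vars:
        train_levels = {row[var] for row in train_rows}
        extra = {row[var] for row in score_rows} - train_levels
        if extra:
            unseen[var] = extra
    return unseen


# Hypothetical data: the score table contains a bin the training table never saw.
train = [{"age_bin": "18-30"}, {"age_bin": "31-50"}]
score = [{"age_bin": "18-30"}, {"age_bin": "51+"}]

print(unseen_levels(train, score, ["age_bin"]))  # {'age_bin': {'51+'}}
```

Any variable reported here has levels that the trained model has no code for, so records carrying those levels are candidates for the “missing” segment.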

I hope this is helpful.

Anna


Join us for SAS Community Trivia
SAS Bowl XXIX, The SAS Hackathon
Wednesday, March 8, 2023, at 10 AM ET | #SASBowl

jedakyu
Calcite | Level 5

Hi abrown,

Thanks for your help in clarifying things.

I have now figured out the problem. One variable in a source table, from before I combined the tables into the analytical base table (ABT), included byte(13) characters (carriage returns), while the scoring table was built with a different ETL script that had already compressed the text. EM therefore treated the two versions of the variable as different values, which led to a mismatch between the ABT and the scoring data.

I found this error after investigating the SAS score code within EM, and now everything is fixed.
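For anyone hitting the same thing, here is a small Python sketch of why a hidden byte(13) breaks the match and how stripping control characters restores it. This is only an illustration, not the EM score code or my ETL script: the helper names `has_control_chars` and `clean_level` and the value `"GOLD"` are made up for the example.

```python
import re

# ASCII control characters, including carriage return (13) and line feed (10).
_CONTROL = re.compile(r"[\x00-\x1f\x7f]")


def has_control_chars(value):
    """True if the string contains any ASCII control character."""
    return bool(_CONTROL.search(value))


def clean_level(value):
    """Remove control characters, then trim surrounding spaces."""
    return _CONTROL.sub("", value).strip()


abt_value = "GOLD\r"    # hypothetical level carrying a trailing byte(13)
score_value = "GOLD"    # same level after the other ETL script cleaned it

print(abt_value == score_value)                            # False
print(has_control_chars(abt_value))                        # True
print(clean_level(abt_value) == clean_level(score_value))  # True
```

The two strings look identical when printed in a table viewer, which is why reviewing the ranges by eye did not reveal the difference.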

Thanks.

AnnaBrown
Community Manager

Excellent, jedakyu, I'm glad you resolved the issue!

Anna


