10-15-2013 02:25 AM
I'm using SAS Enterprise Miner to cluster customers on a specific dataset taken at one extraction date. After finalizing the model, I wanted to score a new dataset against the existing clustering model (k-means in SAS-EM). I found that 80% of the records were classified with no cluster assigned. Only a few of them could be scored with a segment number attached.
I suspected the data transformation step, where I binned continuous variables into nominal ones within SAS-EM. But I have reviewed all of the model's variables: all the ranges are covered, and there are no missing values in any cell of the tables used for modeling and scoring.
So I wonder whether this is a limitation of the algorithm: does it treat 80% of the scoring data as outliers, so that SAS-EM cannot assign those records to any cluster from the training model?
Is there a setting that lets me force the scoring cutoff so that every scoring record is assigned to the nearest possible cluster? That way, all the scoring records would have a segment label.
10-16-2013 03:28 PM
I checked with a SAS Education Specialist on this and here is some insight.
If the score data set has a nominal variable with a code that did not exist in the training data, then the observation associated with the “new” code will not be assigned to a segment. That is, it will be assigned to the “missing” segment.
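One way to check for this outside of EM is to compare the set of codes each nominal variable takes in the two tables. The following is only a sketch in plain Python, not EM's own logic; the function name `unseen_levels` and the sample variables `region` and `tier` are made up for illustration.

```python
def unseen_levels(train_rows, score_rows, nominal_vars):
    """Return {variable: sorted codes present in score data but absent from training}."""
    unseen = {}
    for var in nominal_vars:
        train_codes = {row[var] for row in train_rows}
        score_codes = {row[var] for row in score_rows}
        extra = score_codes - train_codes
        if extra:
            unseen[var] = sorted(extra)
    return unseen


# Hypothetical sample rows; "W" never appears in the training data.
train = [{"region": "N", "tier": "A"}, {"region": "S", "tier": "B"}]
score = [{"region": "N", "tier": "A"}, {"region": "W", "tier": "B"}]

print(unseen_levels(train, score, ["region", "tier"]))
```

Any variable that shows up in the result is a candidate for the records landing in the "missing" segment.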
Binning should not cause this sort of nominal code mismatch. However, you need to attach the score data Input Data Source node directly to the Score node. If you attempt to bin the score data separately with the Transform Variables node, you are likely to get different bins, and there will almost certainly be a nominal code mismatch.
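To see why separate binning can produce a mismatch, here is a small sketch in plain Python. It assumes simple equal-width binning, which may not be the method your Transform Variables node is configured to use; the helper names and the income values are hypothetical.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points for n_bins equal-width bins over the range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Index of the bin that value falls into, given interior cut points."""
    return bisect_right(edges, value)


# Hypothetical data: the score table has a wider range than the training table.
train_income = [10, 20, 30, 40, 50, 60]
score_income = [10, 15, 20, 25, 30, 100]

train_edges = equal_width_edges(train_income, 3)  # derived from training data
score_edges = equal_width_edges(score_income, 3)  # derived separately from score data

value = 35
print(assign_bin(value, train_edges), assign_bin(value, score_edges))
```

The same value of 35 gets a different bin code depending on which table the cut points came from, so the cut points learned on the training data have to be the ones applied at scoring time.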
In general, it is considered unwise to use nominal variables in K-Means clustering, and binning would not be recommended.
I hope this is helpful.
10-17-2013 10:56 PM
Thanks for your help in clarifying things.
I have figured out the problem now. One variable from a source table, taken before I combined the sources into the analytical base table (ABT), included byte(13) (a carriage return), while the scoring table was built with a different ETL script that had already compressed the text. So EM treated the two values as different, which led to a mismatch between the ABT and the scoring data.
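For anyone hitting the same thing, the effect is easy to reproduce. This is just a sketch in plain Python, assuming that "compressed" here means the carriage return was stripped; the value `GOLD` and the helper `strip_cr` are made up for illustration.

```python
def strip_cr(value):
    """Remove carriage-return characters (character code 13) from a string."""
    return value.replace("\r", "")


abt_value = "GOLD\r"   # hypothetical training value with a trailing carriage return
score_value = "GOLD"   # same value after the scoring ETL removed it

print(repr(abt_value), repr(score_value))            # repr() makes the hidden character visible
print(abt_value == score_value)                      # the raw values do not match
print(strip_cr(abt_value) == strip_cr(score_value))  # they match once both are cleaned
```

The two values look identical when printed normally, which is why the mismatch was hard to spot by inspecting the tables.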
I found this error after investigating the SAS scoring code within EM, and now everything is fixed.