Ribbonjovi
Obsidian | Level 7

Hi All,

Does anyone know of any drawbacks to increasing the maximum number of class levels (to 10,000 in this case) for input variables in Clustering / Segment Profiling in SAS EM 14.2?

 

Many Thanks

Roby

 

2 REPLIES
DougWielenga
SAS Employee

The first question is whether or not you truly have a categorical variable. If you have an integer-valued variable that can take values from 1 to 7, it might be useful to treat it as numeric if it makes sense to do so. For example, if the value 3 is truly 3 times as much as 1 and 7 is 7 times as much as 1, you have a numeric value for which the average makes sense even though no single observation could be 'average'. This can happen with things like the number of children in a family, where you have different integer values but nothing in between.

 

The problem with treating these numeric values as categorical is that you lose several things:

  1 - the relative size information (since 4 is 4 times as much as 1)

  2 - the interpolation for values in between (what if your data only has families with 1, 2, 3, or 5 kids? You can't score an observation with 4 or 6+ kids, since none of those values were observed)

  3 - if you have a binary target, you must have both events and non-events for every observed level in order to avoid problems with optimization/estimation (look up complete and quasi-complete separation in logistic regression)
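As a rough illustration of point 3 (a Python sketch, not SAS code and not an Enterprise Miner feature), the check below lists the levels of a class variable that contain only events or only non-events. The function name and the toy data are made up for the example.

```python
from collections import defaultdict

def levels_missing_outcomes(levels, targets):
    """Return the levels that have only events or only non-events.

    `targets` is assumed to be a binary 0/1 target.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [non-events, events]
    for level, y in zip(levels, targets):
        counts[level][y] += 1
    return sorted(
        level for level, (n0, n1) in counts.items() if n0 == 0 or n1 == 0
    )

# Hypothetical data: number of kids treated as a class variable.
kids = [1, 1, 2, 2, 3, 3, 5, 5]
target = [0, 1, 0, 1, 1, 1, 0, 0]

problem_levels = levels_missing_outcomes(kids, target)
print(problem_levels)  # [3, 5]
```

Levels 3 and 5 are flagged here because one has only events and the other only non-events. With thousands of levels, a check like this tends to flag a large share of them.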

 

If these values really do represent different groups (e.g. say zipcodes), then you run into a different set of problems:

   1 - the Pareto principle often applies where 80% of the data is located in 20% of the group levels

   2 - the larger the number of groups, the less data is available to estimate each one

   3 - parametric models typically require far more estimated parameters to model a grouping variable (k-1 parameters for a variable with k levels)

   4 - very little data is available in most groups

   5 - it would take an excessive amount of time to look at hundreds of individual results let alone thousands
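To get a feel for points 1 to 4 before raising the class-level limit, a quick summary of the variable can help. The following is a minimal Python sketch with made-up level names and counts; `cardinality_summary` is a hypothetical helper, not part of any SAS product.

```python
from collections import Counter
import math

def cardinality_summary(values, top_fraction=0.2):
    """Summarise how observations are spread across the levels of a class variable."""
    counts = Counter(values)
    k = len(counts)  # number of distinct levels
    n = len(values)  # number of observations
    top_k = max(1, math.ceil(top_fraction * k))
    top_share = sum(c for _, c in counts.most_common(top_k)) / n
    return {
        "levels": k,
        "dummy_parameters": k - 1,  # k-1 parameters for k levels
        "top_levels": top_k,
        "top_share": top_share,  # share of data in the largest levels
        "singleton_levels": sum(1 for c in counts.values() if c == 1),
    }

# Hypothetical zipcode-like variable: 10 levels, 100 observations.
sizes = [50, 30, 5, 4, 3, 3, 2, 1, 1, 1]
zips = [f"zip_{i:02d}" for i, size in enumerate(sizes) for _ in range(size)]

summary = cardinality_summary(zips)
print(summary)
```

In this toy data the two largest levels hold 80% of the observations and three levels have a single observation each, which is the pattern described in the list above.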

   

Take a grocery store example where you might have hierarchies to consider:

    * The large category (dairy, grain, meat, produce)

    * A finer category (e.g. for dairy, you might have milk, cheese, ice cream, yogurt, etc....)

    * An even finer category (you could break it out by brand name)

    * The finest category (break out by size and style -- e.g. skim milk 8 oz vs 2% milk 4 oz -- based on the SKU code)

Doing modeling at the SKU level is likely not helpful due to the sheer number of different SKU codes, but doing it at the top level (dairy vs grain vs produce) is too high-level to see any detail.  You need to choose an appropriate level for what you are trying to accomplish.  
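The roll-up itself is just a lookup from the finest level to whichever coarser level you choose. Here is a small Python sketch of the idea; the SKU codes and the two-level hierarchy are invented for illustration.

```python
from collections import Counter

# Hypothetical lookup: SKU -> (department, category)
SKU_HIERARCHY = {
    "SKU001": ("dairy", "milk"),
    "SKU002": ("dairy", "milk"),
    "SKU003": ("dairy", "cheese"),
    "SKU004": ("grain", "bread"),
    "SKU005": ("produce", "apples"),
}

def roll_up(skus, level):
    """Map each SKU to the chosen level of the hierarchy."""
    index = {"department": 0, "category": 1}[level]
    return [SKU_HIERARCHY[sku][index] for sku in skus]

purchases = ["SKU001", "SKU002", "SKU003", "SKU004", "SKU005", "SKU001"]

by_category = Counter(roll_up(purchases, "category"))
by_department = Counter(roll_up(purchases, "department"))
print(by_category)
print(by_department)
```

Five SKU levels collapse to four categories or three departments here. In practice the lookup would come from your product master data, and the level you roll up to depends on what you are trying to accomplish.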

Now consider that you are looking to do clustering or segment profiling.  These approaches are meant to allow for interpretation which means you need to understand why and how the groups differ.  Once groups have been distinguished on some key attributes, they are less able to be separated by others.   Additionally, interpretation relies on your ability to apply business knowledge to interpret the results.  You are not likely to have any chance of applying meaningful interpretation to 512 let alone 10,000 distinct groups.  

 

If your data is truly non-numeric, consider doing one of the following:

    * Use the levels themselves and any hierarchy that is involved to create the groups -- they are already grouped because you are talking about grouping variables!  Why use clustering which can only muddy the waters?

    * Identify meaningful subsets of variables to create clusters.  You can then use the 'cluster' built on each subset of variables to create a profile which uses many dimensions -- one for each subset.  This is much easier to interpret than an overall cluster solution.

    * Identify subsets of observations which might not naturally group neatly (e.g. high-value customers vs. low-value customers) and then cluster the larger groups separately to allow for different variables to be used in profiling.

 

In the end, the usefulness of a segment profile is tied to your interpretation which becomes difficult if not impossible when too many levels are involved.   


I hope this helps!

Doug

Ribbonjovi
Obsidian | Level 7

Hi Doug

Thank You for the detailed explanation of the problem.

 

"Identify meaningful subsets of variables to create clusters.  You can then use the 'cluster' built on each subset of variables to create a profile which uses many dimensions -- one for each subset.  This is much easier to interpret than an overall cluster solution.

    * Identify subsets of observations which might not naturally group neatly (e.g. high-value customers vs. low-value customers) and then cluster the larger groups separately to allow for different variables to be used in profiling."

 

I really liked these points of yours... and that's what I'm doing now.

 

 

Cheers

Roby

