04-24-2014 04:38 AM
I am trying to find an alternative way of checking the diversity of a variable's values. My goal is to define a measure or weight that indicates how well a variable is spread across its values, rather than being dominated by a single value that accounts for, say, 70% or 80% of the observations. (As another example, variable X has 503 distinct values in 2000 observations, which I guess is good.)
I want to use that measure to select variables for segmentation modeling, because I believe such variables can discriminate my data well.
I am looking for something besides PROC UNIVARIATE or PROC MEANS for statistics, or VARCLUS or PCA for variable selection. Any ideas?
Thank you in advance
04-25-2014 10:50 AM
The NLEVELS option in PROC FREQ will tell you how many distinct values each variable has.
At first glance, it sounds to me like you're looking for a method more than a proc, i.e. a uniqueness measure.
I don't have any science behind this, but I'd consider looking at the percent of unique values, i.e. 503/2000 is about 25% uniqueness. That reading assumes the values are equally distributed, which is unlikely.
I have to do this in a few weeks for something I'm working on, so if you find something else that works please post back!
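The percent-unique idea above, and its caveat about unequal distribution, can be illustrated with a minimal sketch. It is written in Python rather than SAS purely for illustration; the function name `uniqueness_summary` and both data sets are made up and are not from any poster's data. Reporting the share of the most common value next to the uniqueness ratio shows the "70% or 80% of the same value" case that a distinct count alone would hide.

```python
from collections import Counter

def uniqueness_summary(values):
    """Return (distinct count, uniqueness ratio, share of the most common value)."""
    counts = Counter(values)
    n = len(values)
    n_distinct = len(counts)
    top_share = max(counts.values()) / n
    return n_distinct, n_distinct / n, top_share

# Hypothetical variable: 2000 observations cycling evenly over 503 values.
even = [i % 503 for i in range(2000)]

# Hypothetical variable: one value fills 80% of the rows, the rest are unique.
skewed = ["A"] * 1600 + list(range(400))

for name, data in (("even", even), ("skewed", skewed)):
    n_distinct, ratio, top_share = uniqueness_summary(data)
    print(name, n_distinct, round(ratio, 4), round(top_share, 4))
```

In this sketch the skewed variable still has several hundred distinct values, so the distinct count alone looks healthy, while the top-value share exposes the concentration.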
04-25-2014 11:59 AM
I have used PROC FREQ with NLEVELS, and it gives me an indication that I would say is acceptable at this point.
I agree this is not about procs; it is probably something that could be set up by coding.
I will keep you updated on the matter via this post. My thought was something like entropy weights, which give an indication of diversity within a variable, but maybe I am assuming that for the wrong types of variables.
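One possible reading of the entropy idea is a normalized Shannon entropy per variable. The sketch below is in Python rather than SAS, and it is an assumption about what "entropy weights" could mean here, not the poster's actual method. The function name `normalized_entropy` is hypothetical, and dividing by the log of the observed number of distinct values is just one normalization choice. A score near 1 means the values are spread evenly; a score near 0 means one value dominates.

```python
import math
from collections import Counter

def normalized_entropy(values):
    """Shannon entropy of the value frequencies, scaled to the range [0, 1]."""
    counts = Counter(values)
    n = len(values)
    k = len(counts)
    if k <= 1:
        # A constant (or empty) variable carries no diversity.
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Divide by the maximum entropy attainable with k observed levels.
    return h / math.log(k)

# Made-up examples for illustration only.
uniform = ["a", "b", "c", "d"] * 5
skewed = ["a"] * 80 + ["b"] * 20
constant = ["a"] * 10

print(round(normalized_entropy(uniform), 3))
print(round(normalized_entropy(skewed), 3))
print(round(normalized_entropy(constant), 3))
```

This kind of score fits categorical or discrete variables. For a continuous variable where nearly every value is unique, the raw-value entropy is close to its maximum regardless of shape, so the values would need to be binned first.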
04-25-2014 12:35 PM
For segmentation you need variables that have more variability and are also uncorrelated. Otherwise the solution will not converge.
04-25-2014 12:53 PM
eMiner does some of this automatically, as does a lot of the automated data mining software. My plan was to look into their classification methods and then decide how I wanted to do mine.
04-25-2014 01:51 PM
What's not to trust about eMiner?
It's not really a black box tool and definitely requires user experience in both the tool and statistical methods.
04-25-2014 03:17 PM
I have noticed many bugs and errors in eMiner computations. It is mostly good to use once the analytical record is set and ready after coding, i.e. for predictive modeling (and model comparison) or segmentation. Time efficiency is mainly what it offers.
04-25-2014 03:20 PM
Can you expand on the bugs/wrongs? I'm getting ready to use eMiner for a large, important project and am highly interested if there's a reason I shouldn't be.
04-25-2014 03:32 PM
If you think there are bugs in your system, talk to your SAS Admin or to Tech Support.
Make sure that the hot fixes you need have been applied. Feel free to google about your EM version, e.g. google "SAS Enterprise Miner 12.1 Hot Fix" and see what is there.
For the most part, we find the bugs before our customers do.
04-25-2014 03:25 PM
Wow, great post everyone, very lively today! With respect to any bugs or issues with Enterprise Miner, SAS Tech Support is a great resource for troubleshooting. With respect to coding inside EM, there are many options to customize your flows, including the Code Node, the Transformations node, etc. As Reeza stated earlier, some of the finer features may require training and experience. There are many macros and macro variables available to add to your coding experience.
Product Manager - SAS Enterprise Miner
04-25-2014 03:39 PM
Will make sure to do that. Regarding the initial question of this post: Reeza, I will get back to you on the diversity measure. I think I am on to something, but it requires some coding. I will be working towards the entropy weights I mentioned earlier and will let you know how it goes.