## Variable distribution and diversity measure for selection

I am trying to find an alternative way of checking the diversity of a variable's values. My goal is to define a measure or weight that indicates how well a variable is spread across its values, i.e. that it isn't dominated by a single value covering, let's say, 70% or 80% of the observations. (Another example: variable X has 503 distinct values in 2000 obs, which I guess is good.)

My goal is to use that measure to select variables for segmentation modeling, because I believe they can discriminate my data well.

I am looking for something besides PROC UNIVARIATE and PROC MEANS for stats, or VARCLUS and PCA for variable selection. Any ideas?
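To make the "70% or 80% of the same value" idea concrete, here is a minimal sketch in Python (not SAS) of one possible check: the share of observations taken by the single most common value. The function name `dominant_share` and the sample data are made up for illustration, and treating `None` as missing is an assumption.

```python
from collections import Counter

def dominant_share(values):
    """Fraction of non-missing observations taken by the most common value."""
    # Assumption: None marks a missing value and is ignored.
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values")
    top_count = Counter(observed).most_common(1)[0][1]
    return top_count / len(observed)

# Hypothetical variable: 8 of 10 observations share one value.
status = ["A"] * 8 + ["B", "C"]
print(dominant_share(status))  # 0.8
```

A variable with a share near 1 is close to constant; where to put the cut-off (0.7, 0.8, or something else) is a judgment call.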

Re: Variable distribution and diversity measure for selection

The NLEVELS option in PROC FREQ will tell you how many distinct values each variable has.

At first glance, it sounds to me like you're looking for a method more than a proc, i.e. a uniqueness measure.

Not having any science behind this, I'd consider looking at the percent of unique values, i.e. 503/2000 is about 25% uniqueness. That reading assumes the values are evenly distributed, which is unlikely.
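As a rough sketch of the percent-unique idea, in Python rather than SAS: count the distinct values and divide by the number of observations. The function name `uniqueness_ratio` and both data sets are invented for illustration. The second data set shows why the even-distribution caveat matters: it has the same ratio while one value covers about 75% of the observations.

```python
def uniqueness_ratio(values):
    """Return (number of distinct values, distinct / non-missing observations)."""
    # Assumption: None marks a missing value and is ignored.
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values")
    n_levels = len(set(observed))
    return n_levels, n_levels / len(observed)

# Made-up data with 503 distinct values in 2000 observations, spread evenly.
even = [i % 503 for i in range(2000)]
# Made-up data with the same counts, but one value repeated 1498 times.
skewed = [0] * 1498 + list(range(1, 503))

print(uniqueness_ratio(even))    # (503, 0.2515)
print(uniqueness_ratio(skewed))  # (503, 0.2515)
```

So the ratio alone cannot tell these two cases apart, which is one reason to pair it with a measure that looks at the frequencies.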

I have to do this in a few weeks for something I'm working on, so if you find something else that works please post back!

Re: Variable distribution and diversity measure for selection

Hi Reeza,

I have used PROC FREQ with NLEVELS, and it gives me an indication that I would say is OK at this point.

I agree this is not about procs; it is probably something that could be set up with some coding.

I will keep you updated on the matter via this post. My thought was something like entropy weights, which give an indication of diversity within a variable, but maybe I am assuming that for the wrong types of variables.
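One way the entropy idea could look, as a sketch in Python: normalized Shannon entropy, which is 0 when a variable has a single value and 1 when all of its distinct values are equally frequent. This is an assumption about what "entropy weights" means here, not the poster's actual method, and the name `normalized_entropy` is made up. It applies directly to categorical or discrete variables; a continuous variable would need to be binned first.

```python
import math
from collections import Counter

def normalized_entropy(values):
    """Shannon entropy of the value frequencies, scaled to the range [0, 1]."""
    # Assumption: None marks a missing value and is ignored.
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    k = len(counts)
    if k <= 1:
        return 0.0  # a constant (or empty) variable carries no diversity
    n = len(observed)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)  # log(k) is the maximum entropy for k levels

# Hypothetical variables with the same four levels but different spreads.
balanced = ["a", "b", "c", "d"] * 25
dominated = ["a"] * 85 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5

print(normalized_entropy(balanced))
print(normalized_entropy(dominated))
```

The balanced variable scores close to 1 and the dominated one scores much lower, so the value could serve as a ranking weight across candidate variables.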

Re: Variable distribution and diversity measure for selection

For segmentation you need variables that have high variability and are also uncorrelated with each other. Otherwise the solution may not converge.
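A simple way to screen for the "uncorrelated" part, sketched in Python: compute the Pearson correlation for every pair of numeric candidates and flag the pairs above a cut-off. The function names, the 0.9 threshold, and the sample columns are all assumptions for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant variable has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| at or above the threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:  # NaN compares False, so constants are skipped
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical candidate variables; income and spend move almost together.
data = {
    "income": [20, 35, 50, 65, 80],
    "spend": [2.1, 3.4, 5.2, 6.3, 8.0],
    "age": [30, 52, 25, 47, 38],
}
for name_a, name_b, r in correlated_pairs(data):
    print(name_a, name_b, round(r, 3))
```

From each flagged pair you would typically keep only one variable before clustering.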

Re: Variable distribution and diversity measure for selection

eMiner does some of this automatically, as does a lot of the automated data mining software. My plan was to look into their classification methods and decide how I wanted to do mine.

Re: Variable distribution and diversity measure for selection

Sounds like a plan, but I will try something in code. I don't trust eMiner that much.

Re: Variable distribution and diversity measure for selection

What's not to trust about eMiner?

It's not really a black box tool and definitely requires user experience in both the tool and statistical methods.

Re: Variable distribution and diversity measure for selection

I have noticed many bugs and errors in eMiner computations. It's mostly good to use once the analytical record is set and ready after coding, i.e. for predictive modeling (and model comparison) or segmentation. Time efficiency is mainly what it offers.

Re: Variable distribution and diversity measure for selection

Can you expand on the bugs/wrongs? I'm getting ready to use eMiner for a large, important project and am highly interested if there's a reason I shouldn't be.

Re: Variable distribution and diversity measure for selection

Depends, what type of project is it?

Re: Variable distribution and diversity measure for selection

Fraud detection is the general purpose.

Re: Variable distribution and diversity measure for selection

Hi Chemicalab,

If you think there are bugs in your system, talk to your SAS Admin or to Tech Support.
Make sure that the hot fixes you need have been applied. Feel free to google about your EM version, e.g. google "SAS Enterprise Miner 12.1 Hot Fix" and see what is there.
For the most part, we find the bugs before our customers do.

Good luck!
-Miguel

Re: Variable distribution and diversity measure for selection

Wow, great post everyone, very lively today! With respect to any bugs or issues with Enterprise Miner, SAS Tech Support is a great resource for troubleshooting. With respect to coding inside EM, there are many options to customize your flows, including the Code Node, the Transformations node, etc. As Reeza stated earlier, some of the finer features may require training and experience. There are many, many macros and macro variables available to add to your coding experience.

Re: Variable distribution and diversity measure for selection

Will make sure to do that. Regarding the initial question of this post: Reeza, I will get back to you on the diversity measure. I think I am on to something, but it requires some coding. It will work towards the entropy weights I mentioned earlier. I will let you know how it goes.

