SAS EMiner- Variable Selection

SGhosh · Posted 09-13-2017 05:19 PM

I am new to SAS EMiner, so any response on this would be very helpful.

How can I select variables in the data node of SAS EMiner? To define it more clearly, I would say if I have ~50 variables, how could I select / determine the strongest variables for my model?

When I run my model package, the log shows "Pr > ChiSq" as 1.0000 for most of the variables, which suggests my variables have some issue. How can I fix it? For example, I have a variable called claim_count, and instead of keeping the values continuous I grouped them into buckets like 1-10, 11-20, etc.

Thanks in advance

DougWielenga · Posted 09-15-2017 11:37 AM

How can I select variables in the data node of SAS EMiner? To define it more clearly, I would say if I have ~50 variables, how could I select / determine the strongest variables for my model?

By 'data node' I am assuming you mean the Input Data Source node. In general, you would be ill-advised to remove variables from consideration unless you knew they were most likely not suitable for direct use in modeling (e.g. ID information, date/timestamp information, SKU numbers, zip codes, etc.). It is also not necessary to choose variables in this node, since SAS Enterprise Miner provides a wealth of methods to choose variables for your model, such as the following:

* the Variable Selection node provides Regression and Tree-based methods for choosing variables

* the Tree node performs its own variable selection so it does not need prior variable selection

* the Regression node allows you to add possible terms and to perform a set of stepwise methods to perform variable selection

* the Variable Clustering node provides an alternate way of trying to remove variables which have duplicate or highly similar information
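To give a feel for what a regression-style screening does, here is a minimal Python sketch that ranks inputs by their squared correlation with the target. This is only an illustration under my own assumptions, not the algorithm the Variable Selection node runs: the helper names (`r_squared`, `rank_inputs`) and the toy data are made up, and a real run would use your ~50 inputs.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between one input and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return (sxy * sxy) / (sxx * syy)


def rank_inputs(columns, target):
    """Return (name, score) pairs sorted from strongest to weakest input."""
    scores = {name: r_squared(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three candidate inputs and a binary target.
columns = {
    "claim_count": [1, 2, 3, 4, 5, 6, 7, 8],
    "policy_age": [5, 3, 6, 2, 7, 1, 8, 4],
    "constant_flag": [1, 1, 1, 1, 1, 1, 1, 1],
}
target = [0, 0, 0, 0, 1, 1, 1, 1]

ranking = rank_inputs(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A univariate screen like this ignores interactions and non-linear effects, which is one reason the Tree-based and stepwise options listed above are worth trying as well.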

When I run my model package, the log shows "Pr > ChiSq" as 1.0000 for most of the variables, which suggests my variables have some issue. How can I fix it? For example, I have a variable called claim_count, and instead of keeping the values continuous I grouped them into buckets like 1-10, 11-20, etc.

Bucketing manually without any numerical evaluation might actually do more harm than good. SAS Enterprise Miner provides a variety of bucketing algorithms which can take into account the relationship to the target variable.

* the Transform Variables node allows you to create buckets with an optimal relationship to the target variable (a Tree-based method)

* Interactive Grouping allows you to create groups interactively

Please note that bucketing summarizes information and can (possibly) result in a predictor that is less capable than the original data. The buckets, however, do provide the additional capability of helping to model non-linearity which might improve how the variable information can be used. In practice, I recommend including both the original interval variable and the bucketed version prior to variable selection so that the information is used in the best possible way.
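As a rough illustration of why a target-aware cut can beat a hand-picked one, the Python sketch below searches for the single split of an interval variable that minimizes weighted Gini impurity against a binary target, then compares it with a manual cut between the 1-10 and 11-20 buckets. It is a simplified stand-in for a Tree-based method, not what the Transform Variables node actually executes; the function names and the toy claim_count values are assumptions made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def weighted_gini_at(values, target, cut):
    """Weighted impurity of the two buckets produced by splitting at `cut`."""
    left = [t for v, t in zip(values, target) if v <= cut]
    right = [t for v, t in zip(values, target) if v > cut]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(values)


def best_split(values, target):
    """Try every midpoint between distinct sorted values; keep the purest split."""
    distinct = sorted(set(values))
    best_cut, best_impurity = None, gini(target)
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        impurity = weighted_gini_at(values, target, cut)
        if impurity < best_impurity:
            best_cut, best_impurity = cut, impurity
    return best_cut, best_impurity


# Hypothetical toy data: claim counts and a binary target.
claim_count = [1, 3, 5, 8, 12, 15, 18, 25]
target = [0, 0, 0, 1, 1, 1, 1, 1]

cut, impurity = best_split(claim_count, target)
manual = weighted_gini_at(claim_count, target, 10.5)  # boundary of 1-10 vs 11-20

print(f"target-aware cut at {cut}: impurity {impurity:.4f}")
print(f"manual cut at 10.5: impurity {manual:.4f}")
```

On this toy data the manual boundary mixes the two target classes in the lower bucket, while the searched cut separates them cleanly. Real data will rarely be this tidy, but the same comparison shows whether a hand-made bucket is throwing information away.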

Hope this helps!

Doug

SGhosh · Posted 09-15-2017 06:48 PM

This information is really helpful, and I do appreciate it. I changed a few things after reading the response and am getting better results now.

Truly appreciate all your help

Thanks much

Soma
