03-19-2014 12:18 PM
I would like your input on data manipulation before using a model node in SAS EM 7.1:
1) Flooring/capping of variables: Any suggestions on how to perform this flooring or capping? One method would be to cap at p99/p95 and floor at p1/p5. Is one round of this manipulation sufficient before proceeding to successive manipulations?
2) Variable transformation: Right-skewed interval variables get a log transform. How should percent-change or ratio variables that take values on either side of zero be handled — range standardization? With 70 to 80 interval variables, of which perhaps 50 require transformation, the process seems tedious using the formula builder of the Transform Variables node. Is there a way to make it quicker?
3) Variable selection: Is the Variable Selection node generally used after variable transformation?
4) Interactive binning: Grouping interval variables into categories (here done using the Gini statistic) can increase a variable's predictive power. I am in favor of using this, since about 50% of my variables are interval, the rest being ordinal or nominal.
5) Is the process in steps 1 to 4 the correct approach before applying one or more model nodes?
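For concreteness, the flooring/capping in point 1 can be sketched outside of EM. This is a minimal NumPy sketch of percentile winsorization; the `floor_cap` name and the p1/p99 defaults are my own illustration, not an EM setting:

```python
import numpy as np

def floor_cap(x, lower_pct=1.0, upper_pct=99.0):
    """Floor values below the lower percentile and cap values above
    the upper percentile (winsorization). Percentile choices are
    illustrative; p5/p95 would be the more aggressive variant."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

# Example: a heavy right tail gets pulled in at p99
x = np.random.default_rng(0).lognormal(size=1000)
x_capped = floor_cap(x)
```

Note that after one round of capping, the new p99 of the capped variable equals its maximum, so a second round with the same settings changes nothing; that is one argument that a single pass suffices.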
Thanks a lot.
03-20-2014 12:01 PM
I can address some of these questions, but not all in detail.
1.) One way this could be done is with the Replacement node. Set the node's minimum and/or maximum cutoffs and "replace" all values above or below them with your desired number.
2.) I don't have much experience here; range standardization sounds reasonable. As for the 50 variables, you can highlight groups of variables in the variables dialog box (by holding the Ctrl key) and apply an existing or user-defined transformation to all of the highlighted variables. That would be one possible way, for example, to apply a log transform to many variables in a single node. There is probably a way to automate this further, but I'm not sure how.
3.) In the AAEM course we do variable selection AFTER transformation, imputation, etc. If you have a huge number of variables and you know not all will be used in modeling, I can see using variable selection early on to reduce that number. I don't think you can really go "wrong" here, but following the course I am familiar with, I'd suggest doing variable preparation prior to variable selection.
4.) The Transform Variables node has a binning transformation, but I have never used it; you could read up on it in the help documentation. I know the credit scoring node (which is only available with additional licensing) has some very good binning options, but you need that additional node to use it.
5.) Yes, you have the basic "recipe" down. If you are in a credit scoring scenario, I suggest getting feedback from someone who knows that particular area in detail; I do not specialize in credit data.
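On point 2, one commonly used option for variables that straddle zero is a signed log transform, sign(x) * log(1 + |x|), which is monotone and maps 0 to 0. This is my suggestion, not an EM built-in; a minimal sketch, alongside range standardization and a batch-apply helper for many columns:

```python
import numpy as np

def signed_log(x):
    """Log-style transform defined on both sides of zero:
    sign(x) * log(1 + |x|). Monotone, symmetric, maps 0 to 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def range_standardize(x):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def transform_columns(columns, fn):
    """Batch-apply one transform to many columns held in a dict,
    mirroring the idea of highlighting many variables in one node."""
    return {name: fn(values) for name, values in columns.items()}
```

Range standardization after (rather than instead of) a signed log is also reasonable when a model node expects inputs on a comparable scale.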
Hope this helps.
03-22-2014 06:12 AM
Thanks for your response. I will try out a couple of options in the Replacement node to check which one gives better results for flooring/capping certain variables.
Using the number-of-times-the-mean-absolute-deviation approach, I obtained a replaced interval variable that still looked right-skewed but had a spike in its right tail (perhaps because a large number of values were capped). A subsequent log transformation turned this into a left-skewed distribution, with the spike now in the left tail.
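For reference, the number-of-MADs replacement described above can be sketched as follows. I am assuming "mean absolute deviation" here means the average absolute deviation from the mean; `cap_by_mad` and the k=3 default are illustrative names, not EM's:

```python
import numpy as np

def cap_by_mad(x, k=3.0):
    """Cap/floor values more than k mean absolute deviations from
    the mean. With a heavy right tail, many observations pile up
    exactly at the upper cutoff -- the spike described above."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    mad = np.mean(np.abs(x - mu))
    return np.clip(x, mu - k * mad, mu + k * mad)
```

Because every capped observation lands on the same cutoff value, a subsequent log transform relocates that spike rather than removing it, which matches the behavior observed.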
A colleague of mine suggests that after replacing extreme values (flooring/capping), one should go directly to the Interactive Binning node, which groups interval variables into bins such that the Weight of Evidence (WOE) measure is either monotonically increasing or decreasing.
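As a rough illustration of what that suggestion involves, here is a sketch of computing WOE over quantile bins of a variable and checking monotonicity. This is a simplified stand-in for what the Interactive Binning node does, not its actual algorithm; the 0.5 adjustment is a common convention to avoid log of zero:

```python
import numpy as np

def woe_by_bin(x, y, n_bins=5):
    """Split x into quantile bins and compute Weight of Evidence per
    bin: log((share of goods in bin) / (share of bads in bin)).
    y is a 0/1 target, with 1 treated as the 'bad' event."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, len(edges) - 2)
    good_total, bad_total = (y == 0).sum(), (y == 1).sum()
    woe = []
    for b in range(len(edges) - 1):
        in_bin = idx == b
        good = max((y[in_bin] == 0).sum(), 0.5)  # 0.5 avoids log(0)
        bad = max((y[in_bin] == 1).sum(), 0.5)
        woe.append(np.log((good / good_total) / (bad / bad_total)))
    return np.array(woe)

def is_monotone(w):
    """True if the bin-level WOE never changes direction."""
    return bool(np.all(np.diff(w) >= 0) or np.all(np.diff(w) <= 0))
```

An interactive binning tool would then merge adjacent bins until `is_monotone` holds; this sketch only measures, it does not merge.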
Any thoughts/views from you and other community members would be appreciated.
Also, as regards using the Variable Selection node prior to the Transform Variables/Replacement nodes: I could see that out of some 150 total variables, only a handful (<8) were selected, and those were displayed as being "grouped" once again. Would grouping through interactive binning still be necessary?
It seems I will have to try three or four different ways of arranging these nodes in various sequences before I can zero in on one final approach.
03-27-2014 02:42 PM
Sounds like you may be in the credit scoring line of work, but that is a guess. Binning, such as with Weight of Evidence, is popular in that area. I do not have much first-hand experience in credit modeling, but as you have heard, WOE can be a great thing to use; I have often seen binned versions of variables used as inputs in models.
Your last statement is why, in my opinion, prediction is easier than inference. When building a predictive model, we have the benefit of a hold-out data set used for honest assessment. There is no one-size-fits-all recipe for building a model. The best approach is to try out a few ideas, assess them on hold-out data, and see which works best. That is how to settle the order in which actions should be done when modeling.
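That try-a-few-recipes-and-assess workflow can be sketched in miniature. Everything here is illustrative — the toy data, the two "recipes", and the deliberately trivial threshold model — the point is only the shape of the loop: fit each candidate on training data, then compare on a hold-out set:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: one noisy input, binary target
x = rng.normal(size=2000)
y = (x + rng.normal(size=2000) > 0).astype(int)

# Honest split: choose everything on train, compare on hold-out
train, hold = slice(0, 1500), slice(1500, 2000)

def accuracy(pred, actual):
    return float(np.mean(pred == actual))

def recipe_raw(v):
    return v

def recipe_capped(v):
    # Percentile cutoffs fitted on the training slice only
    lo, hi = np.percentile(x[train], [1, 99])
    return np.clip(v, lo, hi)

# A trivially simple "model": threshold at the training mean
for name, prep in [("raw", recipe_raw), ("capped", recipe_capped)]:
    thr = prep(x[train]).mean()
    acc = accuracy((prep(x[hold]) > thr).astype(int), y[hold])
    print(f"{name}: hold-out accuracy = {acc:.3f}")
```

In practice the candidates would be entire node sequences (replacement then binning, versus selection first, etc.) and the model a real EM model node, but the comparison logic is the same.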