mohammad__101
Fluorite | Level 6

Dear Community Members,

 

If I have a set of 1K+ variables (raw + transformed), will it help to perform data imputation and variable selection to improve the model uplift?

I have a binary prediction model that predicts event occurrence (1/0). I build multiple decision trees and compare them to see which is the best fit.

I change the tree depth and the split criterion for each tree.

Will it also help to convert all my continuous variables into bins (100 bins) for a longer model lifetime going forward, if I have variables that may increase over time (lifetime, for example)?

 

Thanks for your help

1 ACCEPTED SOLUTION

Accepted Solutions
JasonXin
SAS Employee
Hi,

Your title says "uplift". Do you actually mean regular lift performance? Uplift usually refers to the incremental lift of a treatment group over a control group, and lift and uplift call for different imputation and variable selection strategies. I will assume you mean lift.

1. Your title says decision trees (DT). A DT does not require missing value imputation. Also, the imputation decision does not always have to go hand in hand with the variable selection decision, although the two are often made together.

2. When the number of input variables is large, it becomes increasingly important to distinguish variable selection from variable screening. Selection traditionally implies "I have determined the best", whereas screening means "roughly cut out the obviously weakest variables". To do the screening, set the entry level/significance test values generously in the DT node. It is not unusual to repeat several rounds of screening. Watch how many and which variables are left after each round. Generally speaking, you should have more confidence in throwing out the obviously bad variables than in picking the final elite.

3. Apparently you have settled on using transformed variables as the starting point of your question. I am not going to question that in this thread. If at all possible, though, consider running the screening BEFORE transformation. The earlier the screening stage and the rawer the variables, the better. Note that this applies in the context of decision trees.

4. If you are running regressions, the strategy may be very different.

5. If you use EM's DT node to screen or select variables, make sure to use the "split-based approach" first. That is, turn on variable importance, but do NOT set "observation-based approach = YES".

Regarding binning: in the context of decision trees, binning generally makes the choice of split points on continuous variables less free (and less optimal). But since you said you bin them into 100 bins, you may still be OK in this regard.
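To make the binning point concrete, here is a minimal sketch in plain Python (not SAS code, and not how EM's DT node searches for splits). It assumes a Gini split criterion, equal-frequency bins, and synthetic data with a made-up cutoff at 0.3731. All function names are hypothetical. It compares the best single split a tree could find on the raw variable with the best one available after binning to 100 and to 10 bins.

```python
import random

def gini(pos, n):
    """Gini impurity of a node holding `pos` events out of `n` rows."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def best_split_gain(x, y, thresholds):
    """Largest Gini reduction over the candidate rules `x <= t`."""
    n, pos = len(y), sum(y)
    parent = gini(pos, n)
    best = 0.0
    for t in thresholds:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        nl, pl = len(left), sum(left)
        nr, pr = n - nl, pos - pl
        child = (nl * gini(pl, nl) + nr * gini(pr, nr)) / n
        best = max(best, parent - child)
    return best

def raw_thresholds(x):
    """Every midpoint between consecutive distinct values (unbinned input)."""
    v = sorted(set(x))
    return [(a + b) / 2.0 for a, b in zip(v, v[1:])]

def bin_edges(x, k):
    """Upper edges of the first k-1 equal-frequency bins (assumes k <= len(x))."""
    xs = sorted(x)
    n = len(xs)
    return sorted({xs[i * n // k - 1] for i in range(1, k)})

# Synthetic data (an assumption for illustration): the event rate jumps at 0.3731.
random.seed(0)
n = 1000
x = [random.random() for _ in range(n)]
y = [1 if random.random() < (0.8 if xi > 0.3731 else 0.2) else 0 for xi in x]

raw_t = raw_thresholds(x)
gain_raw = best_split_gain(x, y, raw_t)
gain_100 = best_split_gain(x, y, bin_edges(x, 100))
gain_10 = best_split_gain(x, y, bin_edges(x, 10))

print("candidate splits  raw:", len(raw_t), " 100 bins: 99  10 bins: 9")
print("best Gini gain    raw:", gain_raw, " 100 bins:", gain_100, " 10 bins:", gain_10)
```

Because every bin edge is also a possible cut on the raw variable, the binned gain can never exceed the raw gain. With 100 bins the tree still has 99 candidate cuts per variable, which is why the loss is usually small. Coarser binning leaves fewer candidates and can cost more.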
The general rule in using DT models is to interfere with the raw inputs as little as possible: no imputation, no transformation, and no binning unless there is a hard reason to do otherwise. Hope this helps. Best regards, Jason Xin

mohammad__101
Fluorite | Level 6

Thanks, Jason, for the generous information. You have covered all my doubts and enquiries.

Have a nice day!

Mohammed ElSofany
