I have datasets of more than 1 million observations, where the proportion of observations with a target variable of "true" ranges from 20% down to 0.1%.
When Enterprise Miner builds a decision tree, does it use all 1 million observations, or does it take a sample of the data when pruning?
I'm slightly concerned that if Enterprise Miner samples the data before pruning, there is a significant chance that the splits will be biased if, say, very few of the 0.1% target observations are selected. In many cases where the percentage is very small (often <1%), Enterprise Miner cannot produce a tree. Could that be because the sample happens not to include any of the 0.1%?
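To put a rough number on that concern, here is a minimal Python sketch of how many target observations a simple random sample would be expected to contain at a 0.1% rate. It says nothing about what Enterprise Miner actually does internally. The function names and the sample sizes are illustrative assumptions, and the zero-event probability uses a binomial approximation (each draw treated as independent).

```python
def expected_events(sample_size: int, event_rate: float) -> float:
    """Expected number of target=true rows in a simple random sample."""
    return sample_size * event_rate


def prob_no_events(sample_size: int, event_rate: float) -> float:
    """Approximate chance the sample has no target=true rows at all.

    Binomial approximation: every sampled row is treated as an
    independent draw with the same event rate.
    """
    return (1.0 - event_rate) ** sample_size


if __name__ == "__main__":
    rate = 0.001  # the 0.1% case from the question
    # Hypothetical sample sizes, chosen only for illustration.
    for n in (1_000, 10_000, 100_000):
        print(n, expected_events(n, rate), prob_no_events(n, rate))
```

Under these assumptions, a sample of a few thousand rows would contain only a handful of target observations, which may be too few for a split to pass a significance or minimum leaf size check.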
Related to the above: does anyone know what the optimal ratio of target 'hits' to 'non-hits' is for decision tree analysis? For example, is it OK if about 10% of the data has a hit for the target variable? I am considering sampling my data before I run the decision tree analysis, so that about 10% of the observations have the target variable true and 90% do not.
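The kind of sampling I have in mind could be sketched as below, in Python rather than in Enterprise Miner itself. This is only a sketch under stated assumptions: it keeps every target observation and randomly drops non-target observations until the requested rate is reached. The function name `undersample_indices` and its parameters are hypothetical, not part of any SAS product.

```python
import random


def undersample_indices(labels, target_rate, seed=0):
    """Return sorted row indices giving roughly `target_rate` events.

    All event rows (truthy labels) are kept; non-event rows are
    randomly sampled without replacement. Hypothetical helper.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")

    events = [i for i, y in enumerate(labels) if y]
    non_events = [i for i, y in enumerate(labels) if not y]

    # Non-events needed so that events / total is about target_rate.
    wanted = round(len(events) * (1.0 - target_rate) / target_rate)
    # Cannot keep more non-events than exist in the data.
    wanted = min(wanted, len(non_events))

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    kept = rng.sample(non_events, wanted)
    return sorted(events + kept)


if __name__ == "__main__":
    # Toy data: 100 events in 100,000 rows, i.e. a 0.1% event rate.
    labels = [1] * 100 + [0] * 99_900
    idx = undersample_indices(labels, target_rate=0.10)
    rate = sum(labels[i] for i in idx) / len(idx)
    print(len(idx), rate)
```

One consequence of sampling this way is that probabilities predicted by a tree trained on the rebalanced data reflect the 10% rate rather than the original rate, so they would need adjusting back before being used as real-world estimates.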