I have datasets with 1 million observations and a mixture of variable types (e.g. categorical, interval, etc.). Decision trees work great on some of these datasets, namely the ones where a larger proportion of the observations have the target variable set to "true".
My target variable is binary: 1 for true and 0 for false.
In some datasets, as few as 0.2% of observations have the target as true. When I run decision trees on these datasets, EMiner will not attempt to prune.
How do I get around this issue? I want to find the splits that hold for the whole dataset. If I draw a sample of 10,000 in which 10% have the target as true and 90% don't, I will find a split, but it will be biased toward that oversampled 10,000. In other words, I want to be able to say that 100% of the people in my 1 million observations have the target as true if they are blonde and have size 3 feet, etc.
Is it simply not possible to use decision trees when such a small proportion of the data has the target variable set to true?
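To make the sampling concern concrete, here is a rough sketch in Python (not EMiner code) of how a target rate seen in a leaf of an oversampled tree could be mapped back to the scale of the full data. It uses the standard prior-probability reweighting. The function name `population_rate` and the 60% leaf are hypothetical and only for illustration. The 0.2% and 10% figures are the ones from my description above.

```python
def population_rate(sample_rate, sample_prior, population_prior):
    """Rescale a leaf's target rate from an oversampled sample to the population.

    sample_rate: share of "true" cases in the leaf, measured on the sample
    sample_prior: share of "true" cases in the whole sample (e.g. 0.10)
    population_prior: share of "true" cases in the full data (e.g. 0.002)
    """
    # Weight each class by how over- or under-represented it is in the sample.
    w_true = population_prior / sample_prior
    w_false = (1 - population_prior) / (1 - sample_prior)
    weighted_true = sample_rate * w_true
    weighted_false = (1 - sample_rate) * w_false
    return weighted_true / (weighted_true + weighted_false)


# Hypothetical leaf: 60% of the sampled cases in it have the target as true.
adjusted = population_rate(0.60, sample_prior=0.10, population_prior=0.002)
print(f"Estimated rate in the full data: {adjusted:.4f}")
```

Under these assumptions a leaf that looks strongly "true" in the 10% sample corresponds to a much smaller rate in the 1 million observations, which is the bias I am worried about. Only a leaf that is 100% true in the sample stays at 100% after the adjustment, and even that is just an estimate from 10,000 rows.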