EMINER Decision Tree Analysis when only a SMALL proportion of dataset has the target variable

EC27556 · Posted 01-20-2022 02:17 PM

I have datasets with 1 million observations and a mixture of variable types (e.g. categorical, interval). Decision trees work great on some of these datasets, namely those where a larger proportion of the data has the target variable "true".

For example, my target variable is binary - 1 for true and 0 for false.

In some cases, as few as 0.2% of cases have the target as true. When running DTs for these datasets, EMiner will not attempt to prune.

How do I get around this issue? I want to find the splits that hold for the whole dataset. If I sample 10,000 observations, where 10% have the target true and 90% don't, I will find a split, but it will be biased toward my biased 10,000 sample. That is, I want to be able to say that 100% of people in my 1 million observations have the target true if, for example, they are blonde and have size 3 feet.

Is it simply not possible to use decision trees when you have such a small proportion of data that have the target variable?

sbxkoenk · Posted 01-22-2022 01:23 PM

Hello,

I hope all your observations have the target variable, but not all your observations have the target event. 😉

(Single) Decision trees might not be the best choice for modelling rare events.
But it can be done.

You need to oversample the rare event or under-sample the non-event, and then you need to use the Enterprise Miner Target Profiler such that the algorithm knows about the difference between the sample priors and the real priors.
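As a rough illustration of the under-sampling step outside Enterprise Miner, here is a minimal Python sketch. The function name, the row layout and the toy data are hypothetical and only serve the example. It keeps every event, draws a random subset of non-events, and records both the sample prior and the real prior so that they can be supplied to the model later.

```python
import random


def undersample_nonevents(rows, target_key, nonevents_per_event, seed=0):
    """Keep every event row and a random subset of the non-event rows."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target_key] == 1]
    nonevents = [r for r in rows if r[target_key] == 0]

    # Never ask for more non-events than actually exist.
    n_keep = min(len(nonevents), nonevents_per_event * len(events))
    sample = events + rng.sample(nonevents, n_keep)
    rng.shuffle(sample)

    # Event proportions before and after sampling.
    priors = {
        "population": len(events) / len(rows),
        "sample": len(events) / len(sample),
    }
    return sample, priors


# Hypothetical toy data: 10,000 rows with a 1% event rate.
rows = [{"id": i, "target": 1 if i % 100 == 0 else 0} for i in range(10_000)]
sample, priors = undersample_nonevents(rows, "target", nonevents_per_event=9)
print(len(sample), priors)
```

With 9 non-events kept per event, the event rate in the sample rises from 1% to 10%, which is exactly the gap the priors have to describe.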

The priors are used, for example, to adjust the posterior probabilities estimated on the sample so that they reflect the real priors.
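A minimal sketch of such an adjustment for a binary target, using the usual Bayes-rule correction: each class probability is rescaled by the ratio of real prior to sample prior and the result is renormalised. The function name is hypothetical, and whether Enterprise Miner performs the computation in exactly this form is an assumption, not something stated in the documentation linked below.

```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale an event probability estimated on a resampled dataset.

    p_sample     -- posterior event probability from the model (sample scale)
    sample_prior -- event proportion in the resampled training data
    true_prior   -- event proportion in the full population
    """
    event = p_sample * true_prior / sample_prior
    nonevent = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return event / (event + nonevent)


# A leaf that looks like a 50/50 split on a 10% sample corresponds to
# roughly 8.3% on a population with a 1% event rate.
print(round(adjust_posterior(0.50, sample_prior=0.10, true_prior=0.01), 4))
```

A node whose sample probability equals the sample prior maps back to the real prior, so a 10% node in a 10% sample becomes 1%.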

See here:
SAS® Enterprise Miner™ 15.2 Reference Help
Enterprise Miner Target Profiler
https://go.documentation.sas.com/doc/en/emref/15.2/n0z1mtvsscypjqn1ediv223jq5iy.htm

Good luck,

Koen

EC27556 · Posted 01-25-2022 12:08 PM

OK, thanks. So the order of nodes would be: data source, sample, target profiler, tree?

And how would the resulting tree look then?

Say I had 1m observations in total and 10k had the event true (1 in 100).

If I sampled so that I had 90k where the event wasn't true (instead of 990k) and 10k where the event was true, how would the tree look? Would the first node of the tree show 1 = 1% or 10%? Obviously I would like it to show 1%, as that is the event proportion for the whole population.
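The arithmetic behind the two candidate figures can be checked with a short Python sketch using the counts above. It only illustrates where 10% and 1% come from; it makes no claim about which of the two Enterprise Miner displays in the root node, since that depends on how the priors are applied in the tool. The variable names are hypothetical.

```python
# Counts taken from the question above.
events_total, nonevents_total = 10_000, 990_000
events_sampled, nonevents_sampled = 10_000, 90_000

# Event proportion as it appears in the resampled data.
raw_rate = events_sampled / (events_sampled + nonevents_sampled)

# Weight each sampled row by how many population rows it stands for.
w_event = events_total / events_sampled            # 1.0
w_nonevent = nonevents_total / nonevents_sampled   # 11.0

weighted_events = events_sampled * w_event
weighted_nonevents = nonevents_sampled * w_nonevent
weighted_rate = weighted_events / (weighted_events + weighted_nonevents)

print(f"sample proportion:   {raw_rate:.0%}")       # 10%
print(f"weighted proportion: {weighted_rate:.0%}")  # 1%
```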
