EC27556
Quartz | Level 8

I have datasets with 1 million observations and a mixture of variable types (categorical, interval, etc.). Some datasets work great with decision trees, namely those where a larger proportion of the data has the target variable "true".

 

For example, my target variable is binary - 1 for true and 0 for false.

 

In some datasets, as few as 0.2% of cases have the target as true. When I run decision trees on these datasets, Enterprise Miner will not attempt to prune.

 

How do I get around this issue? I want to find the splits that hold for the whole dataset. If I sample 10,000 observations, where 10% have the true target variable and 90% don't, I will find a split, but it will be biased toward my biased 10,000 sample. That is, I want to be able to say that 100% of people in my 1m have the target variable true if they are blonde and have size 3 feet, etc.

 

Is it simply not possible to use decision trees when you have such a small proportion of data that have the target variable?

sbxkoenk
SAS Super FREQ

Hello,

 

I hope all your observations have the target variable, but not all your observations have the target event😉

 

(Single) Decision trees might not be the best choice for modelling rare events.
But it can be done.

 

You need to oversample the rare event or under-sample the non-event, and then use the Enterprise Miner Target Profiler so that the algorithm knows the difference between the sample priors and the real priors.

The priors are used, for example, to adjust the posterior probabilities so that they reflect the real priors.
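To make the adjustment concrete, here is a minimal Python sketch of the standard prior-correction formula for a binary target. It is an illustration only, not Enterprise Miner code, and whether Enterprise Miner computes the adjustment in exactly this way is an assumption. The function name `adjust_posterior` is hypothetical, and the priors (0.2% real, 10% after oversampling) are borrowed from the question purely as example values.

```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale an event probability estimated on an oversampled
    training set so that it reflects the real event prior.

    p_sample     -- posterior event probability from the sampled data
    sample_prior -- event proportion in the (oversampled) training data
    true_prior   -- event proportion in the full population
    """
    # Re-weight each class by the ratio of its real prior to its sample prior.
    event_term = p_sample * true_prior / sample_prior
    non_event_term = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    # Normalise so that the two class probabilities sum to 1 again.
    return event_term / (event_term + non_event_term)


# Example values only (assumed for illustration).
TRUE_PRIOR = 0.002    # 0.2% of the population has the event
SAMPLE_PRIOR = 0.10   # 10% of the oversampled training data has the event

for p in (0.10, 0.50, 0.90):
    adjusted = adjust_posterior(p, SAMPLE_PRIOR, TRUE_PRIOR)
    print(f"sample posterior {p:.2f} -> adjusted posterior {adjusted:.4f}")
```

A node whose sample posterior equals the sample prior maps back to the real prior, which is the behaviour one would want at the root of the tree.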

See here:
SAS® Enterprise Miner™ 15.2 Reference Help
Enterprise Miner Target Profiler
https://go.documentation.sas.com/doc/en/emref/15.2/n0z1mtvsscypjqn1ediv223jq5iy.htm

 

Good luck,

Koen

EC27556
Quartz | Level 8

OK, thanks. So the order of nodes would be: data source, sample, target profiler, tree?

 

And how would the resulting tree look then?

 

Say I had 1m observations in total and 10k had the event true (1 in 100).

 

If I sampled so that I had 90k where the event wasn't true (instead of 990k) and 10k where the event was true, how would the tree look? Would the first node of the tree show 1=1% or 10%? Obviously I would like it to show 1%, as that is the event proportion for the whole population.
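The arithmetic behind this question can be sketched in Python by weighting each sampled observation by the number of population observations it stands for. This is only an illustration under the counts given above. It does not show what Enterprise Miner actually displays in the tree, which depends on how the priors are set up. The function name `population_event_rate` and the leaf counts (500 events, 500 non-events) are hypothetical.

```python
def population_event_rate(events, non_events, event_weight, non_event_weight):
    """Event proportion in a node after weighting the sampled counts
    back up to the population they represent."""
    weighted_events = events * event_weight
    weighted_non_events = non_events * non_event_weight
    return weighted_events / (weighted_events + weighted_non_events)


# Counts from the question: 1m observations, 10k events.
POP_EVENTS, POP_NON_EVENTS = 10_000, 990_000
# Sample from the question: all 10k events kept, 90k non-events kept.
SAMPLE_EVENTS, SAMPLE_NON_EVENTS = 10_000, 90_000

# Each sampled observation represents this many population observations.
event_weight = POP_EVENTS / SAMPLE_EVENTS              # 1.0
non_event_weight = POP_NON_EVENTS / SAMPLE_NON_EVENTS  # 11.0

# Root node: raw proportion in the sample versus population-weighted proportion.
root_sample_rate = SAMPLE_EVENTS / (SAMPLE_EVENTS + SAMPLE_NON_EVENTS)
root_population_rate = population_event_rate(
    SAMPLE_EVENTS, SAMPLE_NON_EVENTS, event_weight, non_event_weight
)
print(f"root, sample proportion:     {root_sample_rate:.2%}")
print(f"root, population proportion: {root_population_rate:.2%}")

# Hypothetical leaf (counts assumed for illustration only).
leaf_population_rate = population_event_rate(500, 500, event_weight, non_event_weight)
print(f"leaf, population proportion: {leaf_population_rate:.2%}")
```

Unadjusted, the root reflects the 10% in the sample; once the counts are weighted back to the population, it reflects the 1% in the full data.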
