Does E Miner take a sample of data when constructing decision trees with large datasets?

EC27556 — Tue, 18 Jan 2022 12:07:47 GMT

I have datasets of more than 1m observations, where the proportion of observations with the target variable "true" ranges from 20% down to 0.1%.

When E Miner is constructing a decision tree, does it consider all 1m observations, or does it take a sample of the data when pruning?

I'm slightly concerned that if E Miner samples the data before pruning, there is a significant chance that splits will be biased if, say, very few of the 0.1% target observations are selected. In many cases where the percentage is very small (often <1%), E Miner cannot produce a tree. Is that possibly because the sample happens to contain none of the 0.1%, for example?
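
To put rough numbers on this concern, here is a minimal Python sketch of how many target observations a simple random sample would be expected to contain, and how likely it is to contain none at all. The sample sizes used are purely hypothetical; nothing here describes how E Miner actually samples, and the binomial formula assumes the sample is small relative to the full dataset.

```python
def expected_events(sample_size: int, event_rate: float) -> float:
    """Expected number of target observations in a simple random sample."""
    return sample_size * event_rate


def prob_no_events(sample_size: int, event_rate: float) -> float:
    """Probability that the sample contains no target observations.

    Binomial approximation: each sampled row is treated as an
    independent draw with probability `event_rate` of being a target.
    """
    return (1.0 - event_rate) ** sample_size


if __name__ == "__main__":
    # Hypothetical sample sizes at a 0.1% target rate.
    for n in (100, 1_000, 10_000):
        print(n, expected_events(n, 0.001), prob_no_events(n, 0.001))
```

Under these assumptions, a sample of a few hundred rows would quite often contain no targets at a 0.1% rate, while a sample in the thousands would almost never contain none, although the targets it does contain would still be few.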

Linked to the above: does anyone know what the optimal ratio of target 'hits' to 'non-hits' is for decision tree analysis? I.e. is it OK if around 10% of your data has a hit for the target variable? I am considering sampling my data before I conduct decision tree analysis, so that about 10% of it has the target variable true and 90% does not.
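
The kind of sampling described here could be sketched in Python as follows: keep every observation where the target is true and randomly keep only as many of the others as are needed to reach the chosen rate. The function name `downsample_non_events`, its parameters, and the toy data are all hypothetical, and the 10% default is only the figure from the question, not a recommended ratio.

```python
import random


def downsample_non_events(rows, is_event, target_rate=0.10, seed=0):
    """Keep all events and a random subset of non-events so that events
    make up roughly `target_rate` of the returned sample."""
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]

    # Number of non-events needed for events to be `target_rate` of the total.
    wanted = round(len(events) * (1.0 - target_rate) / target_rate)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = rng.sample(non_events, min(wanted, len(non_events)))

    sample = events + kept
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    # Toy data: 100 events (1) among 100,000 rows, i.e. a 0.1% target rate.
    data = [1] * 100 + [0] * 99_900
    sample = downsample_non_events(data, is_event=lambda r: r == 1)
    print(len(sample), sum(sample) / len(sample))
```

Note that a model trained on such a sample sees a 10% target rate, so its predicted probabilities reflect the sample rather than the original 0.1% to 20% rates.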

Re: Does E Miner take a sample of data when constructing decision trees with large datasets?

pink_poodle — Sat, 05 Feb 2022 03:22:00 GMT

SAS Enterprise Miner can split the data into training, validation and test datasets. The partition can be user-defined: https://support.sas.com/documentation/onlinedoc/miner/casestudy_59123.pdf.
The sensitivity metric shows how well the model identifies positive cases. If “hit” = true positive and “miss” = false negative, then sensitivity = hits/(hits+misses). A 1:1 hit:miss ratio gives a sensitivity of 0.5; a 2:1 ratio gives about 0.67. A sensitivity between 70% and 100% is considered good.
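
As a small illustration of the formula above, here is a Python sketch that computes sensitivity from counts of hits and misses. The function name `sensitivity` is just a label chosen for this example and is not part of any SAS product.

```python
def sensitivity(hits: int, misses: int) -> float:
    """Sensitivity = hits / (hits + misses), where a hit is a true
    positive and a miss is a false negative."""
    total = hits + misses
    if total == 0:
        raise ValueError("sensitivity is undefined when there are no positive cases")
    return hits / total


if __name__ == "__main__":
    print(sensitivity(1, 1))  # 1:1 hit:miss ratio
    print(sensitivity(2, 1))  # 2:1 hit:miss ratio
```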
