Hi All,
I am working on a classification problem on an imbalanced dataset (binary target with Y:N = approx 1%:99%). I have to figure out the "English rules" that differentiate the target from the non-target records. To accomplish this, I am using the following 2 approaches:
1) Input Data --> Filter (to remove irrelevant records) --> Sample (20:80 == Y:N) --> Data Partition --> Decision Tree --> Cut off node. Adjusted Priors (to counter the effect of balanced sampling) have been used.
2) Input Data --> Filter (to remove irrelevant records) --> Data Partition --> Start Group Node --> Decision Tree --> End Group node. I am using the boosting option from the start group node properties. I get fairly good results from this approach, even though I have not used adjusted priors or decision weights.
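For readers unfamiliar with why adjusted priors are needed after balanced sampling in approach 1, here is a minimal sketch of the standard prior-correction arithmetic. It is written in Python outside Enterprise Miner purely as an illustration; the function name `adjust_posterior` and the example numbers are assumptions, not the node's actual implementation.

```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale P(Y) estimated on an oversampled sample to the population prior.

    p_sample     : posterior for Y estimated on the sample (e.g. a leaf's Y%)
    sample_prior : proportion of Y in the sample (0.20 for a 20:80 sample)
    true_prior   : proportion of Y in the population (about 0.01 here)
    """
    # Weight each class by the ratio of its true prior to its sample prior.
    num = p_sample * true_prior / sample_prior
    den = num + (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return num / den


# A leaf that is 50% Y in the 20:80 sample is far less pure at 1:99.
print(round(adjust_posterior(0.50, sample_prior=0.20, true_prior=0.01), 4))
```

A leaf whose Y rate equals the sample prior (20%) maps back to exactly the population prior (1%), which is a quick sanity check on the formula.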
Questions on Approach 1:
In the first approach, the cut off node was run twice (once at the default 0.5 value and a second time at a lower threshold). The decision tree results didn't change due to this.
a) So does it mean that the new threshold can be applied for scoring purposes only?
b) Can the rest of the records from the original dataset, which haven't made it into the sample used for creating the tree, still be used for "scoring"?
c) How can I extract "English rules" from this output, and should I be referring to the sample dataset for the same?
Questions on Approach 2:
It may sound silly, but I would like to ask whether there is really a need to use a cut off node when I am using the boosting procedure.
Appreciate your inputs on these questions.
Regards,
Aditya.
Hi Ajosh, I have included responses inline from SAS R&D, thanks for your questions.
Questions on Approach 1:
In the first approach, the cut off node was run twice (once at the default 0.5 value and a second time at a lower threshold). The decision tree results didn't change due to this.
a) Yes, it will create a new EM_CUTOFF output variable, which is another version of the INTO variable based on the cutoff value.
b) No, all the nodes after the sampling node will use the sample data. If you would like to score the whole data, including the non-sampled data, you can use the score node with the whole data as the score input data.
c) The English rules can be found in the results of the tree node. Yes.
Questions on Approach 2:
It may sound silly, but would like to ask if there is really a need to use a cut off node when I am using the boosting procedure?
The cutoff node will not do anything for boosting in your flow.
Hi Jonathan,
Let me take a couple of steps back and raise some new questions I have encountered using the cut off node. The background of the study is still the same: find English rules/patterns that differentiate the target from the non-target records, where the target records are far fewer than the non-target ones. Their proportions are approx 1%:99%.
The refined approach is as follows: Input Data --> Sample --> Data Partition --> Decision Tree --> Cut Off
Following are the settings of each of the nodes:
1) Input Data: Adjusted Priors and Decision Weights are being used. Adjusted Priors are kept the same as the Original Priors, and Decision Weights are used to assign the relative importance of each outcome (TP, FP, TN, FN).
2) Sample Node: All target = Y records are kept (10%), and 9 times the number of targets are chosen randomly from the non-targets (90%). The proportion of target to non-target is now 10%:90% in the sample.
3) Decision Tree Node: Default settings from the property panel are used, except that "Use Decisions" (in Split Search) is set to Yes.
4) Cut off node: The cut off node is run once to obtain the model diagnostics table (the one that has counts of TP/FP/TN, rates, etc. for cut offs from 0.99 to 0.0 in increments of 0.1). Code was inserted to additionally calculate an average profit column and determine the new cut off threshold. The average profit was maximized at a threshold of 0.06. However, this happened only for one iteration with a specific random seed in the sample node. For other seeds, the tree didn't grow at all.
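The average-profit calculation described in step 4 can be sketched roughly as follows. This is a Python illustration rather than the code actually inserted in the flow; the function names, the toy scores and the profit weights are all hypothetical.

```python
def confusion_counts(probs, labels, cutoff):
    """Count TP, FP, TN, FN when predicting Y for posteriors above the cutoff."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_y = p > cutoff
        if predicted_y and y:
            tp += 1
        elif predicted_y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def average_profit(counts, weights):
    """Average profit per record for one row of the diagnostics table."""
    tp, fp, tn, fn = counts
    total = tp + fp + tn + fn
    profit = (tp * weights["TP"] + fp * weights["FP"]
              + tn * weights["TN"] + fn * weights["FN"])
    return profit / total


def best_cutoff(probs, labels, weights, cutoffs):
    """Return the candidate cutoff with the highest average profit."""
    return max(
        cutoffs,
        key=lambda c: average_profit(confusion_counts(probs, labels, c), weights),
    )


# Hypothetical posteriors, true labels (1 = Y) and decision weights.
probs = [0.90, 0.10, 0.05, 0.02, 0.30]
labels = [1, 1, 0, 0, 0]
weights = {"TP": 10, "FP": -1, "TN": 0, "FN": -5}
print(best_cutoff(probs, labels, weights, [0.5, 0.2, 0.06]))
```

With a rare target and a large reward for true positives relative to the cost of false positives, the profit-maximizing cutoff tends to sit well below 0.5, which is consistent with the low threshold reported above.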
Questions:
1) Input Data: Is keeping the adjusted priors the same as the original priors, together with decision weights, justified if I am deriving patterns from a more balanced sample rather than from the original population? A research paper by Tamara Slipchenko (titled "Case Study: Development of an HIV casefinding algorithm with SAS® Enterprise Miner™") says we need to use both the adjusted priors and decision weights.
2) Sample Node: Is it necessary to use a balanced sample during iterations? When I selected a balanced sample, the tree hardly grew to 2 or 3 levels, as against 6 levels when the proportion of target to non-target was 10%:90%.
3) Decision Tree Node: The SAS EMiner Help menu doesn't really explain the terms "Use Decisions" and "Use Priors" in "Split Search". Could you let me know the usage of these options under standard scenarios?
4) Cut off node: I chose the new cut off of 0.06 from the updated model diagnostics table, as explained in the earlier paragraph on the cut off node. When I run the cut off node with the new threshold, the tree structure doesn't change at all (this is as per your earlier response). However, I find instances of 2 leaf nodes from the same parent node where one has Y% = 30 and the other 65%. If I use the new cut off to include additional rules, both of these leaf nodes give the same decision = Y. So in that case, would it mean that the variable used for the split loses its significance, or should I be considering the parent node instead? Even in the latter case, this can again create a conflict with another leaf node.
Any help from you and other experienced users would be highly appreciated. Likewise, please suggest any alternative process flow by which I can make my results more generic, and not only based on a selected sample of all targets and undersampled non-targets. I have tried other techniques like Rule Induction, Gradient Boosting, etc., but the results were not encouraging (zero counts for TP and FP).
Regards,
Aditya.
Sorry, it seems I made a mistake in the Input Data node specifications in the third paragraph, and I would like to correct it. The adjusted priors were not used at all in the only iteration that gave me a result, whereas the research paper states that we need to use both adjusted priors and decision weights, as highlighted in the relevant question on Input Data. I await inputs from Jonathan and others as well.
Thanks,
Aditya.
Thanks for the additional questions. I will discuss with R&D and get back to you. It may not be until 2014.
Thanks,
Jonathan
Hi Aditya, I discussed with R&D and they had some thoughts on question 3. For the other questions, I would recommend contacting SAS Tech Support at support.sas.com. They can provide you 1:1 support for your questions. It would likely be more beneficial and efficient to meet your needs. I would ask them for the EM Procedure Documentation which is available upon request.
The following proc options in the Proc Arbor doc explain the two options:
“DECSEARCH” for "Use Decisions"
“PRIORSSEARCH” for "Use Priors" in "Split Search"
Thanks,
Jonathan
And...just as I wrote that I received more info from R&D:
The proc option “DECSEARCH” is for "Use Decisions".
The proc option “PRIORSSEARCH” is for "Use Priors" in "Split Search".
For example, if your new cutoff is 0.06, a piece of cutoff score code will be added to the end of the previous score code:
IF P_good_badgood > 0.06 THEN EM_CUTOFF = 1;
ELSE EM_CUTOFF = 0;
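To see why the tree itself does not change when the cutoff changes, here is a small Python sketch of the same logic as the score code above. The function name `apply_cutoff` is hypothetical, and the leaf posteriors are illustrative values borrowed from the 30% and 65% sibling leaves mentioned earlier in the thread, plus one assumed low-probability leaf.

```python
def apply_cutoff(posteriors, cutoff):
    """Mirror the score code: flag 1 when the posterior exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in posteriors]


# Posterior P(Y) of three leaves; the tree and these values stay fixed.
leaf_posteriors = [0.65, 0.30, 0.04]

print(apply_cutoff(leaf_posteriors, 0.5))   # [1, 0, 0]
print(apply_cutoff(leaf_posteriors, 0.06))  # [1, 1, 0]
```

Only the derived flag changes between the two runs. The posteriors, and therefore the splits that produced them, are untouched, so two sibling leaves can share a decision while still carrying different probabilities.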