🔒 This topic is solved and locked.
ajosh
Calcite | Level 5

Hi All,

I am working on a classification problem on an imbalanced dataset (binary target with Y:N = approx. 1%:99%). I have to figure out the "English rules" that differentiate the target from the non-target records. To accomplish this, I am using the following two approaches:

1) Input Data --> Filter (to remove irrelevant records) --> Sample (20:80 = Y:N) --> Data Partition --> Decision Tree --> Cutoff node. Adjusted priors (to counter the effect of the balanced sampling) have been used. A stand-alone sketch of this sampling step is shown after these two approaches.

2) Input Data --> Filter (to remove irrelevant records) --> Data Partition --> Start Groups node --> Decision Tree --> End Groups node. I am using the boosting option from the Start Groups node properties. I get fairly good results from this approach, even though I am not using adjusted priors and decision weights.
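For reference, a minimal stand-alone sketch of the stratified sampling step that the Sample node performs in approach 1 (keep all events, undersample non-events to reach roughly 20:80); the library, dataset, and variable names (mylib.cases, target) and the counts are hypothetical and need to be adjusted to the actual data:

/* Split events from non-events (hypothetical dataset and target variable) */
data work.events work.nonevents;
    set mylib.cases;
    if target = 'Y' then output work.events;
    else output work.nonevents;
run;

/* Draw a simple random sample of non-events, roughly 4x the event count for 20:80 */
proc surveyselect data=work.nonevents out=work.nonevent_sample
                  method=srs sampsize=8000 seed=12345;
run;

/* Stack the two pieces into the modeling sample used by the downstream nodes */
data work.train_sample;
    set work.events work.nonevent_sample;
run;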

Questions on Approach 1:

In the first approach, the Cutoff node was run twice (once at the default 0.5 value and once at a lower threshold). The decision tree results didn't change as a result.

a) So does it mean that the new threshold can be applied only for scoring purposes?

b) Can the rest of the records from the original dataset, which haven't made it into the sample used for creating the tree, still be used for "scoring"?

c) How can I extract "English rules" from this output, and should I be referring to the sample dataset for the same?

Questions on Approach 2:

It may sound silly, but I would like to ask whether there is really a need to use a Cutoff node when I am using the boosting procedure.

Appreciate your inputs on these questions.

Regards,

Aditya.


6 REPLIES
jwexler
SAS Employee

Hi Ajosh, I have included responses from SAS R&D inline. Thanks for your questions.

Questions on Approach 1:

In the first approach, the Cutoff node was run twice (once at the default 0.5 value and once at a lower threshold). The decision tree results didn't change as a result.

a) So does it mean that the new threshold can be applied only for scoring purposes?

Yes, it will create a new EM_CUTOFF output variable, which is another version of the INTO variable based on the cutoff value.

b) Can the rest of the records from the original dataset, which haven't made it into the sample used for creating the tree, still be used for "scoring"?

No, all of the nodes after the Sample node use the sampled data. If you would like to score the whole data set, including the non-sampled records, you can use the Score node with the whole data set as a score input data source.
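As an illustration of that suggestion, once the flow's score code has been exported (for example from the Score node results), it can be applied to the full table in a plain DATA step; the file and dataset names below (score_code.sas, mylib.full_population) are hypothetical:

data work.scored_all;
    set mylib.full_population;       /* every record, not just the modeling sample */
    %include 'score_code.sas';       /* DATA step score code exported from the flow */
run;

Since the Cutoff node appends its threshold logic to the end of the flow's score code (see the example near the bottom of this thread), the lowered cutoff should carry over to the non-sampled records as well.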

c) How can I extract "English rules" from this output, and should I be referring to the sample dataset for the same?

The English rules can be found in the results of the Decision Tree node. Yes, you should refer to the sample dataset.
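For what it is worth, a leaf's English rule translates directly into score-code style logic; the variable names and split values in this sketch are made up purely for illustration:

/* Hypothetical leaf rule from the tree results rewritten as DATA step logic */
if num_prior_claims >= 2 and policy_age_months < 6 then leaf_decision = 'Y';
else leaf_decision = 'N';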

Questions on Approach 2:

It may sound silly, but I would like to ask whether there is really a need to use a Cutoff node when I am using the boosting procedure.

The cutoff node will not do anything for boosting in your flow.

ajosh
Calcite | Level 5

Hi Jonathan,

Let me take a couple of steps back and raise some new questions I have encountered while using the Cutoff node. The background of the study is still the same: find English rules/patterns which differentiate the target records from the non-target records, where the target records are far too few compared to the non-target ones. Their proportions are approximately 1%:99%.

The refined approach is as follows: Input Data --> Sample --> Data Partition --> Decision Tree --> Cut Off

Following are the settings of each of the nodes:

1) Input Data: Adjusted priors and decision weights are being used. The adjusted priors are kept the same as the original priors, and the decision weights are used to assign the relative importance of each outcome (TP, FP, TN, FN).

2) Sample Node: All target = Y records are kept (10%) and 9 times that number of non-target records are chosen randomly (90%). The proportion of target to non-target is now 10%:90% in the sample.

3) Decision Tree Node: Default settings from the property panel are used, except that "Use Decisions" (in Split Search) is set to Yes.

4) Cutoff node: The Cutoff node is run once to obtain the model diagnostics table (the one with counts of TP/FP/TN, rates, etc. for cutoffs from 0.99 down to 0.0 in steps of 0.01). Code was inserted to additionally calculate an average profit column and to determine the new cutoff threshold (a sketch of this step follows after this list). The average profit was maximized at a threshold of 0.06. However, this happened only for one iteration with a specific random seed in the Sample node; for other seeds, the tree didn't grow at all.
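As the forward reference in item 4 above indicates, here is a minimal sketch of the kind of code that adds an average profit column to the cutoff diagnostics table; the input table, the count variable names, and the profit/loss weights are hypothetical and must match your own decision matrix:

/* Hypothetical: compute average profit per candidate cutoff and rank the rows */
data work.diagnostics_profit;
    set work.cutoff_diagnostics;                 /* one row per candidate cutoff */
    total = tp_count + fp_count + tn_count + fn_count;
    /* placeholder weights standing in for the decision matrix entries */
    avg_profit = (tp_count*10 + fp_count*(-1) + tn_count*0 + fn_count*(-5)) / total;
run;

proc sort data=work.diagnostics_profit;
    by descending avg_profit;                    /* first row = profit-maximizing cutoff */
run;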

Questions:

1) Input Data: Is the use of adjusted priors equal to the original priors, together with decision weights, justified if I am deriving patterns from a more balanced sample rather than from the original population? A research paper by Tamara Slipchenko (titled "Case Study: Development of an HIV casefinding algorithm with SAS® Enterprise Miner™") says we need to use both adjusted priors and decision weights.

2) Sample Node: Is it necessary to use a balanced sample during iterations? When I selected a balanced sample, the tree hardly grew beyond 2 to 3 levels, as opposed to 6 levels when the proportion of target to non-target was 10%:90%.

3) Decision Tree Node: The SAS Enterprise Miner Help menu doesn't really explain the terms "Use Decisions" and "Use Priors" in "Split Search". Could you let me know the usage of these options under standard scenarios?

4) Cutoff node: I chose the new cutoff of 0.06 from the updated model diagnostics table, as explained in the earlier paragraphs about the Cutoff node. When I run the Cutoff node with the new threshold, the pattern/structure of the tree doesn't change at all (this is as per your earlier response). However, I find instances of 2 leaf nodes (from the same parent node) where one has Y% = 30 and the other 65%; if I use the new cutoff to include additional rules, both of these leaf nodes give the same decision = Y. So, in that case, would it mean that the variable used for the split loses its significance, or should I be considering the parent node instead? Even in the latter case, this can again create a conflict with another leaf node.

Any help from you and other experienced users would be highly appreciated. Likewise, do suggest any alternative process flow by which I can make my results more generic and not based only on a selected sample of all targets and undersampled non-targets. I have tried other techniques like Rule Induction and Gradient Boosting, but the results were not encouraging (zero counts in TP and FP).

Regards,

Aditya.

ajosh
Calcite | Level 5

Sorry, but it seems I made a typo in the third paragraph's Input Data node specifications, and I would like to correct it: the adjusted priors were not used at all in the only iteration that gave me a result, whereas the research paper states that we need to use both adjusted priors and decision weights, as highlighted in the relevant question about the Input Data node. I await inputs from Jonathan and others as well.

Thanks,

Aditya.

jwexler
SAS Employee

Thanks for the additional questions. I will discuss with R&D and get back to you. It may not be until 2014 :)

Thanks,

Jonathan

jwexler
SAS Employee

Hi Aditya, I discussed with R&D and they had some thoughts on question 3. For the other questions, I would recommend contacting SAS Tech Support at support.sas.com; they can provide you 1:1 support, which would likely be a more beneficial and efficient way to meet your needs. I would ask them for the EM procedure documentation, which is available upon request.


The following PROC ARBOR options in the procedure documentation explain the two settings:

"DECSEARCH" for "Use Decisions"

"PRIORSSEARCH" for "Use Priors" in "Split Search"

Thanks,

Jonathan

jwexler
SAS Employee

And... just as I wrote that, I received more info from R&D:

1. You need to use the "Decisions" node (not the Decision Tree node) after the Sample node; there you can specify the adjusted priors as your original (before-sampling) priors while keeping the data priors from the oversampling. If you don't use the Decisions node and you specify your adjusted priors as the original priors in the Input Data Source node, there will not be any adjustment of the predicted probabilities by the priors, because the ratio is always 1 (a sketch of this prior adjustment follows after this list). The decision matrix is related to calculating profit and loss; it is applied separately, after the prior adjustment.

2. For rare-event modeling, oversampling is usually required, but it is not necessary to make the sample balanced. However, it depends on your data and analysis.

3. Take a look at the PROC ARBOR procedure documentation; it has the details.

   The proc option "DECSEARCH" is for "Use Decisions".

   The proc option "PRIORSSEARCH" is for "Use Priors" in "Split Search".

4. The Cutoff node will not have an impact on the Decision Tree node itself. The Cutoff node just creates the EM_CUTOFF variable, which is the classification variable resulting from the new cutoff value.

For example, if your new cutoff is 0.06, a piece of cutoff score code like the following will be added to the end of the previous score code:

IF P_good_badgood > 0.06  THEN EM_CUTOFF = 1;

ELSE EM_CUTOFF = 0;
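Referring back to item 1 above, here is a minimal sketch (not SAS R&D's code) of the standard prior correction applied to a posterior probability estimated on an oversampled training set; the posterior variable name (P_targetY), the original priors, and the sample proportions are hypothetical values:

/* Hypothetical illustration of adjusting a sampled-data posterior back to the
   original priors: p_adj = p*(pi1/rho1) / ( p*(pi1/rho1) + (1-p)*(pi0/rho0) ) */
data work.adjusted;
    set work.scored_sample;            /* holds P_targetY from the tree model  */
    pi1  = 0.01;  pi0  = 0.99;         /* original (population) priors         */
    rho1 = 0.10;  rho0 = 0.90;         /* class proportions in the sample      */
    p_adj = (P_targetY*(pi1/rho1)) /
            (P_targetY*(pi1/rho1) + (1 - P_targetY)*(pi0/rho0));
run;

When the adjusted priors are identical to the proportions of the data actually fed to the tree, both ratios equal 1 and p_adj reduces to the raw posterior, which appears to be the situation the "ratio is always 1" remark describes.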
