ajosh
Calcite | Level 5

Hi All,

I am working on a classification problem with a highly imbalanced dataset, where the ratio of bads to goods is around 1:99. Total records = 0.4 million. To tackle the imbalance, I have partitioned the data into training and validation datasets and am using a "boosting" approach by embedding a Decision Tree node between the Start Groups and End Groups nodes, with a total of 10 iterations.

As I understand it, boosting creates multiple iterations of trees with the overall objective of reducing misclassification, by assigning higher weights to misclassified records than to correctly classified ones.
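To make sure my understanding is right, here is that weight-update idea as a rough sketch outside of EM, in Python/scikit-learn on simulated data (the dataset, tree depth, and AdaBoost-style formulas are illustrative, not what EM does internally):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Simulated imbalanced data (~1% events, like the 1:99 bads-to-goods ratio).
X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)

w = np.full(len(y), 1.0 / len(y))   # start with uniform record weights
trees, alphas = [], []

for _ in range(10):                 # 10 iterations, as in the Start Groups node
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y, sample_weight=w)
    miss = tree.predict(X) != y
    err = w[miss].sum() / w.sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    # The core of boosting: up-weight misclassified records so the
    # next tree concentrates on them.
    w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
    w /= w.sum()
    trees.append(tree)
    alphas.append(alpha)
```

After each loop, the records the current tree got wrong carry more weight into the next fit, which is the behavior described above.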

I later merged the exported training and validation datasets from the End Groups node to recover the entire dataset. However, I observed that for a given node number, the predicted probability of target = Y is not the same across all records with that node number. The ranges of these predicted probabilities also overlap across node numbers. In addition, the End Groups results window shows around a 60% true positive rate and a 70% true negative rate, which means a good amount of classification is happening due to the boosting approach.

My end objective is to derive patterns/if-then rules from such a dataset. Is anyone aware of how this can be accomplished (e.g., is there another node that needs to be used on the exported dataset of the End Groups node)?

I would highly appreciate any leads on how this can be done!

Regards.

8 REPLIES
JasonXin
SAS Employee

Hi, Ajosh,

Not sure which version of EM you have. Perhaps the solution is 'versionless'?

On my version, 12.3 EM, right-click the Gradient Boosting node (assuming you have finished a successful run). In the drop-down, select the Results entry. In the window that opens, the menu bar in the upper-left corner has several menus; one should be View. Select View --> SAS Results --> Flow Code. For boosting you should see a long series of if-then statements. Take a look and see if this is what you are looking for.

A much 'cleaner' piece of score code can be found, after the model is finalized, under View --> Scoring --> SAS Code.

Best Regards

Jason Xin

(from SAS)

ajosh
Calcite | Level 5

Hi Jason,

I am using SAS EM 7.1 for the above-mentioned analysis. Thanks for your response; I shall check the flow code to see whether I can derive the patterns as a series of if-then statements/rules. I would like to point out that I am not using the Gradient Boosting node, but the following process flow: Input Data --> Filter (to remove irrelevant records based on business criteria) --> Data Partition (70:30, train:validate) --> Start Groups --> Decision Tree --> End Groups. In the Start Groups node of this flow, I am using the "boosting" option with number of loops/iterations = 10. No undersampling of the majority class or adjustment of priors is done; even so, I get sufficiently good true positive and true negative rates. As I mentioned earlier, I shall check the flow code to see if I can make sense of it.

One follow-up question: how do I determine the correct cut-off for the list of rules found by the above procedure? (I am conversant with the Cutoff node and how to use it to determine the optimal cutoff, based on the change in the count of TP cases and so on.) Does cut-off threshold determination apply to the above scenario as well?
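For context, my understanding of the cut-off logic is that it applies to any set of posterior probabilities, including those exported by the End Groups node: sweep candidate thresholds and pick the one that optimizes some criterion. A minimal sketch in Python on simulated probabilities (the data and the Youden's J criterion are illustrative assumptions; in EM the Cutoff node automates this step):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated validation set: ~1% events, with events scoring somewhat higher.
y = (rng.random(10000) < 0.01).astype(int)
p = np.clip(rng.normal(0.05 + 0.25 * y, 0.10), 0.0, 1.0)

best_cut, best_j = 0.5, -1.0
for cut in np.linspace(0.01, 0.99, 99):
    pred = (p >= cut).astype(int)
    tp = ((pred == 1) & (y == 1)).sum()
    fn = ((pred == 0) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    tn = ((pred == 0) & (y == 0)).sum()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    j = tpr - fpr            # Youden's J; a cost-weighted metric could be swapped in
    if j > best_j:
        best_cut, best_j = cut, j

print(best_cut)
```

The chosen threshold then re-labels the scored records downstream; the fitted model itself is untouched.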

Also, I have a few questions on deriving patterns from imbalanced data, as follows:

Note: I am using the following process flow for this analysis: Input Data --> Filter --> Sample --> Decision Tree --> Cutoff node. I am using adjusted priors and decision weights in the decision-processing options of the Input Data node, with the following characteristics: adjusted priors (set to the original priors, used to counter the effect of taking a balanced 50:50 sample) and decision weights to enforce the relative importance of each of the 4 outcomes: TP, FP, TN and FN.

Questions are:

1) When I use the adjusted priors and decision weights in the Input Data node, and set "Use Priors" = Yes and "Use Decisions" = No in the Decision Tree node properties, I get a nicely developed decision tree; however, the event classification table shows zero true positives and zero false positives. What could be the possible reasons for not getting any TPs? My current dataset doesn't have a profit variable for the records; is it mandatory to have such a variable in the first place?

2) How should I implement the changed cut-off to rectify the decisions from the decision tree? Would I still need the cut-off even if I am using adjusted priors as well as decision weights? If yes, is there a way to correct the decisions made by the Tree node? (I have run the Cutoff node once, changed the default from 0.5 to, say, 0.28, and re-run the node, but the decision tree has not changed a bit.)
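To illustrate what I expected the cut-off change to do: as I understand it, the cut-off only re-labels the posterior probabilities, which would explain why the tree itself looks unchanged. A tiny sketch in Python with made-up probabilities:

```python
import numpy as np

# Posterior probabilities from a fitted model (the model itself is untouched).
p = np.array([0.10, 0.25, 0.30, 0.55, 0.80])

dec_default = (p >= 0.50).astype(int)   # the default 0.5 cutoff
dec_custom  = (p >= 0.28).astype(int)   # the lowered 0.28 cutoff

print(dec_default)  # [0 0 0 1 1]
print(dec_custom)   # [0 0 1 1 1]
```

Lowering the threshold flips some records from non-event to event, but the tree splits and node probabilities stay exactly as fitted.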

I would appreciate it if you could guide me on the correct approach. My private email id is: adityajosh@gmail.com. Thanks for going through such a long description.

Regards,

Aditya.

Reeza
Super User

The English Rules output should give you the IF/THEN rules from a tree in a form usable in SAS.

Getting Started with SAS(R) Enterprise Miner(TM) 6.1

I'd suggest posting your new questions as a new thread.

JasonXin
SAS Employee

Ajosh,

Let me find a virtual machine that runs EM 7.1. There are quite a few changes between EM 7.1 and 12.3. I will get back to you later.

Best

Jason Xin

JasonXin
SAS Employee

Ajosh,

Let us focus on your original question regarding boosting, specifically this portion:

"

I later merged the exported training and validation datasets from the End Groups node to recover the entire dataset. However, I observed that for a given node number, the predicted probability of target = Y is not the same across all records with that node number. The ranges of these predicted probabilities also overlap across node numbers. In addition, the End Groups results window shows around a 60% true positive rate and a 70% true negative rate, which means a good amount of classification is happening due to the boosting approach.

My end objective is to derive patterns/if-then rules from such a dataset. Is anyone aware of how this can be accomplished (e.g., is there another node that needs to be used on the exported dataset of the End Groups node)?

"

1. For the first portion: you merged the training and validation datasets (you selected the End Group node and went to the Exported Data property to find the datasets underneath, right?). The model trained and the model validated are physically two different ones, although logically the same, since one is the validated, balanced version of the other. I wonder if you could just look at one of them at a time. Whether it is to report model performance or to extract score code, analytically you should stick with what comes off the validation data. It is indeed good practice to try to minimize the difference between training and validation results; in other words, if the gap is big and varies from attempt to attempt, you may want to train 'better' to close the gap. Also keep watching whether performance on the validation dataset is improving, or at least stable.

2. As for the rules: please disregard my previous remark. It was largely correct, but I assumed you were doing SGB (Stochastic Gradient Boosting). I checked and don't see any major difference on this subject between EM 7.1 and EM 12.3 (the two user guides appear largely the same on group processing), so I am going to use EM 12.3 to speak about EM 7.1 here. For End Groups processing, if you go looking for Flow Code and Score Code, you will find that group processing offers everything but Flow Code and Score Code, unlike SGB. I can expand quite a bit on this, but I want to stay focused on what you want. Can you clarify a bit why you need to "derive patterns/if then rules from such a dataset"? For example, are you trying to port the rules elsewhere to score, or just to study further the mechanics of the boosting process?

I agree with Reeza regarding your follow-up questions on cut-off; you may get better responses if you post them as separate questions.

Best Regards

Jason Xin

ajosh
Calcite | Level 5

Hi Jason,

On the original question on boosting: you are right, I merged the "exported" train and validate datasets from the End Groups node and found that, for every node, the predicted probability of target = Y has multiple values rather than a single distinct value as in the case of regular DTs. I then isolated the "validate" part of the exported dataset and found the same phenomenon. However, in this exported validate dataset, 3 nodes (out of the 50-odd nodes) were such that the min and max probabilities associated with them were zero. For the others, the min and max differ, and these ranges overlap across nodes.
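My suspicion is that this happens because the final posterior is a combination across all 10 iterations: two records sharing a leaf in one tree can land in different leaves of the other trees. A toy illustration with made-up numbers (the leaf probabilities and iteration weights are purely hypothetical):

```python
import numpy as np

# leaf_probs[i][t] = P(target=Y) from tree t's leaf for record i (hypothetical values).
leaf_probs = np.array([
    [0.40, 0.10, 0.70],   # record A
    [0.40, 0.60, 0.20],   # record B (same tree-1 leaf as A: 0.40)
])
weights = np.array([0.5, 0.3, 0.2])   # per-iteration model weights (illustrative)

# Final posterior = weighted combination across iterations.
posterior = leaf_probs @ weights
print(posterior)   # differs between A and B despite the shared tree-1 node
```

So records with the same node number from one tree still end up with different (and overlapping) posteriors once all iterations are combined.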

Lastly, to clarify what I meant by deriving patterns: I want to find the English rules for every leaf node even though I am using the boosted tree approach. These can easily be derived from the results window of a regular DT run. I am not aware whether this is possible here, or, if it is, whether I have to use another node in EMiner so that such rules can be derived.

Regards,

Aditya.

ajosh
Calcite | Level 5

I was able to figure out a way to find patterns from the output of boosted trees. It's as follows: merge the exported data from the End Groups node (it could be any other modeling node), feed this dataset in as input data, set the role of the column _node_ to target and set the original target's role to rejected, attach a Decision Tree node to this input data, run the node to grow the largest tree, and check the English rules for this tree.

With this process we get some pure leaf nodes, corresponding to the earlier boosted nodes, expressed in terms of the realized values of the independent variables.
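For anyone who wants to try the analogous idea outside EM, here is a rough Python/scikit-learn sketch of the same trick (the gradient-boosting model stands in for the End Groups output, the last tree's leaf id stands in for _node_, and all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Stand-in for the boosted model behind the End Groups node.
booster = GradientBoostingClassifier(n_estimators=10, max_depth=2,
                                     random_state=0).fit(X, y)

# _node_ analog: the leaf each record falls into in the final tree.
node_id = booster.apply(X)[:, -1, 0].astype(int)

# Surrogate tree: predict the node id from the inputs, then read its rules
# (the EM equivalent of checking the English rules of the new tree).
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, node_id)
rules = export_text(surrogate, feature_names=[f"x{i}" for i in range(5)])
print(rules)
```

The printed rules describe each boosted node in terms of the input variables, which is exactly the pattern extraction described above.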

The only issue is that some of the leaf nodes may not be pure, meaning they contain some proportion of more than one of the boosted nodes. I need to see how such nodes can be addressed. Hope this info is useful to those facing similar issues; I would like to hear opinions from you all.

Regards,

Aditya

anna_holland
SAS Employee

Thank you, JasonXin, for all your help! Ajosh, I think you're all set. If you could, please mark this thread as answered correctly. If another user runs into the same issue, they can use this thread as a reference.

Regards,

-Anna-Marie

