
Scoring undersampled imbalanced dataset

New Contributor
Posts: 4



Hello,
I am working on an imbalanced dataset in which 15% of the cases belong to the class of interest. To counter the model's strong bias towards the majority class, I drew a stratified sample with equal class sizes (50/50) for the training and validation sets.

 

I am now having trouble scoring my model, because I would like to use a test set with the original proportions (15%/85%). I tried editing the target profile to assign those prior proportions in the "Score using test dataset" node shown in the attached figure, but when I run the "Score[apply]" node it still uses a score set with equal class proportions (I can easily see this from the Insight node).

Does anybody know how to overcome this problem? I am using SAS Enterprise Miner in SAS 9.3.

 

All help greatly appreciated.
 
Capture.JPG
Super User
Posts: 17,840


I'm just a touch confused and possibly out of my depth here, but from what I understand of scoring, the prior probabilities do not come into play, even in a tree diagram. The rules are applied the same way regardless of the proportions in the sample.

 

Did you manually create your partitioned data, or did you use a prior probability setting to set up the 50/50 data sets?
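
For readers following along: the split rules of a model trained on a 50/50 sample stay the same whatever data you score, but the predicted probabilities reflect the 50/50 mix. A common way to bring them back to the original 15%/85% mix is to rescale each posterior by the ratio of the true prior to the sample prior and renormalise. The snippet below is only a minimal sketch of that arithmetic in Python, not Enterprise Miner code; the function name `adjust_posterior` and its arguments are made up for illustration.

```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale a posterior from a resampled training set to the true class prior.

    p_sample     : P(event | x) from the model trained on the balanced sample
    sample_prior : event proportion in the training sample (0.5 in this thread)
    true_prior   : event proportion in the original data (0.15 in this thread)
    """
    # Weight each class by (true prior / sample prior), then renormalise.
    event = p_sample * true_prior / sample_prior
    non_event = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return event / (event + non_event)


# A score of 0.5 on the balanced sample maps back to the 15% base rate.
print(round(adjust_posterior(0.5, 0.5, 0.15), 4))  # prints 0.15
# A confident score of 0.9 on the balanced sample is much less extreme after adjustment.
print(round(adjust_posterior(0.9, 0.5, 0.15), 4))  # prints 0.6136
```

Note that the adjustment is monotonic, so the ranking of cases does not change; only the probability scale and therefore any fixed cutoff are affected.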

New Contributor
Posts: 4


Thank you Reeza.
I created a 50-50 sample from the Input Node by going to Stratification-->Options-->Equal Size.
This appears to work fine when a Tree is run; however, problems arise when I try the scoring, as I would like to use a test set with the original proportions.
In the end I created two files in SQL: one for training and validation (with 50-50 proportions) and one for testing with the original proportions. I simply define the partition differently, since my test set comes from a separate file.
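
For anyone wanting to reproduce this manual workaround outside SQL, here is a rough sketch in plain Python of the same idea: hold out a test set that keeps the original class mix, then undersample the majority class in what remains to get a 50-50 set for training and validation. The function name, the `target` column, the 30% test fraction and the toy data are all assumptions made for the example, not details from my actual files.

```python
import random


def split_with_balanced_training(rows, target, test_fraction=0.3, seed=42):
    """Return (balanced_rows, test_rows) from a list of dict records."""
    rng = random.Random(seed)
    events = [r for r in rows if r[target] == 1]
    non_events = [r for r in rows if r[target] == 0]
    rng.shuffle(events)
    rng.shuffle(non_events)

    # Stratified hold-out: take the same fraction of each class, so the
    # test set keeps the original event rate.
    n_test_events = round(len(events) * test_fraction)
    n_test_non_events = round(len(non_events) * test_fraction)
    test = events[:n_test_events] + non_events[:n_test_non_events]

    # Undersample the majority class in the remainder to reach 50-50.
    rest_events = events[n_test_events:]
    rest_non_events = non_events[n_test_non_events:]
    n = min(len(rest_events), len(rest_non_events))
    balanced = rest_events[:n] + rest_non_events[:n]
    rng.shuffle(balanced)
    return balanced, test


# Toy data with a 15% event rate, mirroring the proportions in this thread.
rows = [{"id": i, "target": 1 if i < 150 else 0} for i in range(1000)]
balanced, test = split_with_balanced_training(rows, "target")
print(sum(r["target"] for r in balanced) / len(balanced))  # prints 0.5
print(sum(r["target"] for r in test) / len(test))          # prints 0.15
```

Taking the test set out before undersampling keeps the test records out of the training data, so the test results are not contaminated.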

I also noticed that, when I created the 50-50 sample from the original file, the decision tree worked fine and gave me a nice confusion matrix, but the neural network and regression models ignored the equal-size option, resulting in a confusion matrix with 14% of the observations belonging to the class of interest. I no longer have this problem because I solved it manually, but do you know whether Stratification-->Options-->Equal Size normally causes such issues?

Many thanks,

a_bloch
