chemicalab
Fluorite | Level 6

Hi all,

Is there any source or idea on how to perform optimal binning in SAS Base, besides complicated algorithms that usually don't work?

I am looking for something straightforward, or a simpler version that is easy to understand.

Thank you in advance

1 ACCEPTED SOLUTION

Accepted Solutions
M_Maldonado
Barite | Level 11

CART and CHAID are just decision tree algorithms. Yes, you will find some groupings using decision trees, but I think chemicalab wants an optimal grouping solution, not just a grouping. Tree-based grouping is a good start, though, and I think EM binning has that too.


22 REPLIES
ballardw
Super User

What do you have to bin, and how many bins do you think you'll need? A brief description of the input data and the desired output would be helpful.

chemicalab
Fluorite | Level 6

Well, I have, let's say, two continuous variables against a binary response. The number of bins should be whatever is optimal for the data I have, although I would prefer a maximum of 4 or 5 so that I am able to control the signs of the estimates in the regression later on.

ballardw
Super User

If you know the ranges of interest in the continuous variables I would recommend a custom format for each variable. Most of the analysis procedures will use the formatted value either by default or can be set to use the format. If the ranges don't work quite the way you want then you just change the format definition and don't have to create new variables.
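To make the idea concrete, here is a minimal sketch of what a range format does: it maps a raw value to a range label at lookup time, so changing the cut points changes the grouping without creating a new variable. It is written in Python only as a language-neutral illustration; in SAS the same mapping would be defined with the VALUE statement of PROC FORMAT. The cut points, labels, and the name `bin_label` are made up for the example.

```python
from bisect import bisect_right

# Hypothetical cut points and labels for one continuous variable.
# Edit these two lists to change the grouping; the data stay untouched.
CUTS = [20.0, 35.0, 50.0]
LABELS = ["low - <20", "20 - <35", "35 - <50", "50 - high"]

def bin_label(value, cuts=CUTS, labels=LABELS):
    """Return the label of the half-open range [lower, upper) holding value."""
    if value is None:
        return "missing"  # missing values get their own group
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the range the value falls in.
    return labels[bisect_right(cuts, value)]

if __name__ == "__main__":
    for v in [5.0, 20.0, 49.9, 72.0, None]:
        print(v, "->", bin_label(v))
```

Because the grouping lives in the lookup rather than in the data, trying a different set of cutoffs only means editing `CUTS` and `LABELS`.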


Reeza
Super User

There's also the question of why bin at all, rather than use the continuous variable?


If you're in Enterprise Miner, consider running a tree and seeing where the cutoffs occur for those variables.

chemicalab
Fluorite | Level 6

Unfortunately I am not in EM; otherwise I would use interactive binning to define the bins myself based on event rate and Gini. In general, binning is preferable for handling rare levels and missing values as well as outliers. Besides, the relationship between a continuous variable and the target is not always linear, so using it as-is would lead to an unstable model because the effects wouldn't be captured.

Reeza
Super User

OK, are you looking for how to determine the cutoffs for the bins, or how to implement said cutoffs?

If it's how to implement them, the format suggestion is the one I'd also recommend.

chemicalab
Fluorite | Level 6

Yes, I guess that's what optimal binning would do, so yes, I am looking for the cutoff values for creating bins that would give me the highest Gini ratio.

I was just wondering if there was any tested way to do this in code that I could try out, instead of depending on some descriptives and IF-THEN-ELSE clauses.

Basically that's what I need, because I can calculate the Gini myself once I have the groups formed.
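For readers following along, here is a rough sketch of that last step: scoring a candidate grouping once the bins are formed. It assumes "Gini" here means the credit-scoring Gini coefficient (2 × AUC − 1), with each bin's event rate used as the score; the thread does not confirm which definition is meant. It is written in Python purely as an illustration, and the function name and sample counts are invented.

```python
def gini_from_bins(bins):
    """Gini coefficient (2*AUC - 1) of a binned variable vs. a binary target.

    bins: list of (n_events, n_nonevents) tuples, one per bin, in any order.
    Each bin's event rate acts as the score assigned to its observations.
    """
    bins = [(e, n) for e, n in bins if e + n > 0]  # ignore empty bins
    total_events = sum(e for e, _ in bins)
    total_nonevents = sum(n for _, n in bins)
    if total_events == 0 or total_nonevents == 0:
        raise ValueError("need at least one event and one non-event")

    # Walk the bins from lowest to highest event rate.
    ordered = sorted(bins, key=lambda b: b[0] / (b[0] + b[1]))
    concordant = 0.0
    nonevents_below = 0
    for events, nonevents in ordered:
        # Events in this bin outrank all non-events in lower-rate bins
        # and tie (count one half) with non-events in the same bin.
        concordant += events * (nonevents_below + 0.5 * nonevents)
        nonevents_below += nonevents

    auc = concordant / (total_events * total_nonevents)
    return 2.0 * auc - 1.0

if __name__ == "__main__":
    # Invented counts: three bins with event rates 10%, 30%, 60%.
    print(gini_from_bins([(10, 90), (30, 70), (60, 40)]))
```

Comparing this number across alternative sets of cutoffs gives a simple way to rank candidate groupings, keeping in mind that more bins will rarely lower it.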

chemicalab
Fluorite | Level 6

Any suggestions or an example on the format? I am not sure I follow.

Thanks

M_Maldonado
Barite | Level 11

Read about dynamic formats and informats here: http://www.lexjansen.com/pharmasug/2005/posters/po06.pdf

Just one of many papers on the subject.

What are the complicated algorithms that are not working for you?

chemicalab
Fluorite | Level 6

Many of the ones you find on the net that supposedly create optimal bins. Also, the macros in data mining preparation tools do not work as they should. Anyway, I will look into the paper you sent me. Thank you for the reply.

M_Maldonado
Barite | Level 11

I know advanced constraints sometimes have to be tweaked to get the most out of Interactive Binning or Enterprise Miner. If you are not getting the results you expect, contact SAS Technical Support: http://support.sas.com/techsup/contact/. They usually reply very fast.

Good luck!

chemicalab
Fluorite | Level 6

Yes, you are right. In EM, with the Interactive Binning node, I have spent a lot of time creating my own splits because the ones you get are not always so good; in most cases it doesn't create bins even when there are opportunities for it. I guess I will create something in macro form so I can change the cutoffs each time, because I am afraid that the automated approaches you usually encounter produce more bins than the regression can support later on, since they are only looking to maximize the Gini ratio.

Thank you for your time

Reeza
Super User

That sounds like a CART or CHAID process with a single variable, perhaps? Could you use the CHAID macro that's out there? You may need to contact

http://listserv.uga.edu/cgi-bin/wa?A2=ind1309C&L=sas-l&D=0&P=5029

M_Maldonado
Barite | Level 11

CART and CHAID are just decision tree algorithms. Yes, you will find some groupings using decision trees, but I think chemicalab wants an optimal grouping solution, not just a grouping. Tree-based grouping is a good start, though, and I think EM binning has that too.
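As a rough illustration of tree-based grouping on a single variable, the sketch below repeatedly splits whichever segment gives the largest drop in Gini impurity, stopping at a maximum number of bins. It is a greedy CART-style heuristic, not a guaranteed optimal binning, and it is not the CHAID macro or the EM node mentioned above. Note that Gini impurity is the tree-splitting criterion, which is not the same thing as the Gini ratio discussed earlier. It is written in Python only to show the logic; all names and the `max_bins` / `min_size` parameters are hypothetical.

```python
def gini_impurity(events, total):
    """Gini impurity 2p(1-p) of a segment with `events` ones out of `total`."""
    if total == 0:
        return 0.0
    p = events / total
    return 2.0 * p * (1.0 - p)

def best_split(pairs, min_size):
    """Best cut inside one segment of (x, y) pairs sorted by x.

    Returns (gain, cut_value, index) or None if no split improves impurity.
    """
    n = len(pairs)
    total_events = sum(y for _, y in pairs)
    parent = gini_impurity(total_events, n)
    best = None
    left_events = 0
    for i in range(1, n):
        left_events += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between two equal x values
        if i < min_size or n - i < min_size:
            continue  # would create a bin that is too small
        child = (i * gini_impurity(left_events, i)
                 + (n - i) * gini_impurity(total_events - left_events, n - i)) / n
        gain = n * (parent - child)  # size-weighted, comparable across segments
        if gain > 1e-12 and (best is None or gain > best[0]):
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2.0, i)
    return best

def tree_bins(x, y, max_bins=5, min_size=1):
    """Greedy cut points for x against a 0/1 target y, at most max_bins bins."""
    segments = [sorted(zip(x, y))]
    cuts = []
    while len(segments) < max_bins:
        candidates = []
        for k, seg in enumerate(segments):
            split = best_split(seg, min_size)
            if split is not None:
                candidates.append((split, k))
        if not candidates:
            break  # no segment can be improved further
        (gain, cut, i), k = max(candidates, key=lambda c: c[0][0])
        seg = segments.pop(k)
        segments[k:k] = [seg[:i], seg[i:]]
        cuts.append(cut)
    return sorted(cuts)

if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    y = [0, 0, 0, 1, 1, 1, 0, 0, 0]
    print(tree_bins(x, y, max_bins=3))
```

Capping `max_bins` at 4 or 5 and raising `min_size` is one way to keep the result small enough to support in a later regression, at the cost of a lower achievable Gini.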
