Hi all
I'm in the process of training an NB model based on continuous features, which need Equal Frequency Discretization before they can be used.
I plan to use PROC RANK for this:
proc rank data=mydata groups=10 out=newdata;
var x z;
ranks decile_x decile_z;
run;
Now, the question I'm facing is whether the discretization needs to be performed
- separately for train and score set
- appending train and score set together
The second approach seems natural to me, since the train and score sets can have different distributions for each variable. Binning them separately would generate different deciles and therefore different discretization results for the same variable values in the two datasets.
However, I have to admit I'm not finding much material on this topic, so I'd like to ask the community whether there is any knowledge to share or links to consult.
Best
Hello @dcortell ,
If you make your continuous variables discrete by making deciles, you should do that for the training data only (or better said, on the basis of the training data only).
Then apply the same discretization (same boundaries as calculated for TRAIN) to validation, test and score datasets.
Of course, for validation, test and score data sets, this is no longer an exact equal frequency binning. But it should be close to that. If not, your analysis is not "robust" enough.
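A minimal sketch of that idea in plain code, under stated assumptions: the data set names TRAIN and SCORE and the variable x are hypothetical, and the `>=` comparison at the boundaries is my own choice, so ties may be assigned differently than PROC RANK would assign them.

```sas
/* Assumed names: TRAIN and SCORE data sets, continuous variable x */

/* 1. Compute the 9 decile boundaries on the training data only */
proc univariate data=train noprint;
   var x;
   output out=cuts_x pctlpts=10 to 90 by 10 pctlpre=c_;
run;

/* 2. Apply those same boundaries to another data set */
data score_binned;
   if _n_ = 1 then set cuts_x;        /* one-row table: c_10 ... c_90 */
   set score;
   array c{9} c_10 c_20 c_30 c_40 c_50 c_60 c_70 c_80 c_90;
   if missing(x) then decile_x = .;
   else do;
      decile_x = 0;                   /* groups 0-9, like PROC RANK GROUPS=10 */
      do i = 1 to 9;
         if x >= c{i} then decile_x = i;
      end;
   end;
   drop i c_:;
run;
```

With this approach, a value below the smallest or above the largest training value still gets a bin: it simply lands in group 0 or group 9.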
Good luck,
Koen
When you say "Of course, for validation, test and score data sets, this is no longer an exact equal frequency binning", do you mean that some values appearing in the score set may not be present in the original train set, so that no boundary derived from the train set covers them and they get no decile assignment as a consequence?
Hello,
What I mean is :
If you are doing an equal-frequency binning on the training set for variable X1 , then you make groups with an equal amount of observations (which is not the case with equal-interval binning).
For 10 groups (deciles), you have 9 boundaries.
If you discretize variable X1 in the VALID / TEST / SCORE dataset with these same 9 boundaries (calculated on the TRAIN set), you will no longer have equally-sized groups. However, the "bias" should NOT be too big; otherwise TRAIN, VALID and TEST are too different from each other, and that's not good for building a model. The distribution of the inputs should be more or less equal in all data sets.
That's why you calculate for example a population stability index (PSI) when scoring a new set (batch scoring). If there's a shift in the distribution(s), your model may not be valid anymore for the new data and may perform poorly.
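As a rough illustration, the PSI for one binned variable is usually computed as the sum over bins of (q - p) * log(q / p), where p is the bin's share in TRAIN and q its share in the new data. The sketch below assumes two hypothetical data sets, TRAIN_BINNED and SCORE_BINNED, that both carry a bin variable decile_x created with the training boundaries. The small floor of 1e-6 for empty bins is my own assumption to avoid taking the log of zero.

```sas
/* Bin shares in the training data (PERCENT is on a 0-100 scale) */
proc freq data=train_binned noprint;
   tables decile_x / out=train_pct(keep=decile_x percent
                                   rename=(percent=pct_train));
run;

/* Bin shares in the data to be scored */
proc freq data=score_binned noprint;
   tables decile_x / out=score_pct(keep=decile_x percent
                                   rename=(percent=pct_score));
run;

/* One PSI term per bin; a bin missing from one side gets the floor */
data psi_terms;
   merge train_pct score_pct;
   by decile_x;
   p = max(pct_train / 100, 1e-6);
   q = max(pct_score / 100, 1e-6);
   psi_term = (q - p) * log(q / p);
run;

/* PSI = sum of the per-bin terms */
proc means data=psi_terms sum;
   var psi_term;
run;
```

A PSI near zero means the bin shares in the new data are close to those in TRAIN. Which threshold counts as a meaningful shift is a rule of thumb that you would need to choose for your own application.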
Cheers,
Koen
Also, any material to share on the topic?
Are you using Enterprise Miner or are you using Model Studio?
Which version? Enterprise Miner 15.2, Model Studio in Viya 2021.1.4 (August 2021)?
Or are you using plain code?
Thanks,
Koen
Plain code
Plain code, hmmmm ....
That's a pity, because EMiner and Model Studio offer this capability code-free (a no-code solution).
And unfortunately,
PROC RANK has no CODE statement, no STORE statement and no SCORE statement.
You need one of these statements in order to be able to produce score code that you can apply to other datasets.
Does it matter whether you use PROC RANK or another procedure to do your equal-frequency binning?
I would use PROC HPBIN instead.
See here :
The essential guide to binning in SAS
By Rick Wicklin on The DO Loop August 7, 2019
https://blogs.sas.com/content/iml/2019/08/07/essential-guide-binning-sas.html
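A minimal sketch of how this could look, not a definitive implementation: the data set names TRAIN and SCORE and the file path are hypothetical, and x and z are the variables from the original question. To my knowledge the generated score code writes the binned values to variables with a BIN_ prefix (for example BIN_x), but please check that against your own output.

```sas
/* Learn the bin boundaries on the training data only */
proc hpbin data=train numbin=10 pseudo_quantile;
   input x z;
   code file="/path/to/bin_score.sas";   /* hypothetical path */
   ods output Mapping=bin_map;           /* table with the bin boundaries */
run;

/* Apply the same boundaries to the data to be scored */
data score_binned;
   set score;
   %include "/path/to/bin_score.sas";
run;
```

The same %INCLUDE step can be repeated for the validation and test data, so that every data set is binned with the boundaries calculated on TRAIN.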
Koen
Hello,
Are you using Enterprise Miner or are you using Model Studio?
Which version? Enterprise Miner 15.2 in SAS 9.4 M7, Model Studio in Viya 2021.1.4 (August 2021)?
Tell us.
Or are you using plain code?
Thanks,
Koen
I was checking the PROC RANK output, and it seems the cutoff points for each rank are not provided as output. As a proxy for the cutoffs, I could extract the min and max values of each variable within each rank, but I'm curious whether there is a proc or option in SAS that does this directly.
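The min/max proxy described above could be sketched like this, using the newdata output and the decile_x rank variable from the PROC RANK step in the first post. The output names cutoffs_x, min_x and max_x are just illustrative.

```sas
/* Observed range of x within each PROC RANK group */
proc means data=newdata noprint nway;
   class decile_x;
   var x;
   output out=cutoffs_x(drop=_type_ _freq_) min=min_x max=max_x;
run;

proc print data=cutoffs_x;
run;
```

Note that these are the observed extremes per group, not true cutoffs: values falling in the gap between one group's max and the next group's min would still need a rule.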
See also my earlier response !!
Try PROC HPBIN.
Koen