☑ This topic is solved.
dcortell
Pyrite | Level 9

Hi all

 

I'm in the process of training a naive Bayes (NB) model on continuous features that need equal-frequency discretization before they can be used.

 

I plan to use PROC RANK for this:

 

proc rank data=mydata groups=10 out=newdata;
var x z;
ranks decile_x decile_z;
run;

My question is whether the discretization needs to be performed

- separately for the train and score sets

- on the train and score sets appended together

 

The second approach feels natural to me, because the train and score sets can have a different distribution for each variable. Binning them separately would produce different deciles and therefore different discretization results for the same variable values in the two datasets.

 

However, I have to admit I'm not finding much material on this topic, so I'd like to ask the community whether anyone has knowledge to share or links to consult.

 

Best,

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

See also my earlier response!
Try PROC HPBIN.

 

Koen


10 REPLIES
sbxkoenk
SAS Super FREQ

Hello @dcortell ,

 

If you discretize your continuous variables into deciles, you should do that on the training data only (or, better said, on the basis of the training data only).

 

Then apply the same discretization (same boundaries as calculated for TRAIN) to validation, test and score datasets.

Of course, for validation, test and score data sets, this is no longer an exact equal frequency binning. But it should be close to that. If not, your analysis is not "robust" enough.
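A minimal plain-code sketch of this "compute on TRAIN, apply everywhere" idea. It assumes datasets named TRAIN and SCORE that both contain X; those names, CUT_X and SCORE_BINNED are placeholders and not from this thread. The simple > comparison may treat values tied with a cutoff differently from PROC RANK.

```sas
/* Assumption: TRAIN and SCORE both contain the numeric variable x */

/* 1. Nine decile cutoffs (10th ... 90th percentile) from TRAIN only */
proc univariate data=train noprint;
   var x;
   output out=cut_x pctlpts=10 to 90 by 10 pctlpre=p;   /* creates p10 ... p90 */
run;

/* 2. Apply the TRAIN cutoffs to another dataset */
data score_binned;
   if _n_ = 1 then set cut_x;        /* cutoffs are retained on every row */
   set score;
   array cutpt{9} p10 p20 p30 p40 p50 p60 p70 p80 p90;
   if missing(x) then decile_x = .;
   else do;
      decile_x = 0;                  /* groups numbered 0-9, as with GROUPS=10 */
      do i = 1 to 9;
         if x > cutpt{i} then decile_x = i;
      end;
   end;
   drop i p10 p20 p30 p40 p50 p60 p70 p80 p90;
run;
```

With this logic, score values below the smallest or above the largest TRAIN cutoff still land in group 0 or group 9, so every nonmissing value receives a decile.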

 

Good luck,

Koen

dcortell
Pyrite | Level 9

When you say "Of course, for validation, test and score data sets, this is no longer an exact equal frequency binning", do you mean that some values that appear in the score set may not be present in the original train set, so that no derived boundary covers them and they end up without a decile assignment?

sbxkoenk
SAS Super FREQ

Hello,

 

What I mean is:

If you do an equal-frequency binning on the training set for variable X1, you make groups with an equal number of observations (which is not the case with equal-interval binning).

For 10 groups (deciles), you have 9 boundaries.

If you discretize variable X1 in the VALID / TEST / SCORE dataset with these same 9 boundaries (calculated on the TRAIN set), you will no longer have equally sized groups. However, the "bias" should NOT be too big, of course; otherwise TRAIN, VALID and TEST are too different from each other, and that's not good for building a model. The distribution of the inputs should be more or less equal in all data sets.
That's why you calculate, for example, a population stability index (PSI) when scoring a new set (batch scoring). If there's a shift in the distribution(s), your model may no longer be valid for the new data and may perform poorly.
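For reference, PSI is usually computed per variable as the sum over bins of (a - e) * log(a / e), where e is the share of TRAIN observations in a bin and a is the share of new observations in the same bin. A rough sketch follows. It assumes datasets TRAIN_BINNED and SCORE_BINNED that each carry DECILE_X assigned from the TRAIN boundaries; those names and the 0.01% floor for empty bins are illustrative choices.

```sas
/* Assumption: decile_x in both datasets was assigned with the TRAIN cutoffs */
proc freq data=train_binned noprint;
   tables decile_x / out=f_train(keep=decile_x percent rename=(percent=pct_train));
run;

proc freq data=score_binned noprint;
   tables decile_x / out=f_score(keep=decile_x percent rename=(percent=pct_score));
run;

data psi_x;
   merge f_train f_score end=last;
   by decile_x;
   /* floor empty or missing bins so the log stays defined (illustrative choice) */
   e = max(pct_train, 0.01) / 100;
   a = max(pct_score, 0.01) / 100;
   psi + (a - e) * log(a / e);       /* running sum over the bins */
   if last then output;
   keep psi;
run;
```

A commonly quoted rule of thumb reads PSI below about 0.1 as stable and above about 0.25 as a substantial shift, but treat those thresholds as conventions, not hard limits.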

 

Cheers,

Koen

dcortell
Pyrite | Level 9

Also, any material to share on the topic?

sbxkoenk
SAS Super FREQ

Are you using Enterprise Miner or are you using Model Studio?
Which version? Enterprise Miner 15.2 , Model Studio in VIYA 2021.1.4 (August 2021)?

 

Or are you using plain code?

 

Thanks,

Koen

dcortell
Pyrite | Level 9

Plain code

sbxkoenk
SAS Super FREQ

Plain code, hmmmm ....

That's a pity; Enterprise Miner and Model Studio offer this capability without any coding.

 

And unfortunately,

PROC RANK has no CODE statement, no STORE statement and no SCORE statement.

 

You need one of these statements in order to be able to produce score code that you can apply to other datasets.

 

Does it matter whether you use PROC RANK or another procedure to do your equal-frequency binning?

I would use PROC HPBIN instead.


See here :
The essential guide to binning in SAS
By Rick Wicklin on The DO Loop August 7, 2019
https://blogs.sas.com/content/iml/2019/08/07/essential-guide-binning-sas.html
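A sketch of how the PROC HPBIN route could look. It assumes datasets TRAIN and SCORE with variables X and Z; the dataset names and the file path are placeholders. The CODE statement writes DATA step score code that carries the TRAIN boundaries, so the same binning can be replayed on any other dataset.

```sas
/* Assumption: TRAIN and SCORE both contain x and z; the path is a placeholder */
proc hpbin data=train numbin=10 pseudo_quantile output=train_binned;
   input x z;
   code file="/tmp/hpbin_score.sas";   /* DATA step score code with TRAIN cutoffs */
run;

/* Replay the TRAIN binning on new data */
data score_binned;
   set score;
   %include "/tmp/hpbin_score.sas";
run;
```

Open the generated file to confirm the names of the binned variables it creates before relying on them downstream.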

 

Koen

sbxkoenk
SAS Super FREQ

Hello,

 

Are you using Enterprise Miner or are you using Model Studio?
Which version? Enterprise Miner 15.2 in SAS 9.4 M7, Model Studio in VIYA 2021.1.4 (August 2021) ??

Tell us.

 

Or are you using plain code?

 

Thanks,

Koen

dcortell
Pyrite | Level 9

I was checking the PROC RANK output, and it seems the cutoff points for each rank are not provided as output. As a proxy for the cutoffs, I could extract the min and max values of each variable within each rank, but I'm curious whether there is a procedure or option in SAS that does this.
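The min/max proxy described above could be sketched like this, using the NEWDATA output and the X / DECILE_X pair from the PROC RANK step in the original question (RANGE_X, MIN_X and MAX_X are made-up names):

```sas
/* Observed range of x within each PROC RANK group */
proc means data=newdata noprint nway;
   class decile_x;
   var x;
   output out=range_x(drop=_type_ _freq_) min=min_x max=max_x;
run;

proc print data=range_x noobs;
run;
```

These are observed ranges, not true cutoffs: any value falling in the gap between one group's maximum and the next group's minimum has no defined group.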

sbxkoenk
SAS Super FREQ

See also my earlier response!
Try PROC HPBIN.
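A small sketch of getting the cutoffs out of PROC HPBIN, assuming a TRAIN dataset with X and Z (placeholder names). To my knowledge the ODS table holding the bin ranges is called Mapping; check the procedure's ODS table list if it is named differently in your release.

```sas
/* Assumption: TRAIN contains x and z; BIN_CUTOFFS is a made-up name */
proc hpbin data=train numbin=10 pseudo_quantile;
   input x z;
   ods output Mapping=bin_cutoffs;    /* bin ranges per variable */
run;

proc print data=bin_cutoffs noobs;
run;
```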

 

Koen
