Solved: Feature discretization Score set

dcortell · Posted 08-05-2022 10:26 AM

Hi all

I'm training an NB model on continuous features that need equal-frequency discretization before they can be used.

I plan to use PROC RANK for this:

proc rank data=mydata groups=10 out=newdata;
var x z;
ranks decile_x decile_z;
run;

My question is whether the discretization needs to be performed

- separately for the train and score sets

- on the train and score sets appended together

The second approach seems natural to me, because the train and score sets can have a different distribution for each variable; binning them separately would generate different deciles and therefore different discretization results for the same variable values in the two datasets.

However, I have to admit I'm not finding much material on this topic, so I'd like to ask the community whether there is any knowledge to share or links to consult.

bests

sbxkoenk · Posted 08-05-2022 11:45 AM

Hello @dcortell ,

If you make your continuous variables discrete by making deciles, you should do that for the training data only (or better said, on the basis of the training data only).

Then apply the same discretization (same boundaries as calculated for TRAIN) to validation, test and score datasets.

Of course, for validation, test and score data sets, this is no longer an exact equal frequency binning. But it should be close to that. If not, your analysis is not "robust" enough.
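To make "same boundaries as calculated for TRAIN" concrete, here is a minimal sketch for a single variable. It assumes hypothetical data sets named train and score that both contain x, and it uses PROC UNIVARIATE percentiles as the cutpoints, which is one possible choice and not necessarily identical to how PROC RANK breaks ties.

```sas
/* Step 1: compute the 9 decile boundaries on the TRAIN data only.  */
/* The output data set has one row with variables x_p10 ... x_p90.  */
proc univariate data=train noprint;
   var x;
   output out=cuts_x pctlpts=10 to 90 by 10 pctlpre=x_p;
run;

/* Step 2: apply those same boundaries to the data to be scored.    */
data score_binned;
   if _n_ = 1 then set cuts_x;      /* one-row data set, retained    */
   set score;
   array bnd{9} x_p10 x_p20 x_p30 x_p40 x_p50 x_p60 x_p70 x_p80 x_p90;
   if missing(x) then decile_x = .;
   else do;
      decile_x = 0;                 /* bins numbered 0..9, as in PROC RANK GROUPS=10 */
      do i = 1 to 9;
         if x > bnd{i} then decile_x = i;
      end;
   end;
   drop i x_p:;
run;
```

Score values below the smallest or above the largest training value simply land in bin 0 or bin 9, so every non-missing value gets a bin.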

Good luck,

Koen

dcortell · Posted 08-05-2022 12:01 PM

When you say "Of course, for validation, test and score data sets, this is no longer an exact equal frequency binning", do you mean that some values appearing in the score set may not be present in the original train set, so that no derived boundary includes them and they therefore get no decile assignment?

sbxkoenk · Posted 08-05-2022 12:57 PM

Hello,

What I mean is :

If you do an equal-frequency binning on the training set for variable X1, then you make groups with an equal number of observations (which is not the case with equal-interval binning).

For 10 groups (deciles), you have 9 boundaries.

If you discretize variable X1 in the VALID / TEST / SCORE datasets with these same 9 boundaries (calculated on the TRAIN set), you will no longer have equally sized groups. However, the "bias" should NOT be too big; otherwise TRAIN, VALID and TEST are too different from each other, and that's not good for building a model. The distribution of the inputs should be more or less the same in all data sets.
That's why you calculate for example a population stability index (PSI) when scoring a new set (batch scoring). If there's a shift in the distribution(s), your model may not be valid anymore for the new data and may perform poorly.
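As a rough sketch of such a PSI check for one binned variable: the code below assumes hypothetical data sets train_binned and score_binned that both carry a bin variable decile_x built with the TRAIN boundaries. It uses the usual definition, the sum over bins of (score share - train share) * log(score share / train share).

```sas
/* Share of observations per bin in each data set (PERCENT is 0-100) */
proc freq data=train_binned noprint;
   tables decile_x / out=train_pct(drop=count rename=(percent=pct_train));
run;

proc freq data=score_binned noprint;
   tables decile_x / out=score_pct(drop=count rename=(percent=pct_score));
run;

/* Accumulate the PSI over the bins and keep only the final total */
data psi_x;
   merge train_pct score_pct end=last;
   by decile_x;
   p_train = pct_train / 100;
   p_score = pct_score / 100;
   /* A bin that is empty in either data set makes the log undefined; */
   /* such bins are skipped here, which is a simplification.          */
   if p_train > 0 and p_score > 0 then
      psi + (p_score - p_train) * log(p_score / p_train);
   if last then output;
   keep psi;
run;
```

A PSI near zero means the score data fall into the training bins in roughly the same proportions; larger values point to a shift in the distribution.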

Cheers,

Koen

dcortell · Posted 08-05-2022 12:10 PM

Also, any material to share on the topic?

sbxkoenk · Posted 08-05-2022 01:01 PM

Are you using Enterprise Miner or are you using Model Studio?
Which version? Enterprise Miner 15.2 , Model Studio in VIYA 2021.1.4 (August 2021)?

Or are you using plain code?

Thanks,

Koen

dcortell · Posted 08-05-2022 01:04 PM

Plain code

sbxkoenk · Posted 08-05-2022 01:20 PM

Plain code, hmmmm ....

That's a pity; EMiner and Model Studio offer this capability code-free.

And unfortunately,

PROC RANK has no CODE statement, no STORE statement and no SCORE statement.

You need one of these statements in order to be able to produce score code that you can apply to other datasets.

Does it matter whether you use PROC RANK or another procedure to do your equal-frequency binning?

I would use PROC HPBIN instead.

See here :
The essential guide to binning in SAS
By Rick Wicklin on The DO Loop August 7, 2019
https://blogs.sas.com/content/iml/2019/08/07/essential-guide-binning-sas.html
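A minimal sketch of the PROC HPBIN route, assuming hypothetical data sets train and score that contain x and z, and a hypothetical file name for the generated score code. PSEUDO_QUANTILE gives approximate equal-frequency bins; check the HPBIN documentation for the exact names of the generated bin variables.

```sas
/* Learn the bin boundaries on the TRAIN data only and save them as score code */
proc hpbin data=train numbin=10 pseudo_quantile output=train_binned;
   input x z;
   code file="bin_score.sas";        /* hypothetical path, adjust as needed */
   ods output Mapping=bin_map;       /* table of bin boundaries per variable */
run;

/* Apply the same boundaries to the data to be scored */
data score_binned;
   set score;
   %include "bin_score.sas";
run;
```

The same %INCLUDE step can be reused for validation and test data, so all data sets are binned with the boundaries derived from TRAIN.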

Koen

sbxkoenk · Posted 08-05-2022 01:02 PM

Hello,

Are you using Enterprise Miner or are you using Model Studio?
Which version? Enterprise Miner 15.2 in SAS 9.4 M7, Model Studio in VIYA 2021.1.4 (August 2021) ??

Tell us.

Or are you using plain code?

Thanks,

Koen

dcortell · Posted 08-05-2022 01:15 PM

I was checking the PROC RANK output, and it seems the cutoff points for each rank are not provided. As a proxy for the cutoffs, I could extract the min and max values of each variable within each rank, but I'm curious whether there is a procedure or option in SAS that does this.
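The min/max proxy could be sketched like this for x, using the newdata output and the decile_x rank variable from my PROC RANK step above (cutoffs_x, min_x and max_x are just illustrative names):

```sas
/* Smallest and largest x within each rank group, as approximate cutoffs */
proc means data=newdata noprint nway;
   class decile_x;
   var x;
   output out=cutoffs_x(drop=_type_ _freq_) min=min_x max=max_x;
run;

proc print data=cutoffs_x noobs;
run;
```

The true boundary between two adjacent groups lies somewhere between max_x of the lower group and min_x of the upper one, so this only brackets the cutoffs.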

sbxkoenk · Posted 08-05-2022 01:22 PM

See also my earlier response !!
Try PROC HPBIN.

Koen
