Solved: Why the output of the proc hpsplit is uncertain

su35 · Posted 05-22-2021 10:11 AM

I ran the following code several times and got different output. The SAS/STAT version is 15.1. Is the nodestats= option incompatible with this version?

proc hpsplit data=train leafsize=2213;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;

sbxkoenk · Posted 05-22-2021 10:28 AM

Hello,

Which version of SAS are you using? Find out by submitting:

%PUT &=sysvlong;

I suppose you will always get the same result if you specify a seed:

SEED=

Specifies the random number seed to use for cross validation

like

proc hpsplit data=train leafsize=2213 seed=1014;

Kind regards,

Koen

su35 · Posted 05-22-2021 01:13 PM

Thanks Koen.
Your solution works. But when I tried different seeds, such as 1234, I got different results. So what is the role of the seed option? What is the rule for selecting a seed?

Thanks
Jun

sbxkoenk · Posted 05-22-2021 02:20 PM

Hello @su35 ,

This is the general definition for a seed in SAS.

seed = an initial value from which a random number function or CALL routine calculates a random value.

In k-fold cross-validation (used in HPSPLIT) the data have to be split in k distinct sets with (about) equal n° of observations. (I don't know about the exact value of k in HPSPLIT; it's not relevant to your question.)

This data split in k sets is done using a (pseudo-) random number generator.

The (pseudo-) random number generator uses a strictly positive seed for initialization.

Using the same seed ensures reproducibility of the random number series, using a different seed results in a different set of random numbers.

Using NO seed means the seed will default to the computer clock time which is always different for consecutive runs. That's why you got different results for PROC HPSPLIT in subsequent runs when not using a seed.

Kind regards,

Koen
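
To make the role of the seed concrete, here is a minimal sketch of a seeded random fold assignment in a DATA step. It only illustrates the idea described above; it is not how HPSPLIT assigns folds internally, and the data set name work.folds, the variable names, the 20 rows and the 10 folds are assumptions chosen for the illustration.

```sas
/* Illustration only: assign 20 rows at random to 10 folds.      */
/* This is NOT HPSPLIT's internal code; all names are made up.   */
data work.folds;
  call streaminit(1014);                 /* same seed => same random stream */
  do obs_id = 1 to 20;
    fold = ceil(10 * rand('uniform'));   /* integer fold label from 1 to 10 */
    output;
  end;
run;

/* How many rows landed in each fold */
proc freq data=work.folds;
  tables fold;
run;
```

Rerunning this step with the same seed reproduces the same fold labels. Changing 1014 to another positive value gives a different assignment, and therefore different training and validation subsets for anything fitted on those folds.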

su35 · Posted 05-22-2021 02:39 PM

If the result is dependent on the seed, is there a general rule to set the seed?

su35 · Posted 05-22-2021 02:46 PM

odd/even?

sbxkoenk · Posted 05-22-2021 03:01 PM

No, there's no general rule to set the seed.

It can be any strictly positive (>0) number.

You just set the seed to get a reproducible result.

But the seed is / should not be important for the final model. I mean, whatever the seed, the resulting models will always be very comparable (not identical but very comparable). At least this is the case if these are good models that capture well the underlying pattern in the data. Hence, the seed is not an important factor, many people just use 12345.

If different seeds result in very different models there's a problem somewhere I would say!!

Of course if there's one family of models that could "suffer" a bit from this seed-selection it is TREES because their response surface is so discrete (not smooth). When your age is X-years minus one day you branch to the left and if your age is X-years you branch to the right and both cases might end up in leaves with a significant difference in (predicted) response value.

Kind regards,

Koen

su35 · Posted 05-22-2021 03:40 PM

I use proc hpsplit to discretize interval variables and to collapse the levels of ordinal and nominal variables. I ran the following code:
proc hpsplit data=train leafsize=2213 seed=;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;
With seed=1113, mths_since_last_delinq is split into 7 bins. With seed=1111, mths_since_last_delinq is not split at all.
Regards,
Jun

sbxkoenk · Posted 05-23-2021 10:00 AM

Hello @su35 ,

That's very weird.

It may happen exceptionally (this 'big' discrepancy between results), but the fact that you just bump into 2 random seeds where this happens is remarkable. Are you sure everything is OK with the data? Do you have enough observations?

Anyway, I would get rid of the cross-validation (CV), as your goal is just to discretize one interval variable (or collapse the levels of one nominal / ordinal variable). Without CV there is no seed involved:

PROC HPSPLIT CVMETHOD=NONE ...;
...
run;

Cheers,

Koen

ballardw · Posted 05-23-2021 10:32 PM

@su35 wrote:
I use proc hpsplit to discretize interval variables and to collapse the levels of ordinal and nominal variables. I ran the following code:
proc hpsplit data=train leafsize=2213 seed=;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;
With seed=1113, mths_since_last_delinq is split into 7 bins. With seed=1111, mths_since_last_delinq is not split at all.
Regards,
Jun

Show the LOG from the run where it "couldn't split". Copy the text for the entire Proc HPSPLIT step plus any notes, warnings or other messages. Then open a text box on the forum with the </> icon and paste the text. The text box is important to preserve the formatting of any diagnostics that SAS places in the log. The message windows on this forum reformat text and may make the diagnostics less useful or hard to read properly.

ballardw · Posted 05-23-2021 10:29 PM

From documentation on using random number functions :

Seed Values

Random-number functions and CALL routines generate streams of pseudo-random numbers from an initial starting point, called a seed, that either the user or the computer clock supplies. A seed must be a nonnegative integer with a value less than 2³¹–1 (or 2,147,483,647). If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step. If you use zero as the seed, the computer clock initializes the stream, and the stream of random numbers cannot be replicated.

Which value to set is your decision.

su35 · Posted 05-24-2021 01:08 PM

7877   proc hpsplit data=train leafsize=2213 assignmissing=none seed=1111;
7878   model loan_status =mths_since_last_delinq;
7879   output nodestats=work.hp_tree;
7880   run;

NOTE: The HPSPLIT procedure is executing in single-machine mode.
NOTE: Cross-validating using 10 folds.
NOTE: There were 44249 observations read from the data set LOANRISK.TRAIN.
NOTE: The data set WORK.HP_TREE has 1 observations and 25 variables.
NOTE: PROCEDURE HPSPLIT used (Total process time):
      real time           1.36 seconds
      cpu time            0.92 seconds


7881   proc hpsplit data=train leafsize=2213 assignmissing=none seed=1113;
7882   model loan_status =mths_since_last_delinq;
7883   output nodestats=work.hp_tree;
7884   run;

NOTE: The HPSPLIT procedure is executing in single-machine mode.
NOTE: Cross-validating using 10 folds.
NOTE: There were 44249 observations read from the data set LOANRISK.TRAIN.
NOTE: The data set WORK.HP_TREE has 15 observations and 25 variables.
NOTE: PROCEDURE HPSPLIT used (Total process time):
      real time           1.36 seconds
      cpu time            1.00 seconds

From the log above, we can see that with seed=1111 work.hp_tree has one observation, but with seed=1113 it has 15 observations.

sbxkoenk · Posted 05-25-2021 06:07 AM

Hello @su35 ,

You have enough observations (44,249).

What's the cardinality of the input variable "mths_since_last_delinq"? In other words, how many distinct levels (distinct values) does it have? You can find out with PROC FREQ or PROC SQL or PROC CARDINALITY (the latter procedure exists only in VIYA, not in SAS 9.4).
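
For example, either of the following should report the number of distinct levels. This is a sketch that assumes the data set is called train, as in the code posted earlier in this thread; the column aliases are arbitrary.

```sas
/* NLEVELS adds a table with the number of levels,            */
/* reporting missing and nonmissing levels separately.        */
proc freq data=train nlevels;
  tables mths_since_last_delinq / noprint;   /* suppress the long frequency table */
run;

/* COUNT(DISTINCT ...) ignores missing values; NMISS counts them. */
proc sql;
  select count(distinct mths_since_last_delinq) as n_levels,
         nmiss(mths_since_last_delinq)          as n_missing,
         count(*)                               as n_obs
  from train;
quit;
```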

Cheers,

Koen

su35 · Posted 05-26-2021 09:11 AM

"mths_since_last_delinq" is a count of months. It has 107 distinct levels and 48% missing values. I treat it as an interval variable.

sbxkoenk · Posted 05-26-2021 12:53 PM

Hello @su35 ,

OK, 107 distinct levels (+1 level for missing, … I guess these are accounts which never had a delinquency) is enough to consider that variable as an interval input.

Very strange that you stumbled upon 2 random seeds which give such different results.

I guess most of the other seeds you can imagine (any number > 0) will result in solution 1 or solution 2, no?

If 10 more seeds give you the split, then that split should be done.
If these 10 more seeds result in no split, then no split should be done.
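
A rough way to run that check is a small macro loop. This is only a sketch: the macro name try_seeds and the seed list are made up for the illustration (1111, 1113, 1234 and 12345 appear earlier in this thread, 2021 is arbitrary), and it assumes the NODESTATS= data set holds one row per node, which is what the 1 versus 15 observations in the posted log suggest.

```sas
/* Hypothetical helper: refit the same tree under several seeds */
/* and write the node count for each seed to the log.           */
%macro try_seeds(seeds=1111 1113 1234 12345 2021);
  %local i seed n_nodes;
  %let i = 1;
  %let seed = %scan(&seeds, &i, %str( ));
  %do %while (%length(&seed) > 0);

    proc hpsplit data=train leafsize=2213 seed=&seed;
      model loan_status = mths_since_last_delinq;
      output nodestats=work.hp_tree;
    run;

    /* Assumption: one row per node in the NODESTATS= data set */
    proc sql noprint;
      select count(*) into :n_nodes trimmed from work.hp_tree;
    quit;
    %put NOTE: seed=&seed gives &n_nodes node(s).;

    %let i = %eval(&i + 1);
    %let seed = %scan(&seeds, &i, %str( ));
  %end;
%mend try_seeds;

%try_seeds()
```

A count of 1 means only the root node is left, so no split was kept for that seed.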

But again you can also work without cross-validation (no seed needed and always the same solution unless you are heavily doing distributed processing, then minor differences might be possible).

Heavy distributed processing, like you can do in SAS VIYA, does not always give you deterministic results.

Cheers,

Koen
