su35
Obsidian | Level 7

I ran the following code several times and got different output each time. My SAS/STAT version is 15.1. Is the NODESTATS= option incompatible with this version?

 

proc hpsplit data=train leafsize=2213;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;

 

 


sbxkoenk
SAS Super FREQ

Hello,

Which version of SAS are you using? Find out by submitting:

%PUT &=sysvlong;

 

I suppose you will always get the same result if you specify a seed:

SEED= 

Specifies the random number seed to use for cross validation

like

proc hpsplit data=train leafsize=2213 seed=1014;

 

Kind regards,

Koen

su35
Obsidian | Level 7
Thanks, Koen.
Your solution works. However, when I tried a different seed, such as 1234, I got different results. So what is the role of the SEED= option, and is there a rule for choosing a seed?

Thanks
Jun

sbxkoenk
SAS Super FREQ

Hello @su35 ,

 

This is the general definition for a seed in SAS.

seed = an initial value from which a random number function or CALL routine calculates a random value.
In k-fold cross-validation (used in HPSPLIT), the data have to be split into k distinct sets with (about) equal numbers of observations.
(I don't know the exact value of k in HPSPLIT, but it's not relevant to your question.)
This data split in k sets is done using a (pseudo-) random number generator.
The (pseudo-) random number generator uses a strictly positive seed for initialization.
Using the same seed ensures reproducibility of the random number series, using a different seed results in a different set of random numbers.
Using NO seed means the seed will default to the computer clock time which is always different for consecutive runs. That's why you got different results for PROC HPSPLIT in subsequent runs when not using a seed.
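To make the role of the seed concrete, here is a minimal sketch of a seeded k-fold assignment in a DATA step. It only illustrates the idea and is not how HPSPLIT assigns folds internally; the output data set name folds, the variable name fold, and the choice of 10 folds are assumptions made for illustration.

```sas
/* Illustrative only: assign each row of TRAIN to one of 10 folds. */
data folds;
   set train;
   /* Initialize the random stream once. The same positive seed  */
   /* reproduces the same stream, so rerunning this step on the  */
   /* same data gives the same fold assignment.                  */
   if _n_ = 1 then call streaminit(1014);
   fold = ceil(10 * rand('uniform'));   /* integer from 1 to 10 */
run;

/* Check that the folds are of roughly equal size. */
proc freq data=folds;
   tables fold;
run;
```

Changing 1014 to another positive value puts different rows in each fold, which is why a cross-validated tree can differ from seed to seed.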
Kind regards,
Koen
su35
Obsidian | Level 7
If the result is dependent on the seed, is there a general rule to set the seed?
su35
Obsidian | Level 7
odd/even?
sbxkoenk
SAS Super FREQ

No, there's no general rule to set the seed.

It can be any strictly positive (>0) number.

You just set the seed to get a reproducible result.

But the seed is not, or at least should not be, important for the final model. I mean, whatever the seed, the resulting models will always be very comparable (not identical, but very comparable). At least this is the case if these are good models that capture the underlying pattern in the data well. Hence, the seed is not an important factor; many people just use 12345.

If different seeds result in very different models there's a problem somewhere I would say!!

Of course if there's one family of models that could "suffer" a bit from this seed-selection it is TREES because their response surface is so discrete (not smooth). When your age is X-years minus one day you branch to the left and if your age is X-years you branch to the right and both cases might end up in leaves with a significant difference in (predicted) response value.

Kind regards,

Koen

su35
Obsidian | Level 7
I use PROC HPSPLIT to discretize the interval variables and to collapse the levels of the ordinal and nominal variables. I ran the following code:
proc hpsplit data=train leafsize=2213 seed=;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;
If seed=1113, then mths_since_last_delinq is split into 7 bins. If seed=1111, then mths_since_last_delinq couldn't be split.
Regards,
Jun
sbxkoenk
SAS Super FREQ

Hello @su35 ,

 

That's very weird. 

Such a big discrepancy between results can happen in exceptional cases, but it is remarkable that you just happened to bump into 2 random seeds where it does. Are you sure everything is OK with the data? Do you have enough observations?

 

Anyway, I would get rid of the cross-validation (CV), as your goal is just to discretize one interval variable (or to collapse the levels of one nominal / ordinal variable). Without CV, no seed is involved:

PROC HPSPLIT CVMETHOD=NONE ...;
...
run;
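
Applied to the step posted earlier in this thread, that suggestion might look like the sketch below. It assumes that CVMETHOD=NONE is accepted on the PROC HPSPLIT statement in your release, as suggested above, and that the data set and variable names are the ones from the original question; check the procedure documentation for how pruning behaves without cross-validation.

```sas
/* Sketch: same model as in the question, with cross-validation   */
/* switched off, so no SEED= option should be needed.             */
proc hpsplit data=train leafsize=2213 cvmethod=none;
   model loan_status = mths_since_last_delinq;
   output nodestats=work.hp_tree;
run;
```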

Cheers,

Koen

ballardw
Super User

@su35 wrote:
I use PROC HPSPLIT to discretize the interval variables and to collapse the levels of the ordinal and nominal variables. I ran the following code:
proc hpsplit data=train leafsize=2213 seed=;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;
If seed=1113, then mths_since_last_delinq is split into 7 bins. If seed=1111, then mths_since_last_delinq couldn't be split.
Regards,
Jun

Show the LOG from the run where it "couldn't split". Copy the text for the entire PROC HPSPLIT step plus any notes, warnings, or other messages. Then open a text box on the forum with the </> icon and paste the text. The text box is important because it preserves the formatting of any diagnostics that SAS places in the log. The message windows on this forum reformat text and may make the diagnostics less useful or harder to read.

ballardw
Super User

From documentation on using random number functions :

Seed Values

Random-number functions and CALL routines generate streams of pseudo-random numbers from an initial starting point, called a seed, that either the user or the computer clock supplies. A seed must be a nonnegative integer with a value less than 2^31 - 1 (or 2,147,483,647). If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step. If you use zero as the seed, the computer clock initializes the stream, and the stream of random numbers cannot be replicated.

 

 

Which value to set is your decision.


 

 

su35
Obsidian | Level 7
7877   proc hpsplit data=train leafsize=2213 assignmissing=none seed=1111;
7878   model loan_status =mths_since_last_delinq;
7879   output nodestats=work.hp_tree;
7880   run;

NOTE: The HPSPLIT procedure is executing in single-machine mode.
NOTE: Cross-validating using 10 folds.
NOTE: There were 44249 observations read from the data set LOANRISK.TRAIN.
NOTE: The data set WORK.HP_TREE has 1 observations and 25 variables.
NOTE: PROCEDURE HPSPLIT used (Total process time):
      real time           1.36 seconds
      cpu time            0.92 seconds


7881   proc hpsplit data=train leafsize=2213 assignmissing=none seed=1113;
7882   model loan_status =mths_since_last_delinq;
7883   output nodestats=work.hp_tree;
7884   run;

NOTE: The HPSPLIT procedure is executing in single-machine mode.
NOTE: Cross-validating using 10 folds.
NOTE: There were 44249 observations read from the data set LOANRISK.TRAIN.
NOTE: The data set WORK.HP_TREE has 15 observations and 25 variables.
NOTE: PROCEDURE HPSPLIT used (Total process time):
      real time           1.36 seconds
      cpu time            1.00 seconds

From the log above, we can see that when seed=1111, work.hp_tree has 1 observation, but when seed=1113, work.hp_tree has 15 observations.

sbxkoenk
SAS Super FREQ

Hello @su35 ,

 

You have enough observations (44,249).

What's the cardinality of the input variable "mths_since_last_delinq"? In other words, how many distinct levels (distinct values) does it have? You can find out with PROC FREQ, PROC SQL, or PROC CARDINALITY (the latter procedure exists only in SAS Viya, not in SAS 9.4).
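
For example, a minimal sketch of both checks, assuming the data set and variable names from the posted code (the column aliases n_levels, n_missing, and n_obs are just illustrative):

```sas
/* PROC SQL: COUNT(DISTINCT ...) ignores missing values, so the */
/* missing count is reported separately with NMISS.             */
proc sql;
   select count(distinct mths_since_last_delinq) as n_levels,
          nmiss(mths_since_last_delinq)          as n_missing,
          count(*)                               as n_obs
   from train;
quit;

/* PROC FREQ: the NLEVELS option prints the number of levels;   */
/* NOPRINT on the TABLES statement suppresses the long table.   */
proc freq data=train nlevels;
   tables mths_since_last_delinq / noprint;
run;
```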

 

Cheers,

Koen

su35
Obsidian | Level 7
The variable "mths_since_last_delinq" is a count of months; it has 107 distinct levels and 48% missing values. I treat it as an interval variable.
sbxkoenk
SAS Super FREQ

Hello @su35 ,

 

OK, 107 distinct levels (+1 level for missing; I guess these are accounts that never had a delinquency) is enough to treat that variable as an interval input.

 

Very strange that you stumbled upon 2 random seeds that give such different results.

I guess most of the other seeds you can imagine (any number > 0) will result in solution 1 or solution 2, no?

  • If 10 more seeds give you the split, then that split should be done.
  • If these 10 more seeds result in no split, then no split should be done.
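
A rough way to run that check is to loop over a list of seeds and count the rows in the NODESTATS= data set each time. The macro name seed_check, its seeds= parameter, and the particular seed values below are hypothetical; the PROC HPSPLIT step is the one posted earlier in this thread.

```sas
%macro seed_check(seeds=);
   %local i seed n_nodes;
   %do i = 1 %to %sysfunc(countw(&seeds, %str( )));
      %let seed = %scan(&seeds, &i, %str( ));

      /* Same model as in the question, with one seed from the list. */
      proc hpsplit data=train leafsize=2213 seed=&seed;
         model loan_status = mths_since_last_delinq;
         output nodestats=work.hp_tree;
      run;

      /* One row per node: 1 row means the variable was not split. */
      proc sql noprint;
         select count(*) into :n_nodes trimmed from work.hp_tree;
      quit;

      %put NOTE: seed=&seed produced &n_nodes node(s).;
   %end;
%mend seed_check;

%seed_check(seeds=1111 1113 12345 2021 31415 27182 4242 777 9001 65537)
```

The log then shows, seed by seed, whether the tree kept any split, so you can see which of the two outcomes is the common one.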

But again, you can also work without cross-validation: no seed is needed, and you always get the same solution, unless you are doing heavy distributed processing, in which case minor differences are possible.

Heavy distributed processing, as you can do in SAS Viya, does not always give deterministic results.

 

Cheers,

Koen
