I ran the following code several times and got different output. My SAS/STAT version is 15.1. Is the NODESTATS= option incompatible with this version?
proc hpsplit data=train leafsize=2213;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;
Hello,
Which version of SAS are you using? Find out by submitting:
%PUT &=sysvlong;
I suppose you will always get the same result if you specify a seed. From the documentation:
SEED= specifies the random number seed to use for cross validation.
For example:
proc hpsplit data=train leafsize=2213 seed=1014;
Kind regards,
Koen
Hello @su35 ,
This is the general definition for a seed in SAS.
No, there's no general rule for setting the seed.
It can be any strictly positive (>0) integer.
You set the seed only to get a reproducible result.
But the seed is not (or should not be) important for the final model. I mean, whatever the seed, the resulting models should be very comparable (not identical, but very comparable). At least this is the case for good models that capture the underlying pattern in the data well. Hence, the seed is not an important factor, and many people just use 12345.
If different seeds result in very different models there's a problem somewhere I would say!!
Of course if there's one family of models that could "suffer" a bit from this seed-selection it is TREES because their response surface is so discrete (not smooth). When your age is X-years minus one day you branch to the left and if your age is X-years you branch to the right and both cases might end up in leaves with a significant difference in (predicted) response value.
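One way to check whether different seeds really give comparable trees is to refit the same model over a handful of seeds and compare the size of each NODESTATS= data set. The following is only a sketch: the macro name seed_check, the list of seeds, and the hp_tree_<seed> data set names are made up for illustration, and it assumes the output data sets are written to WORK.

```sas
/* Hypothetical helper: refit the tree once per seed and keep each NODESTATS= output */
%macro seed_check(seeds=1111 1113 12345);
   %local i seed;
   %do i = 1 %to %sysfunc(countw(&seeds, %str( )));
      %let seed = %scan(&seeds, &i, %str( ));
      proc hpsplit data=train leafsize=2213 seed=&seed;
         model loan_status = mths_since_last_delinq;
         output nodestats=work.hp_tree_&seed;
      run;
   %end;
%mend seed_check;

%seed_check()

/* One row per seed: NOBS is the number of nodes in that tree */
proc sql;
   select memname, nobs
   from dictionary.tables
   where libname = 'WORK' and memname like 'HP_TREE_%';
quit;
```

If most seeds give a similar node count and only a few collapse to a single node, the cross-validated pruning is probably sitting close to a tie between "no split" and a larger tree.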
Kind regards,
Koen
Hello @su35 ,
That's very weird.
Such a big discrepancy between results may happen in exceptional cases, but it is remarkable that you bumped into two seeds where it does. Are you sure everything is OK with the data? Do you have enough observations?
Anyway, I would get rid of the cross-validation (CV), as your goal is just to discretize one interval variable (or collapse the levels of one nominal / ordinal variable). Without CV, no seed is involved:
PROC HPSPLIT CVMETHOD=NONE ...;
...
run;
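Applied to the code from the original question, that could look like the sketch below. It assumes your release accepts CVMETHOD=NONE on the PROC statement as written above; how the tree is pruned once cross-validation is switched off depends on the procedure's defaults, so check the resulting tree size.

```sas
/* Same model as in the question, but with cross-validation disabled (no SEED= needed) */
proc hpsplit data=train leafsize=2213 cvmethod=none;
   model loan_status = mths_since_last_delinq;
   output nodestats=hp_tree;
run;
```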
Cheers,
Koen
@su35 wrote:
I use PROC HPSPLIT to discretize the interval variables and to collapse the levels of the ordinal and nominal variables. I run the following code:
proc hpsplit data=train leafsize=2213 seed=;
model loan_status =mths_since_last_delinq;
output nodestats=hp_tree;
run;
If seed=1113, mths_since_last_delinq is split into 7 bins. If seed=1111, mths_since_last_delinq is not split at all.
Regards,
Jun
Show LOG from the run you made where it "couldn't split". Copy the text for the entire Proc HPSPLIT plus any notes, warnings or other messages. Then open a text box on the forum with the </> icon and paste the text. The text box is important to preserve text formatting of any diagnostics that SAS places in the log. The message windows on this forum reformat text and may make the diagnostics less useful or hard to read properly.
From documentation on using random number functions :
Seed Values
Random-number functions and CALL routines generate streams of pseudo-random numbers from an initial starting point, called a seed, that either the user or the computer clock supplies. A seed must be a nonnegative integer with a value less than 2^31–1 (or 2,147,483,647). If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step. If you use zero as the seed, the computer clock initializes the stream, and the stream of random numbers cannot be replicated.
Which value to set is your decision.
7877  proc hpsplit data=train leafsize=2213 assignmissing=none seed=1111;
7878  model loan_status =mths_since_last_delinq;
7879  output nodestats=work.hp_tree;
7880  run;
NOTE: The HPSPLIT procedure is executing in single-machine mode.
NOTE: Cross-validating using 10 folds.
NOTE: There were 44249 observations read from the data set LOANRISK.TRAIN.
NOTE: The data set WORK.HP_TREE has 1 observations and 25 variables.
NOTE: PROCEDURE HPSPLIT used (Total process time):
      real time           1.36 seconds
      cpu time            0.92 seconds

7881  proc hpsplit data=train leafsize=2213 assignmissing=none seed=1113;
7882  model loan_status =mths_since_last_delinq;
7883  output nodestats=work.hp_tree;
7884  run;
NOTE: The HPSPLIT procedure is executing in single-machine mode.
NOTE: Cross-validating using 10 folds.
NOTE: There were 44249 observations read from the data set LOANRISK.TRAIN.
NOTE: The data set WORK.HP_TREE has 15 observations and 25 variables.
NOTE: PROCEDURE HPSPLIT used (Total process time):
      real time           1.36 seconds
      cpu time            1.00 seconds

From the log above, we can see that with seed=1111 work.hp_tree has 1 observation, but with seed=1113 it has 15 observations.
Hello @su35 ,
You have enough observations (44,249).
What's the cardinality of the input variable "mths_since_last_delinq"? In other words, how many distinct levels (distinct values) does it have? You can find out with PROC FREQ, PROC SQL, or PROC CARDINALITY (the latter procedure exists only in Viya, not in SAS 9.4).
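For example, either of the following should report the number of distinct levels. This is a minimal sketch using the data set and variable names from this thread; the column aliases n_levels and n_missing are just illustrative.

```sas
/* Count distinct non-missing levels and the number of missing values */
proc sql;
   select count(distinct mths_since_last_delinq) as n_levels,
          nmiss(mths_since_last_delinq)          as n_missing
   from train;
quit;

/* Alternative: the NLEVELS option prints a "Number of Variable Levels" table */
proc freq data=train nlevels;
   tables mths_since_last_delinq / noprint;
run;
```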
Cheers,
Koen
Hello @su35 ,
OK, 107 distinct levels (+1 level for missing; I guess these are accounts that never had a delinquency) is enough to treat that variable as an interval input.
It is very strange that you stumbled upon two random seeds that give such different results.
I guess most of the other seeds you can think of (any number > 0) will result in either solution 1 or solution 2, no?
But again, you can also work without cross-validation: no seed is needed and you always get the same solution, unless you are doing heavy distributed processing, in which case minor differences are possible.
Heavy distributed processing, as you can do in SAS Viya, does not always give deterministic results.
Cheers,
Koen