Hey, I am trying to run PROC HPSPLIT and customize the classification so that predicted = 1 is assigned to observations with P_FLAG1 >= 0.7 instead of 0.5. Any idea how I can make this happen? In addition, I also want to check my outcomes on the validation set created with PARTITION FRACTION(VALIDATE=0.3 SEED=42);
Alternatively, could you help me create a flag that tells me which observations were used for training and which for validation? The output data doesn't show this.
Thanks,
Hello @artyomkosyan and welcome to the SAS Support Communities!
Let me first say that I have very little experience with PROC HPSPLIT.
Does the last section of Example 67.1 Building a Classification Tree for a Binary Outcome (scroll down to the bottom of the page) answer your first question? In that example the probability cutoff is changed from 0.5 to 0.1.
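For the 0.7 cutoff from the question, here is a minimal sketch (not taken from the documentation example). It assumes the OUTPUT OUT= dataset is called HPSOUT and contains the predicted probability P_FLAG1 mentioned in the question; HPSOUT_CUT and PRED_FLAG are made-up names.

```sas
/* Sketch: derive a custom 0/1 prediction from the event probability.  */
/* HPSOUT, HPSOUT_CUT and PRED_FLAG are hypothetical names; P_FLAG1 is */
/* the probability variable named in the question.                     */
data hpsout_cut;
   set hpsout;
   /* A missing value compares as smaller than 0.7, so handle it explicitly. */
   if missing(P_FLAG1) then pred_flag = .;
   else pred_flag = (P_FLAG1 >= 0.7);   /* 1 if probability >= 0.7, else 0 */
run;
```

Changing 0.7 to any other value gives a different cutoff without refitting the tree.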
As to your second question, I was surprised to see that, indeed, the OUT= output dataset (or any other output dataset) does not contain the information about the training-validation split. But a naïve guess was successful with the HMEQ sample dataset used in Example 67.4 Creating a Binary Classification Tree with Validation Data:
I created an input dataset HMEQ2 containing the split information
data hmeq2;
call streaminit(123);
set hmeq;
_partind_=rand('bern',0.3);
run;
and compared the results (both printed output and all three output files) of the original code from Example 67.4 (with an OUTPUT statement added)
proc hpsplit data=hmeq maxdepth=5;
   class Bad Delinq Derog Job nInq Reason;
   model Bad(event='1') = Delinq Derog Job nInq Reason CLAge CLNo
                          DebtInc Loan MortDue Value YoJ;
   prune costcomplexity;
   partition fraction(validate=0.3 seed=123);
   code file='hpsplexc.sas';
   rules file='rules.txt';
   output out=hpsout;
run;
to those obtained with the new input dataset and a correspondingly adapted PARTITION statement
proc hpsplit data=hmeq2 maxdepth=5;
   class Bad Delinq Derog Job nInq Reason;
   model Bad(event='1') = Delinq Derog Job nInq Reason CLAge CLNo
                          DebtInc Loan MortDue Value YoJ;
   prune costcomplexity;
   partition rolevar=_partind_(TRAIN='0' VALIDATE='1');
   code file='hpsplexc2.sas';
   rules file='rules2.txt';
   output out=hpsout2;
run;

proc compare data=hpsout c=hpsout2;
run;
The results were exactly identical in my single-machine environment (using 4 threads by default) using SAS/STAT 14.3. (I verified this also with a few other split probabilities and seed values.)
But to be 100% sure about which observations were used for training and which for validation you can, of course, use a dataset (like HMEQ2) containing a variable such as _PARTIND_ above in the first place.
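Putting both parts together, here is an untested sketch. It assumes that the second PROC HPSPLIT run above has written the score code file 'hpsplexc2.sas' and that this score code creates the probability variable P_Bad1 (by analogy with P_FLAG1 in the question). Because the DATA step reads HMEQ2, _PARTIND_ stays in the scored dataset, so the 0.7 cutoff can be checked separately for training (0) and validation (1) observations. SCORED2 and PRED_BAD are made-up names.

```sas
/* Sketch: score HMEQ2 with the generated score code, apply a 0.7 cutoff, */
/* and tabulate actual vs. predicted outcome within each partition.      */
/* Assumption: the score code in 'hpsplexc2.sas' creates P_Bad1.          */
data scored2;
   set hmeq2;                        /* input still carries _partind_ */
   %include 'hpsplexc2.sas';         /* score code from the CODE statement */
   if missing(P_Bad1) then pred_bad = .;
   else pred_bad = (P_Bad1 >= 0.7);  /* custom cutoff instead of 0.5 */
run;

/* One Bad-by-PRED_BAD table per value of _partind_ (0=train, 1=validate) */
proc freq data=scored2;
   tables _partind_*Bad*pred_bad / nocol nopercent;
run;
```

The table for _PARTIND_=1 is then the confusion matrix on the validation set at the 0.7 cutoff.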
Thanks @FreelanceReinh this is helpful.