artyomkosyan
Calcite | Level 5

Hey, I am trying to run PROC HPSPLIT and customize the classification so that predicted = 1 is assigned to observations with P_FLAG1 >= 0.7 instead of 0.5. Any idea how I can make this happen? In addition, I also want to check the outcomes for the validation set created by PARTITION FRACTION(VALIDATE=0.3 SEED=42);

An alternative solution would be a flag that tells me which observations were used for training and which for validation, as the output data set doesn't show it.

[Two screenshots attached: artyomkosyan_0-1664461435485.png, artyomkosyan_1-1664461492096.png]

Thanks,

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @artyomkosyan and welcome to the SAS Support Communities!

 

Let me first say that I have very little experience with PROC HPSPLIT.

 

Does the last section of Example 67.1 Building a Classification Tree for a Binary Outcome (scroll down to the bottom of the page) answer your first question? In that example the probability cutoff is changed from 0.5 to 0.1.
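For readers who want to apply a custom cutoff directly, here is a minimal sketch (not the code from Example 67.1). It assumes a scored data set named HPSOUT, such as one created by an OUTPUT statement, in which the predicted event probability is stored in a variable named P_Bad1. In the original question that variable would be P_FLAG1. The names hpsout_cut and pred_07 are made up for illustration.

```sas
/* Hypothetical sketch: classify with a 0.7 cutoff instead of 0.5.       */
/* Assumes HPSOUT holds the predicted event probability in P_Bad1;       */
/* check the actual variable name with PROC CONTENTS before relying on it. */
data hpsout_cut;
   set hpsout;
   pred_07 = (P_Bad1 >= 0.7);   /* 1 if probability >= 0.7, otherwise 0 */
run;

/* Count how many observations fall on each side of the cutoff */
proc freq data=hpsout_cut;
   tables pred_07;
run;
```

Because the comparison returns 0 for a missing probability, observations without a prediction end up in the pred_07 = 0 group.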

 

As to your second question, I was surprised to see that, indeed, the OUT= output dataset (or any other output dataset) does not contain the information about the training-validation split. But a naïve guess, namely that the split can be reproduced with a Bernoulli random variable using the same seed and probability, was successful with the HMEQ sample dataset used in Example 67.4 Creating a Binary Classification Tree with Validation Data:

 

I created an input dataset HMEQ2 containing the split information

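/* Add a Bernoulli(0.3) indicator: 1 = validation, 0 = training */
data hmeq2;
   call streaminit(123);   /* same seed as in PARTITION FRACTION(... SEED=123) */
   set hmeq;
   _partind_=rand('bern',0.3);
run;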

and compared the results (both printed output and all three output files) of the original code from Example 67.4 (with an OUTPUT statement added)

proc hpsplit data=hmeq maxdepth=5;
   class Bad Delinq Derog Job nInq Reason;
   model Bad(event='1') = Delinq Derog Job nInq Reason CLAge CLNo
               DebtInc Loan MortDue Value YoJ;
   prune costcomplexity;
   partition fraction(validate=0.3 seed=123);
   code file='hpsplexc.sas';
   rules file='rules.txt';
   output out=hpsout;
run;

to those obtained with the new input dataset and a correspondingly adapted PARTITION statement

proc hpsplit data=hmeq2 maxdepth=5;
   class Bad Delinq Derog Job nInq Reason;
   model Bad(event='1') = Delinq Derog Job nInq Reason CLAge CLNo
               DebtInc Loan MortDue Value YoJ;
   prune costcomplexity;
   partition rolevar=_partind_(TRAIN='0' VALIDATE='1');
   code file='hpsplexc2.sas';
   rules file='rules2.txt';
   output out=hpsout2;
run;

proc compare data=hpsout c=hpsout2;
run;

The results were exactly identical in my single-machine environment (using 4 threads by default) using SAS/STAT 14.3. (I verified this also with a few other split probabilities and seed values.)

 

But to be 100% sure about which observations were used for training and which for validation you can, of course, use a dataset (like HMEQ2) containing a variable such as _PARTIND_ above in the first place.
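One way to see the partition flag next to the predictions is to score the input dataset with the file written by the CODE statement. The following is a sketch under assumptions: it presumes that the scoring code in hpsplexc2.sas creates the predicted event probability as P_Bad1 (check the generated file for the actual name), and the names scored2 and pred_07 are made up for illustration.

```sas
/* Hypothetical sketch: score HMEQ2 so that _partind_ stays in the data. */
data scored2;
   set hmeq2;
   %include 'hpsplexc2.sas';    /* scoring code written by the CODE statement */
   pred_07 = (P_Bad1 >= 0.7);   /* assumed variable name; custom 0.7 cutoff   */
run;

/* Actual vs. predicted, separately for training (0) and validation (1) */
proc freq data=scored2;
   tables _partind_*Bad*pred_07 / nopercent nocol;
run;
```

The rows with _partind_ = 1 then show the classification results for the validation observations only.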

 

 

 

artyomkosyan
Calcite | Level 5

Thanks @FreelanceReinh, this is helpful.
