I'm running SAS on Demand for Academics and trying to learn the features of PROC HPSPLIT, so I ran it on a very small sample.
The report “2-Fold Cross Validation Assessment of Pruning Parameter” shows the “Selected pruning parameter” to occur at Leaves=1 with Avg Misclassification Rate of 0.5227.
But the smallest Avg Misclassification Rate occurs at Leaves=4 (0.2864).
See the attached PDF which contains the Reports. Am I misreading the table and graphic? … The code to create these reports is given below.
data small;
input x y $;
datalines;
4 0
4 0
4 0
4 1
3 0
3 0
3 1
3 1
3 1
3 1
2 0
2 0
2 0
2 1
1 0
1 0
1 1
1 1
1 1
1 1
1 1
;
ods graphics on;
proc hpsplit data = small seed=5 CVCC PLOTS=CVCC
cvmethod=random(2) CVMODELFIT NODES=DETAIL;
class y;
model y = x;
GROW entropy;
PRUNE costcomplexity;
run;
Can you post the log for this program? I'm not reproducing this behavior, but I'm not using SAS on Demand for Academics.
Here is the Log file and Report file (I reran the SAS code today ... same Report)
Here is the "About SAS Studio" drop down. Also the SAS code (again)
Thank you for posting the log. I'm now able to reproduce your results. My machine defaults to 4 threads for the computation, but I had to use 2 threads to match your output. You can set the thread count with the PERFORMANCE statement, as follows:
proc hpsplit data=small seed=5 CVCC PLOTS=CVCC cvmethod=random(2) CVMODELFIT NODES=DETAIL;
class y;
model y = x;
GROW entropy;
PRUNE costcomplexity;
performance nthreads = 2;
run;
I suggest that you raise this issue with SAS Technical Support (make sure they also use nthreads = 2). I think what is happening is that the selected tuning parameter is the one with the smallest average squared error (ASE) under cross validation, not the one with the smallest misclassification rate. The documentation for the CVMETHOD=RANDOM option says, "The average ASE across the k trees is the cross validation error for that set of trees .... the parameter that has the minimum cross validated error is used as the best parameter value."
However, to your point, you are fitting a classification tree, and the error metric for a classification tree is commonly the misclassification rate. In fact, the documentation for the PRUNE statement says, "The error metric is misclassification rate for classification trees". I think it is worth confirming with Technical Support that PROC HPSPLIT is working as intended.
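To make the distinction concrete, here is a minimal sketch of how the two metrics can rank trees differently on the same holdout fold. Everything in it is hypothetical: the data set names (holdout, metrics), the variable names, and the predicted probabilities are made up for illustration and are not taken from your output. I am also assuming that ASE is simply the mean squared difference between the 0/1 outcome and the predicted probability of y=1; the exact formula that PROC HPSPLIT uses may be scaled differently.

```sas
/* Hypothetical holdout fold (made-up values, not from the posted output). */
/* y       = true class (0/1)                                              */
/* p_small = predicted P(y=1) from a 1-leaf tree (same value everywhere)   */
/* p_large = predicted P(y=1) from a larger tree with pure leaves          */
data holdout;
   input y p_small p_large;
   datalines;
1 0.52 1
1 0.52 1
1 0.52 1
1 0.52 1
1 0.52 0
0 0.52 0
0 0.52 0
0 0.52 0
0 0.52 1
0 0.52 1
;

data metrics;
   set holdout;
   /* Squared error of the predicted probability (assumed ASE ingredient) */
   se_small   = (y - p_small)**2;
   se_large   = (y - p_large)**2;
   /* Misclassification: predict 1 when P(y=1) > 0.5, flag 1 if wrong */
   miss_small = (y ne (p_small > 0.5));
   miss_large = (y ne (p_large > 0.5));
run;

/* The column means are the ASE and misclassification rate for each tree */
proc means data=metrics mean maxdec=4;
   var se_small se_large miss_small miss_large;
run;
```

In this made-up fold the larger tree has the lower misclassification rate (0.3 versus 0.5), but the 1-leaf tree has the lower ASE (0.2504 versus 0.3), because confident wrong predictions are penalized heavily by squared error. With folds as small as yours, leaves with predicted probabilities of 0 or 1 are plausible, so a criterion based on ASE could select Leaves=1 even when Leaves=4 misclassifies fewer observations.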