Re: Unexpected Results from PROC HPSPLIT ... when run on a small sample

blund · Posted 03-22-2024 06:11 PM

I'm running SAS OnDemand for Academics and trying to learn the features of PROC HPSPLIT, so I ran it on a very small sample.

The report “2-Fold Cross Validation Assessment of Pruning Parameter” shows the “Selected pruning parameter” to occur at Leaves=1 with Avg Misclassification Rate of 0.5227.

But the smallest Avg Misclassification Rate occurs at Leaves=4 (0.2864).

See the attached PDF which contains the Reports. Am I misreading the table and graphic? … The code to create these reports is given below.

data small;
   input x y $;
   datalines;
4 0
4 1
3 0
3 1
2 0
2 1
1 0
1 1
;

ods graphics on;

proc hpsplit data=small seed=5 CVCC PLOTS=CVCC
             cvmethod=random(2) CVMODELFIT NODES=DETAIL;
   class y;
   model y = x;
   GROW entropy;
   PRUNE costcomplexity;
run;

Mike_N · Posted 03-26-2024 12:22 PM

Can you post the log for this program? I'm not reproducing this behavior, but I'm not using SAS on Demand for Academics.

blund · Posted 03-26-2024 02:21 PM

Here is the Log file and Report file (I reran the SAS code today ... same Report)

blund · Posted 03-26-2024 02:23 PM

Here is the "About SAS Studio" drop down. Also the SAS code (again)

Mike_N · Posted 03-28-2024 03:11 PM

Thank you for posting the log. I'm now able to reproduce your results. My machine defaults to 4 threads for the computation, but I had to use 2 threads to match your output. You can set the thread count with the PERFORMANCE statement as follows:

proc hpsplit data=small seed=5 CVCC PLOTS=CVCC cvmethod=random(2) CVMODELFIT NODES=DETAIL;
	class y;
	model y = x;
	GROW entropy;  
	PRUNE costcomplexity;
	performance nthreads = 2;
run;

I suggest that you raise this issue with SAS technical support (make sure they also use nthreads = 2). I think what is happening is that the selected tuning parameter is the one with the smallest average squared error (ASE) based on cross validation. The documentation for the CVMETHOD=RANDOM option says "The average ASE across the k trees is the cross validation error for that set of trees .... the parameter that has the minimum cross validated error is used as the best parameter value."

However, to your point, you are fitting a classification tree, and the error metric for a classification tree is commonly the misclassification rate. In fact, the documentation for the PRUNE statement says "The error metric is misclassification rate for classification trees". I think it is worth confirming with technical support that PROC HPSPLIT is working as intended.
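To see how ASE and misclassification rate can rank two trees differently, here is a minimal sketch on made-up holdout predictions. The data set names (holdout, errors), the variables (p_root, p_deep), and all of the numbers are hypothetical and are not taken from the attached reports. The squared error is computed on the predicted probability of y=1 only, which may not match exactly how PROC HPSPLIT computes ASE internally.

```sas
/* Hypothetical holdout predictions, NOT the output from this thread.      */
/* y      = observed target (0/1)                                           */
/* p_root = predicted P(y=1) from a one-leaf tree (always 0.5)              */
/* p_deep = predicted P(y=1) from a deeper tree whose leaves are pure (0/1) */
data holdout;
   input y p_root p_deep;
   datalines;
0 0.5 0
0 0.5 0
0 0.5 0
0 0.5 1
0 0.5 1
1 0.5 1
1 0.5 1
1 0.5 1
1 0.5 1
1 0.5 0
;

data errors;
   set holdout;
   /* squared error of the predicted probability */
   se_root = (y - p_root)**2;
   se_deep = (y - p_deep)**2;
   /* misclassification flag: predict 1 only when p > 0.5 (ties go to 0) */
   mis_root = (y ne (p_root > 0.5));
   mis_deep = (y ne (p_deep > 0.5));
run;

/* The means of se_* are the ASEs; the means of mis_* are the error rates */
proc means data=errors mean maxdec=4;
   var se_root se_deep mis_root mis_deep;
run;
```

With these made-up values, the one-leaf tree has the lower ASE (0.25 versus 0.30) but the higher misclassification rate (0.50 versus 0.30). A selection rule that minimizes ASE would therefore pick the one-leaf tree even though the deeper tree misclassifies fewer observations, which is the same kind of disagreement seen in the cross validation report.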
