blund
Obsidian | Level 7

I’m running SAS OnDemand for Academics and trying to learn the features of PROC HPSPLIT, so I ran it on a very small sample.

The report “2-Fold Cross Validation Assessment of Pruning Parameter” shows the “Selected pruning parameter” to occur at Leaves=1 with Avg Misclassification Rate of 0.5227.

But the smallest Avg Misclassification Rate occurs at Leaves=4 (0.2864).

See the attached PDF which contains the Reports. Am I misreading the table and graphic? … The code to create these reports is given below.

 

data small;
	input x y $;
	datalines;
4 0
4 0
4 0
4 1
3 0
3 0
3 1
3 1
3 1
3 1
2 0
2 0
2 0
2 1
1 0
1 0
1 1
1 1
1 1
1 1
1 1
;
ods graphics on;
proc hpsplit data = small seed=5 CVCC PLOTS=CVCC
	cvmethod=random(2) CVMODELFIT NODES=DETAIL;
	class y;
	model y = x;
	GROW entropy;
	PRUNE costcomplexity;
run;

Mike_N
SAS Employee

Can you post the log for this program? I'm not reproducing this behavior, but I'm not using SAS on Demand for Academics. 

blund
Obsidian | Level 7

Here is the Log file and Report file (I reran the SAS code today ... same Report)

blund
Obsidian | Level 7

Here is the "About SAS Studio" drop down. Also the SAS code (again)

Mike_N
SAS Employee

Thank you for posting the log. I'm now able to reproduce your results. My machine defaults to 4 threads for the computation, but I needed to use 2 threads to match your output. You can set the thread count with the PERFORMANCE statement, as follows:

proc hpsplit data=small seed=5 CVCC PLOTS=CVCC cvmethod=random(2) CVMODELFIT NODES=DETAIL;
	class y;
	model y = x;
	GROW entropy;  
	PRUNE costcomplexity;
	performance nthreads = 2;
run;

I suggest that you raise this issue with SAS Technical Support (make sure they also use nthreads = 2). I think the selected pruning parameter is the one with the smallest cross-validated average squared error (ASE), not the smallest misclassification rate. The documentation for the CVMETHOD=RANDOM option says, "The average ASE across the k trees is the cross validation error for that set of trees .... the parameter that has the minimum cross validated error is used as the best parameter value."

 

However, to your point, you are fitting a classification tree, and the usual error metric for a classification tree is the misclassification rate. In fact, the documentation for the PRUNE statement says, "The error metric is misclassification rate for classification trees". I think it is worth confirming with Technical Support that PROC HPSPLIT is working as intended.
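
To make the difference between the two error metrics concrete, here is a minimal Python sketch (Python is used only for illustration; it is not what PROC HPSPLIT runs). It computes the misclassification rate and an average squared error on the 21 posted rows for two hypothetical trees: a single leaf, and one leaf per value of x. Two assumptions to keep in mind: ASE is taken here as the mean squared difference between the 0/1 outcome and the leaf's predicted probability of y=1, which may not match HPSPLIT's exact definition, and both metrics are computed in-sample on all rows, not by cross validation, so the numbers are not expected to match the report.

```python
from collections import defaultdict

# (x, y) pairs copied from the DATA step in the original post.
rows = [
    (4, 0), (4, 0), (4, 0), (4, 1),
    (3, 0), (3, 0), (3, 1), (3, 1), (3, 1), (3, 1),
    (2, 0), (2, 0), (2, 0), (2, 1),
    (1, 0), (1, 0), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1),
]


def leaf_probs(data, leaf_of):
    """Estimate P(y=1) in each leaf from the rows that fall into it."""
    ones = defaultdict(int)
    total = defaultdict(int)
    for x, y in data:
        leaf = leaf_of(x)
        total[leaf] += 1
        ones[leaf] += y
    return {leaf: ones[leaf] / total[leaf] for leaf in total}


def metrics(data, leaf_of):
    """Return (misclassification rate, ASE) for a tree given by leaf_of.

    ASE here is an assumed definition: mean of (y - P(y=1))^2.
    """
    probs = leaf_probs(data, leaf_of)
    misses = 0
    squared_error = 0.0
    for x, y in data:
        p1 = probs[leaf_of(x)]
        predicted = 1 if p1 >= 0.5 else 0  # majority class in the leaf
        misses += int(predicted != y)
        squared_error += (y - p1) ** 2
    n = len(data)
    return misses / n, squared_error / n


# Hypothetical trees: every row in one leaf, or one leaf per x value.
one_leaf = metrics(rows, lambda x: 0)
four_leaf = metrics(rows, lambda x: x)

print("1 leaf : misclassification=%.4f  ASE=%.4f" % one_leaf)
print("4 leaves: misclassification=%.4f  ASE=%.4f" % four_leaf)
```

In-sample, both metrics favor the four-leaf tree, so this sketch does not reproduce the discrepancy. It only shows that the two metrics are different quantities computed from the same leaf probabilities; under cross validation they can rank subtrees differently, which is the behavior suspected above.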
