CoryJC
Calcite | Level 5

Hello everyone, I'm relatively new to classification trees and I was hoping to ask some questions about using PROC HPSPLIT (STAT 13.2) to run exhaustive CHAID. 

 

To give some background, I'm working with a large dataset to model the risk of the dichotomous outcome "ipvcc" based on 3-6 nominal demographic variables (p4, region, lgbtq, etc.). The most recent iteration of my code is below:

 

ods graphics on;

proc hpsplit data=sv.mssgrade3 event="y" alpha=0.05 bonferroni
maxdepth=30 mincatsize=200 leafsize=50 nodes;
criterion CHAID ;
input p4 / level=nom;
input region / level=nom;
input lgbtq / level=nom;
input ses_proxy / level=nom;
input race / level=nom;
prune none;
target ipvcc;
score out=work.mss_score;
run;

 

The thing is, I'm running this analysis alongside a colleague who uses SPSS, and we get different outputs whenever we run the model with five or more independent variables. Our test data is identical, and we've ruled out differences in how missing values are treated as the cause of the discrepancy, so we think it has something to do with the CHAID options. Her SPSS syntax looks like this:

 

TREE IPSEXV [n] BY p4 [n] region [n] lgbtq [n] ses_proxy [n] race [n]
  /TREE DISPLAY=TOPDOWN NODES=BOTH BRANCHSTATISTICS=YES NODEDEFS=YES SCALE=AUTO
  /DEPCATEGORIES USEVALUES=[VALID]
  /PRINT MODELSUMMARY CLASSIFICATION RISK TREETABLE
  /METHOD TYPE=CHAID
  /GROWTHLIMIT MAXDEPTH=30 MINPARENTSIZE=200 MINCHILDSIZE=50
  /VALIDATION TYPE=NONE OUTPUT=BOTHSAMPLES
  /CHAID ALPHASPLIT=0.05 ALPHAMERGE=0.05 SPLITMERGED=NO CHISQUARE=PEARSON CONVERGE=0.001
    MAXITERATIONS=100 ADJUST=BONFERRONI
  /COSTS EQUAL
  /MISSING NOMINALMISSING=MISSING.

 

She seems to have control over more aspects of her procedure than I do. In particular, her syntax specifies rules for merging and convergence that aren't explicitly specified in mine. My first question is: how can I adjust my syntax so that my settings are similar to my colleague's?

 

My second question is: I don't think either of us is currently doing exhaustive CHAID, as opposed to regular CHAID. How can I specify exhaustive CHAID in PROC HPSPLIT?

 

Any advice helps! Thank you for your time.

sbxkoenk
SAS Super FREQ

Hello,

 

For exhaustive CHAID, you need these options:

  • LEVTHRESH1=: specifies the maximum number of computations to perform in an exhaustive search for a categorical predictor
  • LEVTHRESH2=: specifies the number of computations to perform before the splitter uses the fastest greedy search
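
As a rough sketch, the two options could be added to the PROC HPSPLIT statement from the question like this. The threshold values are placeholders chosen only for illustration, not recommendations, and this assumes your release (STAT 13.2) accepts both options on the PROC statement in the same way as the release covered by the documentation linked in this reply, so please check the syntax for your version.

```sas
/* Sketch only: the LEVTHRESH values below are illustrative placeholders. */
/* Data set, variables, and the other options are taken from the question. */
proc hpsplit data=sv.mssgrade3 event="y" alpha=0.05 bonferroni
             maxdepth=30 mincatsize=200 leafsize=50 nodes
             levthresh1=1000000    /* computation limit for the exhaustive search */
             levthresh2=2000000;   /* limit before the fastest greedy search is used */
   criterion CHAID;
   input p4 region lgbtq ses_proxy race / level=nom;
   prune none;
   target ipvcc;
   score out=work.mss_score;
run;
```

Raising the limits only makes an exhaustive search possible for predictors with more levels; whether the resulting tree matches the SPSS output is something you would still have to verify.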

 

See here: https://support.sas.com/documentation/onlinedoc/stat/141/hpsplit.pdf

 

Also read the section "Splitting Strategy" on pp. 4607-4608.

 

Good luck,

Koen

 
