Exhaustive CHAID with PROC HPSPLIT - some questions

CoryJC · Posted 08-26-2021 01:33 PM

Hello everyone, I'm relatively new to classification trees and I was hoping to ask some questions about using PROC HPSPLIT (STAT 13.2) to run exhaustive CHAID.

To give some background, I'm working with a large dataset to model the risk of the dichotomous outcome "ipvcc" based on three to six nominal demographic variables (p4, region, lgbtq, etc.). The most recent iteration of my code is below:

ods graphics on;

proc hpsplit data=sv.mssgrade3 event="y" alpha=0.05 bonferroni
             maxdepth=30 mincatsize=200 leafsize=50 nodes;
   criterion CHAID;
   input p4 / level=nom;
   input region / level=nom;
   input lgbtq / level=nom;
   input ses_proxy / level=nom;
   input race / level=nom;
   prune none;
   target ipvcc;
   score out=work.mss_score;
run;

The thing is, I'm running this analysis alongside a colleague who uses SPSS, and we get different outputs whenever we run the model with five or more independent variables. Our test data are identical, and we've ruled out differences in the handling of missing values as the cause of the discrepancy, so we suspect it has something to do with the CHAID options. Her SPSS syntax looks like this:

TREE IPSEXV [n] BY p4 [n] region [n] lgbtq [n] ses_proxy [n] race [n]
/TREE DISPLAY=TOPDOWN NODES=BOTH BRANCHSTATISTICS=YES NODEDEFS=YES SCALE=AUTO
/DEPCATEGORIES USEVALUES=[VALID]
/PRINT MODELSUMMARY CLASSIFICATION RISK TREETABLE
/METHOD TYPE=CHAID
/GROWTHLIMIT MAXDEPTH=30 MINPARENTSIZE=200 MINCHILDSIZE=50
/VALIDATION TYPE=NONE OUTPUT=BOTHSAMPLES
/CHAID ALPHASPLIT=0.05 ALPHAMERGE=0.05 SPLITMERGED=NO CHISQUARE=PEARSON CONVERGE=0.001
MAXITERATIONS=100 ADJUST=BONFERRONI
/COSTS EQUAL
/MISSING NOMINALMISSING=MISSING.

She seems to have control over more aspects of her procedure than I do. In particular, her syntax specifies rules for merging and convergence that aren't explicitly specified in mine. My first question is: how can I adjust my syntax so that my settings match my colleague's?

My second question is: I don't think either of us is currently doing exhaustive CHAID as opposed to regular CHAID. How can I specify exhaustive CHAID in PROC HPSPLIT?

Any advice helps! Thank you for your time.

sbxkoenk · Posted 08-26-2021 01:56 PM

Hello,

For an exhaustive CHAID, you need the options:

LEVTHRESH1= Specifies the maximum number of computations to perform in an exhaustive search for a categorical predictor
LEVTHRESH2= Specifies the number of computations to perform before the splitter uses the fastest greedy search

See here: https://support.sas.com/documentation/onlinedoc/stat/141/hpsplit.pdf

Also read the section "Splitting Strategy" on pp. 4607-4608.
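
As a rough sketch (not tested against STAT 13.2), this is your posted call with the two thresholds added. I am assuming both options are accepted on the PROC HPSPLIT statement in your release, and the values shown are placeholders rather than recommendations, so check the syntax section of the linked documentation before relying on them:

```sas
/* Sketch only: the call from the question with LEVTHRESH1= and LEVTHRESH2= added. */
/* Assumption: both options belong on the PROC HPSPLIT statement.                  */
/* Assumption: 500000 and 1000000 are illustrative placeholder values.             */
proc hpsplit data=sv.mssgrade3 event="y" alpha=0.05 bonferroni
             maxdepth=30 mincatsize=200 leafsize=50 nodes
             levthresh1=500000    /* cap on computations for the exhaustive search */
             levthresh2=1000000;  /* point at which the fastest greedy search is used */
   criterion CHAID;
   input p4 / level=nom;
   input region / level=nom;
   input lgbtq / level=nom;
   input ses_proxy / level=nom;
   input race / level=nom;
   prune none;
   target ipvcc;
   score out=work.mss_score;
run;
```

As I read the option descriptions above, a larger LEVTHRESH1= lets the exhaustive search run for predictors that need more computations, and LEVTHRESH2= sets the point at which the splitter switches to the fastest greedy search.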

Good luck,

Koen
