PROC HPSPLIT: is this decision tree tool good for categorising respond...

🔒 This topic is **solved** and **locked**.

Posted 02-26-2017 04:03 PM
(7606 views)

Hello,

I was advised to use PROC HPSPLIT to classify/categorise respondents. I have one ordinal DV (delivery rating) and varied IVs (binary, ordinal, nominal and continuous). Can Proc HPSPLIT help in classifying/categorising the respondents into 'delivery rating' based on the available IVs?

I used

proc hpsplit data=hpsplit.data;

class del_rating;

model del_rating = pckg_qual prcl_cond del_time del_loc max_del_time max_pc_wgt pc_count;

grow entropy;

prune costcomplexity (leaves=10);

run;

It of course produced outputs that I could not interpret, for example the 'Classification Tree for Del_Rating' and the 'Subtree starting at Node=0'.

I'm aware of CHAID analysis; however, I have never used this procedure before. The first predictor that CHAID uses to split the sample is the IV that is most strongly associated with the DV, i.e., the one that gives the most differentiated groups of respondents. Is it something similar?

Could you please help me with choosing the right criterion for deciding on the split? Must I choose the 'leaves'? Or is there an option whereby it stops producing 'child' nodes once the algorithm no longer finds any significantly discriminating predictor?

Apologies for posing such naive questions.

I thank you in advance.

MS

1 ACCEPTED SOLUTION

Hi!

If I understand what you're trying to do, then I think that Proc HPSPLIT could definitely help you achieve your goal(s).

When creating your PROC HPSPLIT call, every binary, ordinal, or nominal variable should be listed in the CLASS statement (HPSPLIT doesn't actually distinguish between nominal and ordinal). I notice you only had the dependent variable in the CLASS statement in your example, which is correct, but I didn't know if you had other non-continuous variables that you wanted to add.
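
As a minimal sketch of that advice, the original call could list the categorical predictors next to the response in the CLASS statement. Which predictors are categorical is an assumption here: pckg_qual, prcl_cond, and del_loc are treated as categorical purely for illustration, so substitute your actual binary, ordinal, and nominal variables.

```sas
/* Sketch only: the choice of categorical predictors below is assumed, */
/* not taken from the poster's data.                                   */
proc hpsplit data=hpsplit.data;
   /* The response plus every non-continuous predictor goes in CLASS */
   class del_rating pckg_qual prcl_cond del_loc;
   /* Predictors not listed in CLASS are treated as continuous */
   model del_rating = pckg_qual prcl_cond del_time del_loc
                      max_del_time max_pc_wgt pc_count;
   grow entropy;
   prune costcomplexity (leaves=10);
run;
```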

You mentioned that you could not interpret the two tree plots, so I'm going to give some background; hopefully it makes sense in such a short post. HPSPLIT looks at the dependent variable and finds the independent variable, along with the split, that optimizes the tree-growing criterion (in your example, the entropy metric). The training observations are then split into two subsets based upon the splitting variable. You should be able to see this first split in the 'Subtree Starting at Node=0' diagram (node 0 is the root node, which corresponds to the entire pool of training data).

HPSPLIT then recursively splits each subset of observations again, in a similar manner as the first time. Each time that HPSPLIT performs a split, it looks for the variable and the split that optimizes the growing criterion on the particular subset of the data. If there is no split formed from a node, that is called a leaf node.

If you read from the top of the tree (root node, or node=0) to the bottom of the tree, then you would read a set of rules that might be something like (pc_count < 2, then max_pc_wgt > 25, then prcl_cond = good), leading to a single leaf node. That leaf would have statistics on it, such as the predicted del_rating for any observations in the node, as well as the proportion of each del_rating that occurs in that node (for example, del_rating 1 0%, del_rating 2 10%, and del_rating 3 90%).
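
If reading the rules off the plot is awkward, one option is to have the procedure write them to a text file. The sketch below assumes that the RULES statement is available in your release; the file name is hypothetical.

```sas
/* Sketch only: assumes the RULES statement is supported in your release. */
proc hpsplit data=hpsplit.data;
   class del_rating;
   model del_rating = pckg_qual prcl_cond del_time del_loc
                      max_del_time max_pc_wgt pc_count;
   grow entropy;
   prune costcomplexity (leaves=10);
   /* Hypothetical file name; the path from root to each leaf is written as text */
   rules file='del_rating_rules.txt';
run;
```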

In general, PROC HPSPLIT grows the tree until it reaches a maximum depth (the default is 10). It then prunes (removes some splits) based upon cross validation. If you'd like to split fewer times, you can do so with the MAXDEPTH= option on the PROC statement.
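
A hedged sketch of these two controls: MAXDEPTH= on the PROC statement caps how deep the tree grows, and leaving LEAVES= off the PRUNE statement avoids fixing the number of leaves by hand. The value 5 is an arbitrary illustration, and how the final subtree size is then chosen may depend on your release and on whether a validation partition is supplied, so check the documentation for your version.

```sas
/* Sketch only: maxdepth=5 is an arbitrary illustrative value. */
proc hpsplit data=hpsplit.data maxdepth=5;
   class del_rating;
   model del_rating = pckg_qual prcl_cond del_time del_loc
                      max_del_time max_pc_wgt pc_count;
   grow entropy;
   /* No LEAVES= suboption: the leaf count is not fixed in advance */
   prune costcomplexity;
run;
```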

Finally, as far as variable importance goes, HPSPLIT calculates variable importance as an aggregate throughout the entire splitting process. The variable used in the first split might be important, but other variables could have a higher impact in subsequent splits.
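
To inspect the importance values outside the printed output, they can be saved to a data set. This is a sketch that assumes the IMPORTANCE= option of the OUTPUT statement is available in your release; the data set name work.var_imp is hypothetical.

```sas
/* Sketch only: assumes OUTPUT IMPORTANCE= is supported in your release. */
proc hpsplit data=hpsplit.data;
   class del_rating;
   model del_rating = pckg_qual prcl_cond del_time del_loc
                      max_del_time max_pc_wgt pc_count;
   grow entropy;
   prune costcomplexity (leaves=10);
   /* Hypothetical data set name for the variable importance table */
   output importance=work.var_imp;
run;

/* List the saved importance values, one row per variable */
proc print data=work.var_imp;
run;
```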

I hope this helps answer some questions. I'd also recommend the HPSPLIT procedure documentation for a more detailed description of how to interpret the tree diagrams; it does a better job than I can do in a short forum post.

Please let me know if you have further questions.

-Ralph

9 REPLIES 9

Hello Ralph,

Thank you for the detailed answer.

Yes, I had only mentioned the DV in the CLASS statement and not the other ordinal and nominal variables. Thank you for pointing that out. Also, my leaves did not display any statistics - attached is the output file.

The output had 162 leaves after pruning - I now have 14 IVs as opposed to the former 8 IVs.

Would you recommend using the PRUNEUNTIL= option or the MAXVARIANCE= option? I ask, as I need to present this Tree and was wondering if zooming the tree by NODE= option would help.

The 'Variable Importance' table reports Relative Importance, Importance, and Count for each IV. How should these be interpreted? Could I exclude the variable(s) with the lowest relative importance from a subsequent run?

I have SAS 9.4 - is it the reason that my tree is monotone?

What is the difference between using TARGET as opposed to using CLASS with MODEL statement?

Thank you again for your assistance - I really appreciate it.

Regards

Mari

The zoomed tree plot can be very helpful, but it can also be hard to understand in an overall context, because it offers only a small view of the overall tree. The default is to show a depth of 3 for the zoomed tree (this is just the display depth), but sometimes that can prevent some of the statistics from showing up.

We can modify your original code to have a zoomedtree plot that will have more detail, because it has more space (only plots depth of 2). The other thing we can do is plot multiple zoomed trees based upon different parts of the overall tree.

Changing the depth and adding multiple zoomed tree plots is shown below. Hopefully the depth of 2 instead of the default 3 will show you more statistics on the nodes. Theoretically you could use the 'nodes' suboption to create a bunch of zoomed tree plots, and then reconstruct a zoomed version of the entire tree (not something I generally recommend, but I could see cases in which it might actually be needed).

proc hpsplit data=hpsplit.data plots=(zoomedtree(depth=2 nodes=(0 3 4)));

class del_rating;

model del_rating = pckg_qual prcl_cond del_time del_loc max_del_time max_pc_wgt pc_count;

grow entropy;

prune costcomplexity (leaves=10);

run;

Excluding unimportant variables is certainly something that people do. The 'Count' importance is just the number of times that a variable was used to split a node. The 'Importance' is calculated from the change in the residual sum of squares before and after node splits. The 'Relative Importance' just divides all the importance values by the largest one. Excluding unimportant variables in subsequent runs becomes useful when you start with many variables (sometimes people start with hundreds, and limiting the analysis can be very helpful).

The TARGET statement is actually an alternative to the MODEL statement. You would pair the TARGET statement with an INPUT statement, but you cannot use TARGET & INPUT along with CLASS & MODEL.

This is an example of using INPUT and TARGET. It does not change the analysis that is done; it is just an alternate syntax. In this case, you manually specify the level for each group of INPUT variables (independent variables), where interval means continuous. Then you specify one TARGET (dependent variable) and its corresponding level.

proc hpsplit data=hpsplit.data plots=(zoomedtree(depth=2 nodes=(0 3 4)));

INPUT <variable 1> <variable 2> / level = interval;

INPUT <variable 3> <variable 4> <variable 5> / level = nominal;

TARGET del_rating / level = nominal;

grow entropy;

prune costcomplexity (leaves=10);

run;

If you're familiar with using CLASS & MODEL, then I would recommend staying with that syntax. However, if the INPUT & TARGET syntax makes more sense to you, then it is available.

Hopefully this helps!

-Ralph

Hello Ralph,

Thank you very much for the detailed answer.

I believe, I will stick to using CLASS and MODEL if they produce the same results as TARGET.

I have one more question (hopefully, the last) for you.

What do the **marked** values represent?

Regards

Mari

The marked values are the values at which the split occurs.

Let us consider Node 0, the top of the tree. This represents the full set of observations. The two thick lines descending from Node 0 represent the split of the full set of observations into two smaller groups. That split is determined using the variable "Flav" and the value of 1.572. That is to say, all observations with Flav < 1.572 go into Node 1, while all observations with Flav >= 1.572 go into Node 2.

Node 2 is split again, this time using the variable "Proline" and the value of Proline which is used to split the set of observations is 726.640. Node 3 represents all observations that have Flav >= 1.572 AND Proline < 726.640. Node 3 is composed of 54 observations (This is the number corresponding to "N" on the node), and the Node is 98.15% made up of level 2 (which according to the legend is Cultivar=2).

Ultimately those highlighted numbers represent the value at which the split occurs for the variable (for continuous variables). If you have nominal variables, instead of < and >=, the different levels will be indicated on the different splits.
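
The split logic described above can be written out as plain IF/ELSE rules. The DATA step below is only an illustration of how the thresholds route an observation; the three input rows are made up, the data set names are hypothetical, and the second child of Node 2 is left unnamed because its node number is not given here.

```sas
/* Made-up rows for illustration only; not taken from the actual data */
data work.wine;
   input flav proline;
   datalines;
1.20 500
2.10 600
2.50 900
;
run;

data work.wine_nodes;
   set work.wine;
   length node_path $40;
   /* First split at the root: Flav < 1.572 goes to Node 1 */
   if flav < 1.572 then node_path = 'Node 1';
   /* Otherwise the observation is in Node 2, which splits on Proline */
   else if proline < 726.640 then node_path = 'Node 2 -> Node 3';
   else node_path = 'Node 2 -> other child';
   /* Note: a missing Flav compares as less than 1.572 in a DATA step, */
   /* which may differ from how HPSPLIT assigns missing values.        */
run;

proc print data=work.wine_nodes;
run;
```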

Hopefully this helps!

-Ralph

Hello Ralph,

You have been a great help! Thank you for answering (well, at that) all my questions.

Regards

MS

Hello Ralph,

One of my non-continuous variables (ordinal with values 1-5) shows values >= 6,240 and <6,240 at which splits happen. How would you explain that?

Regards

Mari

Hello Ralph,

Kindly ignore my question. It was indeed a continuous variable, and I confused it with another variable: Del_time (real) and Del_time (perceived), where perceived was ordinal and real was continuous. Apologies for the confusion, and thank you for the (again) prompt answer.

Have a great day ahead!

Regards

MS
