Vi_
Fluorite | Level 6

Hi, 

 

If you specify the OUTPUT NODESTATS= option in PROC HPSPLIT, it gives you a table that I think is the key to generating the tree rules.

 

Basically, I need code that reads the table like this: when the node (ID column) is 3 and its parent node (PARENT column) is 1, go back to the ID column, find the rule (DECISION column) for ID=1, and repeat recursively until the root node is reached.

 


 

Any suggestions?

 

(I am using Enterprise Guide 7.1)

 

Thank you in advance,

 

Vi

 

 

 

 

 

 

 

PGStats
Opal | Level 21

HPSPLIT now has the RULES statement, which creates a text version of the rules that define the leaves of the final tree.

 

http://documentation.sas.com/?docsetId=statug&docsetTarget=statug_hpsplit_syntax11.htm&docsetVersion...

PG
Vi_
Fluorite | Level 6
Hi PGStats,

Thank you for your response. I did try to use the rules file, but I have trouble calling the txt file from my SAS server library, because in the end the rule files have to be included in my PDF report. Additionally, since I am only interested in certain rules, I think it would be easier if I could create my own tree rules from the NODESTATS file.


(In my case, I have 72 rule files to include in the PDF report, and I will have to manually clean out the rules I don't need and copy and paste the rest into the PDF unless I figure out an easier way.)


PGStats
Opal | Level 21

Understood. This code fragment might be useful. I wrote it some time ago to format the rules from a binary tree (it probably doesn't work for just any decision tree).

 

proc hpsplit...;
...;
output nodestats = SplitNS;
...;
run;

/* Index SplitNS on Id so the DATA step below can fetch any node directly */
proc sql;
create unique index Id on SplitNS (Id);
quit;

/* For every leaf, output the leaf record and then each ancestor up to the root */
data Leaves;
set SplitNS;
if not missing(leaf) then do;
    leafNo = leaf;
    leafPop = n;
    value = predictedvalue;
    output;
    id = parent;
    /* The loop ends once id is missing or negative, i.e. past the root */
    do while (id >= 0);
        /* Keyed read: replaces the current record with the node whose Id equals id */
        set SplitNS key=Id/unique;
        output;
        id = parent;
        end;
    end;
keep leafNo depth leafPop id decision insplitvar value;
run;

/* Group the conditions on the same split variable within each leaf */
proc sort data=Leaves; by leafNo leafPop inSplitVar depth; run;

/* Collapse the conditions for each leaf and split variable into one string */
data LeafText;
length ds str $128;
do until(last.inSplitVar);
    set Leaves; by leafNo inSplitVar;
    /* Remove "or Missing" from decision since there were no missing value in data (optional) */
    decision = left(prxChange('s/or Missing//o', -1, decision));
    select (first(decision));
        /* "<" conditions: keep the smallest upper bound */
        when ('<') lt = min(lt, input(scan(substr(decision,2),1," "),best.));
        /* ">=" conditions: keep the largest lower bound */
        when ('>') ge = max(ge, input(scan(substr(decision,3),1," "),best.));
        /* Any other decision text: keep the shortest one */
        otherwise if missing(ds) or length(ds) > length(decision) then ds = decision;
        end;
    end;

if cmiss(ge, lt) = 0    then str = catx(" - ", ge, lt);
else if not missing(ge) then str = catx(" ", ge, "-");
else if not missing(lt) then str = catx(" ", "-", lt);
else if not missing(ds) then str = ds;

keep leafNo leafPop value inSplitVar str;
run;

/* One row per leaf, one column per split variable */
proc transpose data=LeafText out=LeafCond(drop=_name_);
where inSplitVar is not missing;
by leafNo leafPop value;
id inSplitVar;
var str;
run;
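
For readers who want to see the collapsing logic of the LeafText step in isolation, here is a minimal sketch in Python (not SAS, purely for illustration). It assumes every numeric decision has the form "< x" or ">= x", as the SAS code above appears to expect. The function name summarize and the sample inputs are hypothetical and do not come from HPSPLIT output.

```python
def summarize(decisions):
    """Collapse the conditions on one split variable into a single string.

    Assumes numeric decisions look like "< x" or ">= x" (an assumption
    mirroring the SAS step above); anything else is treated as plain text.
    """
    lower = None  # largest ">= x" bound seen so far
    upper = None  # smallest "< x" bound seen so far
    other = None  # shortest non-numeric decision text seen so far
    for text in decisions:
        text = text.strip()
        if text.startswith(">="):
            bound = float(text[2:].split()[0])
            lower = bound if lower is None else max(lower, bound)
        elif text.startswith("<"):
            bound = float(text[1:].split()[0])
            upper = bound if upper is None else min(upper, bound)
        elif text and (other is None or len(text) < len(other)):
            other = text
    if lower is not None and upper is not None:
        return f"{lower:g} - {upper:g}"
    if lower is not None:
        return f"{lower:g} -"
    if upper is not None:
        return f"- {upper:g}"
    return other or ""


# Hypothetical conditions collected along one leaf-to-root path
print(summarize(["< 5.5", ">= 2", "< 4"]))
```

Keeping only the largest lower bound and the smallest upper bound works because conditions further down the tree can only narrow the range set by their ancestors.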
PG
Vi_
Fluorite | Level 6

Hi PG,

 

Thank you, your code works in my case. But I have some questions, and hopefully you can help me with them.

 

1. Why create a unique index on the ID column? Aren't the IDs already unique?

2. I don't fully understand this part of the code. Why does it do something when ID = Parent, and what is the code in this part doing?

    id = parent;
    do while (id >= 0);
        set Node key=Id/unique;
        output;
        id = parent;
        end;

 

Thank you so much,

 

Vi

 

PGStats
Opal | Level 21

1) The index is created to allow random access to the parent records.

 

2) This is precisely where the index comes in. This loop goes up the parent list (from the leaf to the root of the tree) and outputs every record along the way. The code

 

id = parent;
set Node key=Id/unique;

replaces the current record with the parent of the current record.
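
The same walk can be sketched outside SAS. The Python snippet below is only an illustration: the rows are invented, the name path_to_root is hypothetical, and the root is assumed to have no parent. A dictionary plays the role of the index, giving direct access to a record by its id.

```python
# Invented node table for illustration; the keys mirror the Id, Parent
# and Decision columns discussed in this thread.
nodes = [
    {"id": 0, "parent": None, "decision": ""},
    {"id": 1, "parent": 0, "decision": "< 5.5"},
    {"id": 2, "parent": 0, "decision": ">= 5.5"},
    {"id": 3, "parent": 1, "decision": "< 2.0"},
]

# The dict stands in for the unique index: direct lookup of a record by id.
by_id = {node["id"]: node for node in nodes}


def path_to_root(node_id):
    """Return the records from the given node up to the root, in that order."""
    path = []
    current = node_id
    while current is not None:      # assumed: the root has no parent
        record = by_id[current]     # like: set Node key=Id/unique;
        path.append(record)         # like: output;
        current = record["parent"]  # like: id = parent;
    return path


print([rec["decision"] for rec in path_to_root(3)])
```

Without the index (or the dictionary here), each step up the tree would need a scan of the whole table to find the parent record.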

 

PG
