Variables in Random Forests in SAS EM

Occasional Contributor
Posts: 13
Hi guys, 

 

I was hoping one of you could help me with this. I am familiar with Random Forests and how they operate (well, theoretically), but I'm struggling with the forests in E-miner. I want to model a certain dataset, but there are a few variables in the dataset that I don't want the model to use (after running some exploratory nodes, I discovered that some of them cause high correlations whilst others are unstable). I dragged a Metadata node into the diagram and ran it before the random forest node. In this Metadata node, I set the role of these variables to "Rejected". Additionally, just to be safe, I right-clicked on the random forest node, selected the "Edit Variables" option, and set the "Use" column for these variables to "No" in the node itself. When I view the results, these variables are not listed under "Variable importance", and they are also not listed when you select "View -> Properties -> Variables". Awesome! However, when I extract the scoring code to run the forest in Base SAS, I see that a lot of the variables I don't want in the model still appear in the SAS code. Is this just the way the random forest node executes? Are these variables used in the final model anyway (which is not ideal)?

 

Thanks for the help!


Accepted Solutions
Solution
‎11-20-2015 12:14 AM
SAS Employee
Posts: 122

Re: Variables in Random Forests in SAS EM

JakesVenter,

This is a very good question. I just got off the phone with another customer who had difficulty bringing transformations into the score code. You did everything right in keeping and rejecting variables for the HP Forest node (RF) model to work. That is great.

When you add nodes to an EM flow, EM writes out SAS code for many of them. The Transform Variables node is one of them. (The Impute node is another, but the Partition node is not, because you don't generate a partition in scoring deployment.) What does it write? It writes out your transformations so you don't have to. At the end of the flow, which in your case appears to be an HP Forest node, EM compiles all the individual 'node code bits' along the flow. You probably picked up your existing copy of the score code from the HP Forest node (Results --> View --> Scoring --> Score Code), or you did connect a Score node to the HP Forest node but somehow the Optimized Code option under the Score Code Generation section is set to No.

The Optimized Code option is designed specifically to help you cut unwanted transformations out of the generated scoring code. When you transform variables to build models, you do not know in advance which ones will make it into the model and which ones will not. The Transform Variables node already has 7 default transformations for interval variables, and you may add more. Imagine your initial inputs are 750 variables. How many variables then reach the RF node?

This glut problem is a particular headache for random forest models. In logistic regression, for example, if you feed in, say, 500 transformed variables, the model may end up using 14 of them, and the optimized score code drops the 486 unwanted variables. RF does not work that way. If you build, say, 50000 trees in a forest, variable w25 may never be significant until you arrive at tree #45000, and even then only in one specific branch. The very spirit of the RF model is to include that branch in the vote for overall accuracy. So, generally, the fewer trees you build, the fewer surplus transformations you should expect in your scoring code.

There is a good practice you may consider. In the HP Forest node (I am running EM 14.1; some previous versions may not have it yet), there is a Variable Selection option. Start your RF modeling with Variable Selection set to No and look at the performance curve (likely the misclassification curve). Then turn on Variable Selection and compare the two performance curves. This gives you an idea of how much performance you sacrifice if you use variable selection to get fewer variables and cut the scoring glut, among other considerations.

Your habit of meticulously applying and rejecting variables at the metadata layer absolutely should continue. Your business sense is your best variable selection tool. Big picture, think and re-think why you want to transform variables to build a random forest model to begin with.

Good luck.

Best Regards
Jason Xin



All Replies
Super Contributor
Posts: 336

Re: Variables in Random Forests in SAS EM

Hi Jakes,

 

When you say you see the extra variables in the score code, are you sure that you are looking at the score code from HPForest?

Random Forests in Enterprise Miner is one of the few procedures that do not produce SAS Score code. The score code would be quite large, so instead your HPForest node produces a file that another procedure (proc hp4score) uses to score.

 

If you go to results to see the SAS code, you will see something like the below

 

 proc hp4score data=&hpfst_score_input;
  id &hpfst_id_vars;
  %if %symexist(EM_USER_OUTMDLFILE)=0 %then %do;
    score file="D:\EM\EM_Projects\EM14.1\miguel\demo\Workspaces\EMWS1\HPDMForest2\OUTMDLFILE.bin" out=&hpfst_score_output;
  %end;
  %else %do;
    score file="&EM_USER_OUTMDLFILE" out=&hpfst_score_output;
  %end;
    PERFORMANCE  DETAILS;
  run;

 

Kinda puzzled here. Do you have a screenshot that suggests that HPForest is not honoring your metadata selections?

 

Thanks,

Miguel

Occasional Contributor
Posts: 13

Re: Variables in Random Forests in SAS EM

Hi Jason,

 

Thanks for the info! I will definitely be trying the variable selection in the RF node, since my company is running the latest SAS packages. I am actually not a fan of more complex models such as the RF, simply because they are difficult to explain (especially to my technical committee) and because the RF does not score in a simple data step the way the simpler models do (regression, neural net, etc.). However, in some cases, like the problem I have been dealing with now, it outperforms the other models by a significant margin.

 

The most important part I took away from your message is to go and look at the optimized code. I did manage to get the model to run in Base SAS by using a score node. The main reason for me posting this question was that I saw some of those variables I rejected coming into the variable transformations bit of the code - something you addressed in the reply.

 

Thanks for the help!

Occasional Contributor
Posts: 13

Re: Variables in Random Forests in SAS EM

Hi Miguel,

 

Thanks for the interest in my question. I did manage to get the code running in Base SAS / SAS EG by copying the OUTMDLFILE.bin to a certain location on our company's grid processor and referencing it in my code.

 

The reason why I asked the question that I did was: I applied some transformations to the variables before running variable selection / model nodes. When I extracted the code, I saw that some of these variables that I did not want in the model still appeared in the Base SAS code when applying the transformations. I am going to have a look at the Optimized code like Jason suggested, as I admit that might be where my problem is.

 

Even though I know exactly how RF models work from a theoretical/academic point of view, and I have been involved with some of them in R, I have no idea how the hp4score procedure in SAS works.

I am doing what any analyst should not do and trusting in the results that the Base SAS code gives me which I got from the E-miner Scoring node.

 

Thanks for the help!

SAS Employee
Posts: 122

Re: Variables in Random Forests in SAS EM

JakesVenter ,

You can try variable selection. Miguel reminded me of something. I am conducting some tests on the variable selection part. Let us know how it goes and we will update later. Thanks. Jason Xin
SAS Employee
Posts: 122

Re: Variables in Random Forests in SAS EM

JakesVenter,

Here is the update. If you turn on Variable Selection inside the HP Forest node, it will build an RF model and show you the variable importance. If you add another HP Forest node after this one, the new node will build an RF model on the selected variables only. The rejected variables are those whose variable importance from the previous HP Forest node is negative. You could also run just one HP Forest node, look up the variables with negative importance, manually reject them, and rerun the same node. I thought adding a second node helps productivity. Hope this helps. Have a nice weekend. Jason Xin
Occasional Contributor
Posts: 12

Re: Variables in Random Forests in SAS EM

Dear All,

I went through the whole conversation and understood that I can't simply extract the RF scoring code and run it directly in EG the way it's done with decision trees, regressions, etc.

All I need to know now is how I can extract the RF scoring code to run it in EG, or is this impossible?

 

Thanks in advance.

Mohammad ElSofany

Data Scientist 

SAS Employee
Posts: 122

Re: Variables in Random Forests in SAS EM

Hi,

Within EM, you can attach a Score node to the HP Forest node. Then, in the Score node, go to Results and use the View menu to get to where you normally pick up the SAS code. There you can pick up the RF scoring piece, which is actually batch code that saves the PROC HPFOREST modeling syntax (invoked by your EM node operation), saves the model information to a directory location (which you can copy to another location to facilitate your EG scoring), and includes the syntax that calls PROC HP4SCORE to conduct the RF scoring. You can copy out the whole batch code and deploy it in EG. There are macro variables in the scoring piece that allow you to specify the input and output files.

In another in-memory analytics product, IMSTAT, when you build an RF model using the RANDWOODS statement, you can opt to save the scoring piece as .sas code. When your RF is complicated, that .sas file can become big; it can easily exceed 100 MB.

HP EM instead captures the RF model information needed for scoring in a SAS proprietary binary file, which PROC HP4SCORE reads. The binary file is efficient and 'nimble'; the only catch is that you must have the HPDM product installed to run it. It is highly recommended that scoring RF models happens in some kind of in-memory fashion, which is not the typical 'way of life' for EG. Among the HPDM machine learning nodes and procedures, RF (HPFOREST) is the only one that requires a separate HP procedure to support scoring.

Hope this helps. Thanks for using SAS.

Jason Xin
Occasional Contributor
Posts: 13

Re: Variables in Random Forests in SAS EM

 Hi Mohammad

One of the SAS consultants at our company helped me to run the Random Forest in EG. When you extract the RF scoring code and open it in EG, look for the place in the code where it defines a macro called %em_hpfst_score. Inside this macro, the code expects a value for the macro variable &em_score_output, but if you just extract the code from EM, that variable is not populated by default. So all I did was define it on the line above the macro definition, by typing %let em_score_output = scoreset; where scoreset is the dataset I'm scoring the RF on. Just remember to put the part of the code that actually scores the data with the RF within a data step. I hope this makes sense. Let me know if it worked!
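
A minimal sketch of the step described above, in case it helps. The name scoreset is the one used in this post; work.scored is a hypothetical name added for illustration. In the generated macro posted elsewhere in this thread, both the scoring input and the scoring output default to &em_score_output, so this sketch works on a copy to avoid overwriting the original table. It assumes your extracted code follows the same pattern.

```sas
/* work.scored is a hypothetical name; scoreset is the table to be scored. */
/* Copy the data first, because the extracted code reads from and writes   */
/* to the same table (the one named in em_score_output).                   */
data work.scored;
    set scoreset;
run;

/* Define the macro variable before the %macro em_hpfst_score definition. */
%let em_score_output = work.scored;

/* ...the extracted EM score code goes below this line, unchanged... */
```

After the run, the predicted probabilities should be in work.scored, while scoreset stays as it was.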

Occasional Contributor
Posts: 12

Re: Variables in Random Forests in SAS EM

Thanks JakesVenter, 

 

The thing is that I can only see a short piece of code when I open the model node and get the code from it, or when I run the Score Code Export node. (In that node in particular there are two pieces of code: a very long one with all the variables, and the short one that I'm posting below.)

Please note that this small code that I'm displaying below is the same code I can find from the Model node itself.

 

Appreciate your advice.

 


Data work.myoutput;
Set scoreset;
*------------------------------------------------------------*;
* EM SCORE CODE;
* EM Version: 13.2;
* SAS Release: 9.04.01M2P072314;
* Host: SC-172-20-150-203;
* Encoding: utf-8;
* Locale: en_US;
* Project Path: /sasdata2/SAS-USERS/melsofany/MNP/MNP_Test;
* Project Name: MNP_Test;
* Diagram Id: EMWS1;
* Diagram Name: Imp_vars;
* Generated by: unxsrv;
* Date: 30AUG2016:16:13:19;
*------------------------------------------------------------*;
*------------------------------------------------------------*;
* TOOL: Input Data Source;
* TYPE: SAMPLE;
* NODE: Ids;
*------------------------------------------------------------*;
*------------------------------------------------------------*;
* TOOL: Partition Class;
* TYPE: SAMPLE;
* NODE: Part;
*------------------------------------------------------------*;
*------------------------------------------------------------*;
* TOOL: Extension Class;
* TYPE: MODEL;
* NODE: HPDMForest5;
*------------------------------------------------------------*;
%macro em_hpfst_score;

%if %symexist(hpfst_score_input)=0 %then %let hpfst_score_input=&em_score_output;
%if %symexist(hpfst_score_output)=0 %then %let hpfst_score_output=&em_score_output;
%if %symexist(hpfst_id_vars)=0 %then %let hpfst_id_vars = _ALL_;

%let hpvvn= %sysfunc(getoption(VALIDVARNAME));
options validvarname=V7;
proc hp4score data=&hpfst_score_input;
id &hpfst_id_vars;
%if %symexist(EM_USER_OUTMDLFILE)=0 %then %do;
score file="/sasdata2/SAS-USERS/melsofany/MNP/MNP_Test/MNP_Test/Workspaces/EMWS1/HPDMForest5/OUTMDLFILE.bin" out=&hpfst_score_output;
%end;
%else %do;
score file="&EM_USER_OUTMDLFILE" out=&hpfst_score_output;
%end;
PERFORMANCE DETAILS;
run;

options validvarname=&hpvvn;

data &hpfst_score_output;
set &hpfst_score_output;
%mend;

%em_hpfst_score;
*------------------------------------------------------------*;
*Computing Classification Vars: Target;
*------------------------------------------------------------*;
length _format200 $200;
drop _format200;
_format200= ' ' ;
_p_= 0 ;
drop _p_ ;
if P_Target1 - _p_ > 1e-8 then do ;
_p_= P_Target1 ;
_format200='1';
end;
if P_Target0 - _p_ > 1e-8 then do ;
_p_= P_Target0 ;
_format200='0';
end;
I_Target=dmnorm(_format200,32); ;
label U_Target = 'Unnormalized Into: Target';
if I_Target='1' then
U_Target=1;
if I_Target='0' then
U_Target=0;
data &em_score_output;
set &em_score_output;
*------------------------------------------------------------*;
* TOOL: Score Node;
* TYPE: ASSESS;
* NODE: Score;
*------------------------------------------------------------*;
*------------------------------------------------------------*;
* Score: Creating Fixed Names;
*------------------------------------------------------------*;
LABEL EM_EVENTPROBABILITY = 'Probability for level 1 of Target';
EM_EVENTPROBABILITY = P_Target1;
LABEL EM_PROBABILITY = 'Probability of Classification';
EM_PROBABILITY =
max(
P_Target1
,
P_Target0
);
LENGTH EM_CLASSIFICATION $%dmnorlen;
LABEL EM_CLASSIFICATION = "Prediction for Target";
EM_CLASSIFICATION = I_Target;
run;

Occasional Contributor
Posts: 13

Re: Variables in Random Forests in SAS EM

Hi Mohammad,

In addition to what I described in my previous message, there is something else that you must do in the code for the RF to work in EG. There should be a file called "OUTMDLFILE.bin" that is generated by the HPDM RF node in E-miner and saved in the folder where your EM project is saved. It should be under the "HPDMForest" folder, which is located under your "EMWS" folder inside the Workspaces folder. Now, in the SAS code, create a macro variable called path and let that macro variable reference this OUTMDLFILE.bin file. For example, something like %let path = "C:\Users\Desktop\OUTMDLFILE.bin"; (the path would look something like this if I copied the file to my desktop; the location doesn't matter, as long as you reference this file). Then go again to the place in the code where the macro em_hpfst_score is defined. Within that macro, reference the path macro variable at the place in the code that I highlighted in the picture attached to this message.
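
A possible alternative that avoids editing the macro body, offered as a sketch only: the generated macro posted earlier in this thread checks %symexist(EM_USER_OUTMDLFILE) and, when that macro variable exists, scores from "&EM_USER_OUTMDLFILE" instead of the hard-coded project path. The path below is a placeholder, and this assumes your extracted code contains the same check.

```sas
/* Placeholder path: point this at your own copy of OUTMDLFILE.bin.       */
/* No quotes around the value, because the generated code already wraps   */
/* &EM_USER_OUTMDLFILE in double quotes in the SCORE statement.           */
%let EM_USER_OUTMDLFILE = C:\Users\Desktop\OUTMDLFILE.bin;

/* Define this before the %em_hpfst_score call so that %symexist finds it. */
```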


Capture.PNG
☑ This topic is SOLVED.
