Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

Frequent Contributor
Posts: 115

What is the best and most efficient way to save a Tree Diagram in Enterprise Miner (EM) and apply it to 100% of the data for final results? I would like to keep my nodes as static as possible, with as little effort as possible.

I am starting out using an 80/20 split. This might move closer to 60/40, but we will see. On Monday I think we will have our model finalized. Then I would like it applied to 100% of my original data.

It would also help if EM generated code that can be used directly in Enterprise Guide (EG). Below is an example of some of the output generated by EM:

Node = 166
*------------------------------------------------------------*
if PURE_PREMIUM >= 5684.5 or MISSING
AND PAYROLL < 693492
AND HAZARD_CODE <= D
AND Business Unit IS ONE OF: 2, 3 or MISSING
AND BLEND_GROSS_LOAD2 >= 149 or MISSING
AND BLEND_GROSS_LOAD1 < 40.5 or MISSING
then
Tree Node Identifier   = 166
Number of Observations = 236
Predicted: D_GROSS_LOADED_WITH_TREND=1 = 0.54
Predicted: D_GROSS_LOADED_WITH_TREND=0 = 0.46

That does not help me much. I would prefer output that can be used directly in standard SAS code. For example: if (PURE_PREMIUM >= 5684.5) or (PURE_PREMIUM = .) then ...;

Perhaps I am missing the option to create this code within EM. Thank you.
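
As an illustration of what such hand-written code could look like, here is a minimal sketch that flags rows using only the four numeric conditions from the node 166 rule quoted above. The dataset names work.mydata and work.node166_check and the flag in_node166_numeric are hypothetical. The HAZARD_CODE and Business Unit conditions are left out because their variable names and types are not clear from the printed rule, so the flag marks a superset of the node rather than the node itself. Note that SAS treats a numeric missing value as smaller than any number, so a "less than" condition printed without "or MISSING" needs an explicit missing() check.

```sas
/* Sketch only: dataset and flag names are hypothetical. */
data work.node166_check;
  set work.mydata;   /* assumed input with the same variables as the model */

  /* 1 when all four numeric conditions of the printed rule hold, else 0. */
  in_node166_numeric =
        (PURE_PREMIUM >= 5684.5 or missing(PURE_PREMIUM))
    and (not missing(PAYROLL) and PAYROLL < 693492)   /* rule has no "or MISSING" here */
    and (BLEND_GROSS_LOAD2 >= 149 or missing(BLEND_GROSS_LOAD2))
    and (BLEND_GROSS_LOAD1 < 40.5 or missing(BLEND_GROSS_LOAD1));
run;
```

Translating every leaf this way is error-prone, which is the main argument for using score code generated by EM instead.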


All Replies
Solution
11-26-2014 09:42 PM
Super User
Posts: 19,862

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

EM does generate full score code that can be used in EG.

I believe there is a score code node that generates the code. What version of EM are you on?

Frequent Contributor
Posts: 115

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

I am on 13.1, and yes I found the score code node. Thank you very much.

Do you recommend I use the Optimized SAS Code or just the regular SAS Code? What are the differences?

Also, there are a lot of code statements in there that I am not familiar with. Do you suggest I just have my code point to my data file/library then just run all of this code that was generated? Or do you recommend anything else? Below are some of the initial statements in my code - not making a lot of sense to me right now:

_ARBFMT_12 = PUT( BU , BEST12.);
%DMNORMIP( _ARBFMT_12);
IF _ARBFMT_12 IN ('1' ) THEN DO;
  IF  NOT MISSING(PURE_PREMIUM ) AND
                  5112.5 <= PURE_PREMIUM  THEN DO;
    _ARBFMT_12 = PUT( SMNQ_D_POST_CODE , BEST12.);
    %DMNORMIP( _ARBFMT_12);
    IF _ARBFMT_12 IN ('3' ) THEN DO;
      _NODE_  =                   81;
      _LEAF_  =                   21;

May I ask for an example of how the Score Code node is incorporated with EG? I am a little nervous because I plan on applying this code to a new dataset with the same variables. I am much more accustomed to simple code.

Thank you again.

Super User
Posts: 19,862

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

It shouldn't matter which version of the code you use.

Some of the statements at the top are transformations that may have occurred in various steps of the analysis.

To use this in EG, create a program as follows. That should do what you want.

data Score;
  set <your data>;
  <insert code from Enterprise Miner>
run;
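
Instead of pasting the generated code, the template above can also pull it in from a file. The following is a minimal sketch, assuming the score code was saved to a .sas file; the paths, the library mydata, the dataset full_data, and the fileref scorecd are all hypothetical. The check at the end assumes the score code leaves _NODE_ in the output, as the excerpt earlier in the thread suggests.

```sas
/* Sketch only: paths, library, dataset, and fileref names are hypothetical. */
libname  mydata  "/path/to/data";             /* library holding 100% of the data   */
filename scorecd "/path/to/em_score_code.sas"; /* score code saved from EM           */

data work.scored;
  set mydata.full_data;         /* must contain the variables the model expects */
  %include scorecd / source2;   /* inserts the EM DATA step statements here;    */
                                /* source2 echoes them to the log               */
run;

/* Quick sanity check: how many rows landed in each tree node. */
proc freq data=work.scored;
  tables _NODE_;
run;
```

Keeping the generated code in its own file means it can be replaced after a model refresh without touching the surrounding program.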

Frequent Contributor
Posts: 115

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

Thank you again for all of the valuable advice.

But I have a huge problem. My diagram is rather simple. I have a data node, a data partition node, then a decision tree node. One of my colleagues and I were "interactively" changing the results of the decision tree node to bring them more in line with what our results should substantively say. Ultimately we did it using the interactive feature of the decision tree node, then we closed it. We changed a few more things and re-ran the decision tree node to more or less start over. But it also re-ran the data partition, and we cannot figure out why.

Before I fully implement the code generator I would like to somehow lock down the other nodes. Is there a way in EM to ensure that nothing will change?

Super Contributor
Posts: 337

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

Enterprise Miner nodes re-run only the parts that need to be re-run. For example, if you change a property under the Report section, only the code that is involved in reporting will re-run.

In your example, if you don't change any properties on the Data Partition node, the green wheel might make it look like it's running, but it is just checking that some tables or results exist. It is not re-running the whole thing.

I would need to double check, but I think that if you are using Interactive mode to grow your tree, you have to save it, close it, and not change any property afterwards. If you are going to re-run things, I would suggest setting the Use Frozen Tree property to Yes.

As a bonus, challenge your interactive tree with some other trees. I would use the following and compare their subtree assessment plots and their fit statistics with a Model Comparison node:

  • Largest tree just to confirm that the Largest is an overtrained model.
  • Default tree (maxdepth 6)
  • Tree with maximum depth 10

Good luck,

Miguel

SAS Employee
Posts: 106

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

Zachary,

Also check that the Rerun property is set to No for your datasource node. If it is set to Yes, that would explain why the Partition node is running each time you execute the flow.

Ray

Frequent Contributor
Posts: 115

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

It was already set to No by default. Thank you for the suggestion there.

Super Contributor
Posts: 337

Re: Best Way to Finalize a Model Using 100% Data After 80/20 Training/Validation Split?

You got what you needed, good to go? How does your tree beat a default tree?
