SGhosh
Fluorite | Level 6

 

 

It would be great if I could get any help in understanding these questions (at least the 1st question). I really do have a very tight timeline for a project and I am new to this.  Any help would be really appreciated.

 

  1. In the Data Partition node, what do these variables determine?
    1. Training
    2. Validation
    3. Test

 

Their default values are 40%, 30%, 30%.

I looked at their definitions in the EMiner 14.1 Help, but the description is not very detailed.

I am trying to understand:

                                How is it impacting the lift? [I changed to 20/40/40 and see a difference, but don't understand the logic behind it.]

                                How is it impacting the incremental model response?

 

 2. (Related to the above question) From the incremental response model diagnostics, how do I identify the variables that could be used in the model, since it shows Train, Validation, and Test? Image attached (IRMD_graph).

     How could these variables be used for the model's variable selection? Image attached (IRMD_table)

 

Thanks a lot in advance

-Soma

8 REPLIES
DougWielenga
SAS Employee

Data mining often involves extremely large data sets (many rows, many columns), and the Data Partition node in SAS Enterprise Miner allows an analyst to break the overall data into separate portions, each representative of the whole. This is documented in the application help for SAS Enterprise Miner, which can be accessed by opening the application and then clicking Help --> Contents. From there, navigate in the panel on the left to

 

Node Reference

     Sample

            Data Partition Node

 

and then navigate in the panel on the right to 'Overview of the Data Partition Node' where you find the following:

 

/*** BEGIN EXCERPT ***/

 

Most data mining projects utilize large volumes of sampled data. After sampling, the data is usually partitioned before modeling. Use the Data Partition node to partition your input data into one of the following data sets:

   • Train is used for preliminary model fitting. The analyst attempts to find the best model weights using this data set.

   • Validation is used to assess the adequacy of the model in the Model Comparison node.

 

The validation data set is also used for model fine-tuning in the following nodes:

   • Decision Tree node — to create the best subtree.

   • Neural Network node — to choose among network architectures or for the early-stopping of the training algorithm.

   • Regression node — to choose a final subset of predictors from all the subsets computed during stepwise regression.

Test is used to obtain a final, unbiased estimate of the generalization error of the model.

 

/*** END EXCERPT ***/

 

In short, Train is intended for building candidate models, Validate for comparing candidate models, and Test for a final unbiased estimate of the chosen model's performance. SAS Enterprise Miner automatically stratifies the split on any binary target (when present), and you can specify other variables to include in your stratification.

 

To answer your specific Data Partition questions....

 

          How is it impacting the lift? [I changed to 20/40/40 and see a difference, but don't understand the logic behind it.]

          How is it impacting the incremental model response?

 

   As long as each of the partitions is representative, the choice of percentages should not have a meaningful impact on projected lift or model performance.  In the case of rare events, you might need to be careful to stratify on the target variable and check whether you have enough observations in each partition to allow for meaningful insights.
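Outside of Enterprise Miner, what stratified partitioning does can be sketched in a few lines of Python. This is a hypothetical illustration (the node's internal sampling is more sophisticated, and the function and field names here are invented): a 40/30/30 split performed within each target level so every partition keeps roughly the overall event rate.

```python
import random

def stratified_partition(rows, target, fractions=(0.40, 0.30, 0.30), seed=42):
    """Split rows into train/validate/test, stratifying on a binary target
    so each partition keeps roughly the overall event rate."""
    rng = random.Random(seed)
    parts = {"train": [], "validate": [], "test": []}
    # Split each target level separately, then recombine.
    for level in {r[target] for r in rows}:
        stratum = [r for r in rows if r[target] == level]
        rng.shuffle(stratum)
        n = len(stratum)
        n_train = int(round(fractions[0] * n))
        n_valid = int(round(fractions[1] * n))
        parts["train"].extend(stratum[:n_train])
        parts["validate"].extend(stratum[n_train:n_train + n_valid])
        parts["test"].extend(stratum[n_train + n_valid:])
    return parts

# Tiny demo: 1,000 rows with a 10% event rate.
data = [{"id": i, "resp": 1 if i % 10 == 0 else 0} for i in range(1000)]
parts = stratified_partition(data, "resp")
for name, rows in parts.items():
    rate = sum(r["resp"] for r in rows) / len(rows)
    print(name, len(rows), round(rate, 3))
```

With stratification, each partition shows the same 10% event rate, which is exactly why the percentage choice matters less than representativeness.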

 

Regarding the Incremental Response Model Diagnostics chart, the plot simply shows performance on each of the three data sets -- Train, Validate, and Test.  Remember that candidate models are typically built on the Train data set, a model is chosen based on the Validate data set, and a final estimate of model performance comes from applying it to the Test data.   If performance varies greatly across these data sets, you might have reason for concern that your data has not been partitioned into representative subsets (perhaps too few events in one or more partitions, for instance).

 

If you are asking about variable importance, you can find more detail in the "Selected Variables by NIV" table where NIV is an acronym for "Net Information Value".   To access the Incremental Response Node help, open the application and click on

 

      Help --> Contents

 

and then navigate in the panel on the left to

 

Node Reference 

      Applications

             Incremental Response Node

 

and then navigate in the panel on the right to 

 

    Incremental Response Node Results

 

where you find the following near the bottom:

 

/*** BEGIN HELP EXCERPT ***/

 

When you set the Prescreen Variables property of the Incremental Response node to Yes before running, the Selected Variables Table displays the top 50% of input variables ranked by NIV score. The NIV score indicates the variables that have the strongest correlation to model responses. The net information value is calculated as the difference in information values between the treatment group and control group for each input variable. The proportion of variables that is selected for inclusion in the table by NIV ranking can be specified in the Rank Percentage Cutoff property for the node.

 

/*** END HELP EXCERPT ***/
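To make the excerpt concrete, here is a rough Python sketch of the calculation as the help describes it: an information value (IV) computed per group from binned event/non-event counts, with NIV as the treatment-minus-control difference. The counts below are hypothetical, and the exact formula SAS uses (binning, smoothing) may differ in detail; this only illustrates the idea.

```python
import math

def information_value(bin_counts):
    """IV for one variable; bin_counts maps bin -> (n_events, n_nonevents)."""
    tot_e = sum(e for e, ne in bin_counts.values())
    tot_ne = sum(ne for e, ne in bin_counts.values())
    iv = 0.0
    for e, ne in bin_counts.values():
        pe, pne = e / tot_e, ne / tot_ne
        if pe > 0 and pne > 0:                       # skip empty cells rather than smooth
            iv += (pe - pne) * math.log(pe / pne)    # weight-of-evidence term
    return iv

# Hypothetical counts for one input, binned into low/high:
treatment = {"low": (30, 170), "high": (70, 130)}
control   = {"low": (20, 180), "high": (30, 170)}

# Net information value: difference in IV between treatment and control.
niv = information_value(treatment) - information_value(control)
print(round(niv, 4))
```

A variable whose relationship to the response differs strongly between the treatment and control groups gets a large NIV, which is why it ranks high for an incremental (uplift) model even if its plain IV is modest.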

 

Hope this helps!

Doug

SGhosh
Fluorite | Level 6

Yes, this does help a lot. In fact, I was just working out how NIV is used in variable selection. I still have a little confusion about how the variables are determined to be the strongest ones.

Does the data partition also play a role in changing the variable selection?

 

Thanks a lot!

 

DougWielenga
SAS Employee

Does the data partition also play a role in changing the variable selection?

 

The Data Partition node only partitions data, so it only changes which observations end up in which partition.  This is unlikely to change the identification of relatively important variables as long as each partition has enough data to represent the population in a meaningful way. Variables that carry less information (or whose information is available through some combination of other variables) might differ slightly based on the exact observations assigned to each partition.  Variables that are not chosen are not necessarily 'non-informative', since they might have been excluded only because other variables with similar information were slightly more informative.

 

Hope this helps!

Doug

SGhosh
Fluorite | Level 6

This really does clarify most of my questions and concerns.

Just one more question, please. After following the requirements as closely as possible, my lift still looks poor. It's not a good model, as you can see in the attached image (within an Excel file). Does this signify that I don't have properly correlated/linked variables, or is there something I could do to improve this to an acceptable model?

 

Thanks so much

Soma

 

DougWielenga
SAS Employee

After following the requirements as closely as possible, my lift still looks poor. It's not a good model, as you can see in the attached image (within an Excel file). Does this signify that I don't have properly correlated/linked variables, or is there something I could do to improve this to an acceptable model?

 

Lift is a tricky thing because you can only get dramatic-sounding values in extremely rare event scenarios.  For example, if the overall average response rate is 10%, the maximum lift = 100% / 10% = 10.   Likewise, we have the following:

 

Overall Rate    Max Lift

   20%              5
    5%             20
    1%            100
  0.1%          1,000
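The pattern in that table is just the reciprocal of the overall rate: the best any model can do is concentrate responders so a bin is 100% events, so maximum lift = 1 / overall rate. A quick check:

```python
# Maximum achievable lift is the reciprocal of the overall response rate.
for rate in (0.20, 0.05, 0.01, 0.001):
    print(f"{rate:>6.1%} -> max lift {1 / rate:g}")
```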

 

As a result, you cannot compare lift across different modeling problems.  Additionally, there are many possible scenarios that could result in a poor-performing model, including but not limited to the following:

   *  the modeling techniques have insufficient tuning

   *  the data does not sufficiently support the desired modeling technique

   *  the data has not been fully prepared (e.g., often you can calculate new variables that help better model a relationship)

   *  there is not a strong relationship between the variables and the target

 

I cannot tell from looking at the results plots what could be done to improve the performance, but I would agree that it doesn't appear to be providing much lift at present.  

SGhosh
Fluorite | Level 6

Thanks so much again for your response!

After your response to my other question in the same forum on Friday, I used some numeric variables directly, and the result is better now compared to what I had posted.

On one point, though, I am still stuck. As you mentioned:

there is not a strong relationship between the variables and the target.

I am noticing the same issue but am not sure how to fix it. It would be really helpful if you could elaborate on this part a little more.

Also, I was trying to understand the resultant variable -- how is it linked between Train and Validate?

 

Thanks a ton !

Soma

DougWielenga
SAS Employee

As you mentioned: there is not a strong relationship between the variables and the target.

I am noticing the same issue but am not sure how to fix it.

 

This could be due to many possible factors:

    * the information in the input data has not been fully utilized

    * the model does not have sufficient flexibility to model the relationship well

    * there is insufficient data to identify a more meaningful model

    * there really isn't any relationship to model in the data to begin with

 

You have some things you could do to try and address some of these issues:

    * data not fully utilized?  --> Make sure you have looked at several ways to get information from your input variables.  

Consider the following:

       - binning interval variables, since binned values allow for non-linear relationships

       - transforming categorical inputs with many trivial levels

       - extracting aspects of timestamp data (day of week, month, year, lagged time values, time since first event, time between events, etc...) 

       - using several different variable selection techniques to identify potentially useful variables 

 

    * model not sufficiently flexible?  --> Model your treatment group and control group separately, score all observations with both models, and then build a model on the differences using more flexible modeling techniques in SAS Enterprise Miner (I have not tried this, so there might be some challenges in doing so, but it should have some efficacy)

  

    *  insufficient data? --> try and get more data!

 

    *  no relationship?  --> you are probably out of luck here.
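The "model the groups separately" idea above can be sketched outside Enterprise Miner. This hypothetical Python example (all names invented) uses simple segment-level response rates as the two "models" -- any classifier would do -- then scores each segment under both and takes the difference as the estimated incremental (uplift) response.

```python
from collections import defaultdict

def rate_model(rows, segment_key, target):
    """Toy 'model': mean response per segment."""
    counts = defaultdict(lambda: [0, 0])        # segment -> [events, n]
    for r in rows:
        counts[r[segment_key]][0] += r[target]
        counts[r[segment_key]][1] += 1
    return {seg: e / n for seg, (e, n) in counts.items()}

# Hypothetical campaign data: treat = 1 if the member received the letter.
rows = [
    {"age_grp": "55-64", "treat": 1, "resp": 1},
    {"age_grp": "55-64", "treat": 1, "resp": 0},
    {"age_grp": "55-64", "treat": 0, "resp": 0},
    {"age_grp": "45-54", "treat": 1, "resp": 0},
    {"age_grp": "45-54", "treat": 0, "resp": 0},
    {"age_grp": "45-54", "treat": 0, "resp": 1},
]

# Fit one model per group, then score the difference.
m_treat = rate_model([r for r in rows if r["treat"] == 1], "age_grp", "resp")
m_ctrl  = rate_model([r for r in rows if r["treat"] == 0], "age_grp", "resp")

# Uplift score: predicted response if treated minus if not treated.
for seg in sorted(m_treat):
    print(seg, round(m_treat[seg] - m_ctrl.get(seg, 0.0), 3))
```

In practice you would replace `rate_model` with a flexible classifier fit on each group; the point is only the two-model scoring pattern.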

 

I was trying to understand the resultant variable -- how is it linked between Train and Validate?

 

I'm not sure what variable you are talking about.   The training data is used to train/fit the model, while the validation data is used to compare candidate models fit to the training data.  The Incremental Response node fits a separate model to the treatment group and the control group.  You can also force a single model that includes the variable used to define treatment/control groups as a binary predictor.   

 

Hope this helps!

Doug

SGhosh
Fluorite | Level 6

Thanks so much. This provided me sufficient information to use the variables efficiently and find significant lift from some of the variables.

Just one final thing now: using the SAS code that creates certain EM_ variables, is there a way to know the significance of each variable? I see in the output window that the most influential variables have the largest Wald chi-square values. However, I am trying to figure out how these variables are influencing the model. For example, I did this manual calculation and here is the outcome:

member_age_grp    reg_rate_cntl    reg_rate_test    Lift

18-24             4.3%             4.9%             13.5%
25-34             5.4%             5.9%             8.6%
35-44             4.5%             4.6%             1.6%
45-54             3.6%             3.6%             (0.0%)
55-64             2.5%             3.6%             41.7%

 

•Members who are 55-64 years old show significant lift in registration rates (41.7%) from receiving the letter. Members between 45-54 don't show any lift from receiving the letter, and this could be because of a low sample size.
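Assuming the Lift column in a table like the one above is relative lift, (test rate - control rate) / control rate, it can be reproduced with a few lines of Python; small discrepancies against hand-computed percentages come from rounding the underlying counts before taking the ratio.

```python
# Hypothetical rounded rates per age group: (control rate, test rate).
rates = {
    "18-24": (0.043, 0.049),
    "55-64": (0.025, 0.036),
}

# Relative lift of the test (treated) group over the control group.
for grp, (cntl, test) in rates.items():
    lift = (test - cntl) / cntl
    print(f"{grp}: {lift:.1%}")
```

With these rounded inputs, 55-64 comes out near 44% rather than the 41.7% computed from the unrounded counts, which is the kind of rounding gap to expect.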

 

But how could I figure out this lift from SAS EMiner? Can something like this (anything that shows how a variable is influencing the model or the lift) be obtained through the EM_ variables?

 

Thanks a lot again!

 

Soma
