SGhosh
Fluorite | Level 6

 

 

It would be great if I could get any help in understanding these questions (at least the 1st question). I really do have a very tight timeline for a project and I am new to this.  Any help would be really appreciated.

 

  1. In the Data Partition node, what do these variables determine?
    1. Training
    2. Validation
    3. Test

 

Their default values are 40%, 30%, 30%.

I looked at their definitions in the EMiner 14.1 Help, but the description is not very detailed.

I am trying to understand:

                                How is it impacting the lift? [I changed to 20/40/40 and see a difference, but don't understand the logic behind it.]

                                How is it impacting the incremental model response?

 

 2. (Related to the above question) From the incremental response model diagnostics, how do I identify the variables that could be used in the model, since it shows Train, Validation, and Test? Image attached (IRMD_graph).

     How could these variables be used for the model's variable selection? Image attached (IRMD_table)

 

Thanks a lot in advance

-Soma

8 REPLIES
DougWielenga
SAS Employee

Data mining often involves extremely large data sets (many rows, many columns), and the Data Partition node in SAS Enterprise Miner allows an analyst to break the overall data into separate portions, each representative of the whole. This is documented in the application help for SAS Enterprise Miner, which can be accessed by opening the application and then clicking Help --> Contents. From there, navigate in the panel on the left to

 

Node Reference

     Sample

            Data Partition Node

 

and then navigate in the panel on the right to 'Overview of the Data Partition Node' where you find the following:

 

/*** BEGIN EXCERPT ***/

 

Most data mining projects utilize large volumes of sampled data. After sampling, the data is usually partitioned before modeling. Use the Data Partition node to partition your input data into one of the following data sets:

   • Train is used for preliminary model fitting. The analyst attempts to find the best model weights using this data set.

   • Validation is used to assess the adequacy of the model in the Model Comparison node.

 

The validation data set is also used for model fine-tuning in the following nodes:

   • Decision Tree node — to create the best subtree.

   • Neural Network node — to choose among network architectures or for the early-stopping of the training algorithm.

   • Regression node — to choose a final subset of predictors from all the subsets computed during stepwise regression.

Test is used to obtain a final, unbiased estimate of the generalization error of the model.

 

/*** END EXCERPT ***/

 

In short, Train is intended for building candidate models, Validate for comparing candidate models, and Test for a final unbiased estimate of the chosen model's performance. SAS Enterprise Miner automatically stratifies the split on any binary target (when present), and you can specify other variables to include in your stratification.

 

To answer your specific Data Partition questions....

 

          How is it impacting the lift? [I changed to 20/40/40 and see a difference, but don't understand the logic behind it.]

          How is it impacting the incremental model response?

 

   As long as each of the partitions is representative, the choice of percentages should not have a meaningful impact on projected lift or model performance.  In the case of rare events, you might need to be careful to stratify on the target variable and check whether you have enough observations in each partition to allow for meaningful insights.
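Outside of Enterprise Miner, what stratified partitioning does can be sketched in a few lines of Python. This is a hypothetical illustration (the node's internal sampling is more sophisticated, and the function and field names here are invented): a 40/30/30 split performed within each target level so every partition keeps roughly the overall event rate.

```python
import random

def stratified_partition(rows, target, fractions=(0.40, 0.30, 0.30), seed=42):
    """Split rows into train/validate/test, stratifying on a binary target
    so each partition keeps roughly the overall event rate."""
    rng = random.Random(seed)
    parts = {"train": [], "validate": [], "test": []}
    # Split each target level separately, then recombine.
    for level in {r[target] for r in rows}:
        stratum = [r for r in rows if r[target] == level]
        rng.shuffle(stratum)
        n = len(stratum)
        n_train = int(round(fractions[0] * n))
        n_valid = int(round(fractions[1] * n))
        parts["train"].extend(stratum[:n_train])
        parts["validate"].extend(stratum[n_train:n_train + n_valid])
        parts["test"].extend(stratum[n_train + n_valid:])
    return parts

# Tiny demo: 1,000 rows with a 10% event rate.
data = [{"id": i, "resp": 1 if i % 10 == 0 else 0} for i in range(1000)]
parts = stratified_partition(data, "resp")
for name, rows in parts.items():
    rate = sum(r["resp"] for r in rows) / len(rows)
    print(name, len(rows), round(rate, 3))
```

With stratification, each partition shows the same 10% event rate, which is exactly why the percentage choice matters less than representativeness.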

 

Regarding the Incremental Response Model Diagnostics chart, the plot simply shows performance on each of the three data sets -- Train, Validate, and Test.  Remember that candidate models are typically built on the Train data set, a model is chosen based on the Validate data set, and a final estimate of model performance comes from applying it to the Test data.   If performance varies greatly across these data sets, you might have reason for concern that your data has not been partitioned into representative subsets (perhaps too few events in one or more partitions, for instance).

 

If you are asking about variable importance, you can find more detail in the "Selected Variables by NIV" table where NIV is an acronym for "Net Information Value".   To access the Incremental Response Node help, open the application and click on

 

      Help --> Contents

 

and then navigate in the panel on the left to

 

Node Reference 

      Applications

             Incremental Response Node

 

and then navigate in the panel on the right to 

 

    Incremental Response Node Results

 

where you find the following near the bottom:

 

/*** BEGIN HELP EXCERPT ***/

 

When you set the Prescreen Variables property of the Incremental Response node to Yes before running, the Selected Variables Table displays the top 50% of input variables ranked by NIV score. The NIV score indicates the variables that have the strongest correlation to model responses. The net information value is calculated as the difference in information values between the treatment group and control group for each input variable. The proportion of variables that is selected for inclusion in the table by NIV ranking can be specified in the Rank Percentage Cutoff property for the node.

 

/*** END HELP EXCERPT ***/
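To make the excerpt concrete, here is a rough Python sketch of the calculation as the help describes it: an information value (IV) computed per group from binned event/non-event counts, with NIV as the treatment-minus-control difference. The counts below are hypothetical, and the exact formula SAS uses (binning, smoothing) may differ in detail; this only illustrates the idea.

```python
import math

def information_value(bin_counts):
    """IV for one variable; bin_counts maps bin -> (n_events, n_nonevents)."""
    tot_e = sum(e for e, ne in bin_counts.values())
    tot_ne = sum(ne for e, ne in bin_counts.values())
    iv = 0.0
    for e, ne in bin_counts.values():
        pe, pne = e / tot_e, ne / tot_ne
        if pe > 0 and pne > 0:                       # skip empty cells rather than smooth
            iv += (pe - pne) * math.log(pe / pne)    # weight-of-evidence term
    return iv

# Hypothetical counts for one input, binned into low/high:
treatment = {"low": (30, 170), "high": (70, 130)}
control   = {"low": (20, 180), "high": (30, 170)}

# Net information value: difference in IV between treatment and control.
niv = information_value(treatment) - information_value(control)
print(round(niv, 4))
```

A variable whose relationship to the response differs strongly between the treatment and control groups gets a large NIV, which is why it ranks high for an incremental (uplift) model even if its plain IV is modest.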

 

Hope this helps!

Doug

SGhosh
Fluorite | Level 6

Yes, this does help a lot. In fact, I was just working out how NIV is used in variable selection. I still have a little confusion about how the variables are determined to be the strongest ones.

Does the data partition also play a role in changing the variable selection?

 

Thanks a lot!

 

DougWielenga
SAS Employee

Does the data partition also play a role in changing the variable selection?

 

The Data Partition node only partitions data, so it only changes which observations end up in which partition.  This is unlikely to change the identification of relatively important variables as long as each partition has enough data to represent the population in a meaningful way. Variables that carry less information (or whose information is available through some combination of other variables) might differ slightly based on the exact observations assigned to each partition.  Variables that are not chosen are not necessarily 'non-informative', since they might have been excluded only because other variables with similar information were slightly more informative.

 

Hope this helps!

Doug

SGhosh
Fluorite | Level 6

This really does clarify most of my questions and concerns.

Just one more question, please. After following the requirements as closely as possible, my lift still looks poor. It's not a good model, as you can see in the attached image (within an Excel file). Does this signify that I don't have properly correlated/linked variables, or is there something I could do to improve this to an acceptable model?

 

Thanks so much

Soma

 

DougWielenga
SAS Employee

After following the requirements as closely as possible, my lift still looks poor. It's not a good model, as you can see in the attached image (within an Excel file). Does this signify that I don't have properly correlated/linked variables, or is there something I could do to improve this to an acceptable model?

 

Lift is a tricky thing because you can only get dramatic-sounding values in extremely rare event scenarios.  For example, if the overall average response rate is 10%, the maximum lift = 100% / 10% = 10.   Likewise, we have the following:

 

Overall Rate    Max Lift

   20%              5
    5%             20
    1%            100
  0.1%          1,000
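The pattern in that table is just the reciprocal of the overall rate: the best any model can do is concentrate responders so a bin is 100% events, so maximum lift = 1 / overall rate. A quick check:

```python
# Maximum achievable lift is the reciprocal of the overall response rate.
for rate in (0.20, 0.05, 0.01, 0.001):
    print(f"{rate:>6.1%} -> max lift {1 / rate:g}")
```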

 

As a result, you cannot compare lift across different modeling problems.  Additionally, there are many possible scenarios that could result in a poor-performing model, including but not limited to the following:

   *  the modeling techniques have insufficient tuning

   *  the data does not sufficiently support the desired modeling technique

   *  the data has not been fully prepared (e.g., often you can calculate new variables that help better model a relationship)

   *  there is not a strong relationship between the variables and the target

 

I cannot tell from looking at the results plots what could be done to improve the performance, but I would agree that it doesn't appear to be providing much lift at present.  

SGhosh
Fluorite | Level 6

Thanks so much again for your response!

After your response to my other question in the same forum on Friday, I used some numeric variables directly, and the result is better now compared to what I had posted.

On one point, though, I am still stuck. As you mentioned:

there is not a strong relationship between the variables and the target.

I am noticing the same issue but am not sure how to fix it. It would be really helpful if you could elaborate on this part a little more.

Also, I was trying to understand the resultant variable -- how is it linked between Train and Validate?

 

Thanks a ton !

Soma

DougWielenga
SAS Employee

As you mentioned: there is not a strong relationship between the variables and the target.

I am noticing the same issue but am not sure how to fix it.

 

This could be due to many possible factors:

    * the information in the input data has not been fully utilized

    * the model does not have sufficient flexibility to model the relationship well

    * there is insufficient data to identify a more meaningful model

    * there really isn't any relationship to model in the data to begin with

 

You have some things you could do to try and address some of these issues:

    * data not fully utilized?  --> Make sure you have looked at several ways to get information from your input variables.  

Consider the following:

       - binning interval variables, since binned values allow for non-linear relationships

       - transforming categorical inputs with many trivial levels

       - extracting aspects of timestamp data (day of week, month, year, lagged time values, time since first event, time between events, etc...) 

       - using several different variable selection techniques to identify potentially useful variables 

 

    * model not sufficiently flexible?  --> Model your treatment group and control group separately, score all observations with both models, and then build a model on the differences using more flexible modeling techniques in SAS Enterprise Miner (I have not tried this, so there might be some challenges in doing so, but it should have some efficacy)

  

    *  insufficient data? --> try and get more data!

 

    *  no relationship?  --> you are probably out of luck here.
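The "model the groups separately" idea above can be sketched outside Enterprise Miner. This hypothetical Python example (all names invented) uses simple segment-level response rates as the two "models" -- any classifier would do -- then scores each segment under both and takes the difference as the estimated incremental (uplift) response.

```python
from collections import defaultdict

def rate_model(rows, segment_key, target):
    """Toy 'model': mean response per segment."""
    counts = defaultdict(lambda: [0, 0])        # segment -> [events, n]
    for r in rows:
        counts[r[segment_key]][0] += r[target]
        counts[r[segment_key]][1] += 1
    return {seg: e / n for seg, (e, n) in counts.items()}

# Hypothetical campaign data: treat = 1 if the member received the letter.
rows = [
    {"age_grp": "55-64", "treat": 1, "resp": 1},
    {"age_grp": "55-64", "treat": 1, "resp": 0},
    {"age_grp": "55-64", "treat": 0, "resp": 0},
    {"age_grp": "45-54", "treat": 1, "resp": 0},
    {"age_grp": "45-54", "treat": 0, "resp": 0},
    {"age_grp": "45-54", "treat": 0, "resp": 1},
]

# Fit one model per group, then score the difference.
m_treat = rate_model([r for r in rows if r["treat"] == 1], "age_grp", "resp")
m_ctrl  = rate_model([r for r in rows if r["treat"] == 0], "age_grp", "resp")

# Uplift score: predicted response if treated minus if not treated.
for seg in sorted(m_treat):
    print(seg, round(m_treat[seg] - m_ctrl.get(seg, 0.0), 3))
```

In practice you would replace `rate_model` with a flexible classifier fit on each group; the point is only the two-model scoring pattern.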

 

I was trying to understand the resultant variable -- how is it linked between Train and Validate?

 

I'm not sure what variable you are talking about.   The training data is used to train/fit the model, while the validation data is used to compare candidate models fit to the training data.  The Incremental Response node fits a separate model to the treatment group and the control group.  You can also force a single model that includes the variable used to define treatment/control groups as a binary predictor.   

 

Hope this helps!

Doug

SGhosh
Fluorite | Level 6

Thanks so much. This provided me sufficient information to use the variables efficiently and find significant lift from some of the variables.

Just one final thing now: using the SAS code that creates certain EM_ variables, is there a way to know the significance of each variable? I see in the output window that the most influential variables have the largest Wald chi-square values. However, I am trying to figure out how these variables are influencing the model. For example, I did this manual calculation and here is the outcome:

member_age_grp    reg_rate_cntl    reg_rate_test    Lift

18-24             4.3%             4.9%             13.5%
25-34             5.4%             5.9%             8.6%
35-44             4.5%             4.6%             1.6%
45-54             3.6%             3.6%             (0.0%)
55-64             2.5%             3.6%             41.7%

 

•Members who are 55-64 years old show significant lift in registration rates (41.7%) from receiving the letter. Members between 45-54 don't show any lift from receiving the letter, and this could be because of a low sample size.
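Assuming the Lift column in a table like the one above is relative lift, (test rate - control rate) / control rate, it can be reproduced with a few lines of Python; small discrepancies against hand-computed percentages come from rounding the underlying counts before taking the ratio.

```python
# Hypothetical rounded rates per age group: (control rate, test rate).
rates = {
    "18-24": (0.043, 0.049),
    "55-64": (0.025, 0.036),
}

# Relative lift of the test (treated) group over the control group.
for grp, (cntl, test) in rates.items():
    lift = (test - cntl) / cntl
    print(f"{grp}: {lift:.1%}")
```

With these rounded inputs, 55-64 comes out near 44% rather than the 41.7% computed from the unrounded counts, which is the kind of rounding gap to expect.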

 

But how could I figure out this lift from SAS EMiner? Can something like this (anything that shows how a variable is influencing the model or the lift) be obtained through the EM_ variables?

 

Thanks a lot again!

 

Soma
