Problem with opposite effects in scorecard

04-22-2016 04:47 AM

Hi

I am using SAS Enterprise Miner 13.2 with the Credit Scoring to build a prediction model for the usage of credit cards.

I suspect a problem with collinearity in my input data, as I always end up with at least one positive effect while the rest are negative. Depending on which criteria and variables I choose to include, a different variable may be affected in each setting, and the same variable might have a positive effect in some settings and a negative one in others.

What is a good strategy to avoid this problem?

It is very difficult to explain each variable on its own when one of them has the opposite effect.

Do I risk losing valuable information by excluding the variable?

Would it be a good approach to identify which of the variables included in the scorecard are related, and use that to explain this effect?

Or just keep the opposite effects and give the answer "because the statistician said so" when asked?

I know that the data might be related, and I am not too worried about new data being from a different population, as we are looking at our own customer database, and will continue to do so.

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq | Standardized Estimate | Exp(Est) |
|---|---|---|---|---|---|---|---|
| Intercept | 1 | -2.9574 | 0.0697 | 1798.25 | <.0001 | | 0.052 |
| WOE_1 | 1 | -0.7656 | 0.0718 | 113.81 | <.0001 | -1.1490 | 0.465 |
| WOE_2 | 1 | -0.3554 | 0.1008 | 12.43 | 0.0004 | -0.3569 | 0.701 |
| WOE_3 | 1 | -0.4776 | 0.0592 | 65.10 | <.0001 | -0.2544 | 0.620 |
| WOE_4 | 1 | -0.2444 | 0.1340 | 3.33 | 0.0682 | -0.0642 | 0.783 |
| WOE_5 | 1 | 0.2427 | 0.1030 | 5.55 | 0.0185 | 0.0562 | 1.275 |

The last effect here is positive, while the rest are negative.
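
As a quick sanity check on the table, the Exp(Est) column is just the exponential of each estimate, so the positive WOE_5 coefficient shows up as a value above 1 while the others fall below 1. The short sketch below recomputes that column from the estimates quoted above; the variable names in the code are only for illustration.

```python
import math

# Estimates copied from the "Analysis of Maximum Likelihood Estimates" table.
estimates = {
    "Intercept": -2.9574,
    "WOE_1": -0.7656,
    "WOE_2": -0.3554,
    "WOE_3": -0.4776,
    "WOE_4": -0.2444,
    "WOE_5": 0.2427,
}

# Exp(Est) = exp(estimate), rounded to three decimals as in the output.
exp_est = {name: round(math.exp(b), 3) for name, b in estimates.items()}

for name, value in exp_est.items():
    side = "above 1" if value > 1 else "below 1"
    print(f"{name}: Exp(Est) = {value:.3f} ({side})")
```

For the WOE inputs, a value below 1 means a one-unit increase in that WOE lowers the odds of the modeled event, and a value above 1 means it raises them.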

Fit statistics, just for fun

| Fit Statistics | Statistics Label | Train | Validation | Test |
|---|---|---|---|---|
| _AIC_ | Akaike's Information Criterion | 3508.10 | . | . |
| _ASE_ | Average Squared Error | 0.05 | 0.05 | 0.05 |
| _AVERR_ | Average Error Function | 0.17 | 0.17 | 0.17 |
| _DFE_ | Degrees of Freedom for Error | 10557.00 | . | . |
| _DFM_ | Model Degrees of Freedom | 6.00 | . | . |
| _DFT_ | Total Degrees of Freedom | 10563.00 | . | . |
| _DIV_ | Divisor for ASE | 21126.00 | 15846.00 | 15850.00 |
| _ERR_ | Error Function | 3496.10 | 2655.00 | 2670.64 |
| _FPE_ | Final Prediction Error | 0.05 | . | . |
| _MAX_ | Maximum Absolute Error | 1.00 | 1.00 | 0.99 |
| _MSE_ | Mean Square Error | 0.05 | 0.05 | 0.05 |
| _NOBS_ | Sum of Frequencies | 10563.00 | 7923.00 | 7925.00 |
| _NW_ | Number of Estimate Weights | 6.00 | . | . |
| _RASE_ | Root Average Sum of Squares | 0.21 | 0.21 | 0.21 |
| _RFPE_ | Root Final Prediction Error | 0.21 | . | . |
| _RMSE_ | Root Mean Squared Error | 0.21 | 0.21 | 0.21 |
| _SBC_ | Schwarz's Bayesian Criterion | 3551.69 | . | . |
| _SSE_ | Sum of Squared Errors | 963.57 | 724.58 | 731.04 |
| _SUMW_ | Sum of Case Weights Times Freq | 21126.00 | 15846.00 | 15850.00 |
| _MISC_ | Misclassification Rate | 0.05 | 0.05 | 0.05 |
| _AUR_ | Area Under ROC | 0.83 | 0.82 | 0.81 |
| _Gini_ | Gini Coefficient | 0.65 | 0.64 | 0.62 |
| _KS_ | Kolmogorov-Smirnov Statistic | 0.51 | 0.52 | 0.51 |
| _ARATIO_ | Accuracy Ratio | 0.65 | 0.64 | 0.62 |

04-22-2016 10:17 AM

You're absolutely right - it is likely due to collinearity among your inputs. Are you using a model selection method in the Scorecard node? That might help eliminate the problem.
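
To illustrate how collinearity can produce a sign flip, here is a small Python sketch on synthetic data (not the data from this thread). It assumes two strongly correlated inputs, `x1` and `x2`, and fits a logistic regression with a hand-written Newton-Raphson loop; the data-generating coefficients and the `fit_logistic` helper are made up for the illustration. Fitted alone, `x2` gets a negative coefficient because it mostly stands in for `x1`; fitted together with `x1`, its coefficient turns positive.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    Returns the coefficient vector [intercept, slope_1, slope_2, ...].
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        w = p * (1.0 - p)
        grad = A.T @ (y - p)              # score vector
        hess = A.T @ (A * w[:, None])     # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 20000

# Two inputs that share a common component z, so they are highly correlated.
z = rng.normal(size=n)
x1 = z + 0.3 * rng.normal(size=n)
x2 = z + 0.3 * rng.normal(size=n)

# Assumed "true" model for the simulation: x1 lowers the log-odds,
# x2 raises them slightly once x1 is held fixed.
logit = -2.0 - 1.5 * x1 + 0.5 * x2
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

corr = np.corrcoef(x1, x2)[0, 1]
alone = fit_logistic(x2.reshape(-1, 1), y)          # x2 on its own
joint = fit_logistic(np.column_stack([x1, x2]), y)  # x1 and x2 together

print(f"corr(x1, x2)       = {corr:.3f}")
print(f"x2 alone           = {alone[1]:+.3f}")
print(f"x1, x2 jointly     = {joint[1]:+.3f}, {joint[2]:+.3f}")
```

The point of the sketch is only that the sign of a coefficient in the joint model describes the effect of a variable with the other inputs held fixed, which can differ from its effect when considered on its own.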

04-25-2016 01:57 AM

Yes, I am using stepwise model selection. Multicollinearity is a problem in most model selection methods as well: each variable makes good sense on its own, but together they get coefficients with too high an absolute value and opposite signs.

I have tried adding a variable clustering node and using the cluster variables, but my model statistics drop and I get a poorer model.

Is there a way in Miner to figure out which of the variables are most correlated? Is using the clustering variable the best option?
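
Outside Miner, one simple way to see which inputs are most correlated is to export the WOE variables and look at their correlation matrix and variance inflation factors (VIF). The sketch below is a minimal Python illustration on synthetic data; the column names (`woe_a` to `woe_d`) and the helper functions are hypothetical. It uses the fact that the VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np

def correlation_and_vif(X):
    """Return the correlation matrix and the VIF of each column of X."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal
    # element of the inverse correlation matrix.
    vif = np.diag(np.linalg.inv(corr))
    return corr, vif

def most_correlated_pairs(corr, names, top=3):
    """List the variable pairs with the largest absolute correlation."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            pairs.append((float(abs(corr[i, j])), names[i], names[j]))
    pairs.sort(reverse=True)
    return pairs[:top]

# Synthetic stand-in for exported WOE columns (hypothetical names).
rng = np.random.default_rng(1)
n = 5000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.2 * rng.normal(size=n)   # nearly a copy of a
d = rng.normal(size=n)
names = ["woe_a", "woe_b", "woe_c", "woe_d"]
X = np.column_stack([a, b, c, d])

corr, vif = correlation_and_vif(X)
top_pairs = most_correlated_pairs(corr, names)

for name, v in zip(names, vif):
    print(f"{name}: VIF = {v:.2f}")
for r, left, right in top_pairs:
    print(f"|corr({left}, {right})| = {r:.3f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention rather than a hard rule.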

04-25-2016 09:58 PM

You could try doing variable selection with the HP Variable Selection node (on the HPDM tab). With unsupervised selection (an option for the **Target** **Model** property), it analyzes variance and reduces dimensionality by forward selection of the variables that contribute the most to the overall data variance. Or you can do sequential selection, which first performs unsupervised selection and then supervised selection, where the target is taken into account.
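
The idea of forward selection by contribution to overall variance can be sketched roughly as follows. This is only an illustration of the general idea in Python, not the algorithm the HP Variable Selection node actually runs; the `forward_variance_selection` function and the synthetic data are assumptions made for the example. At each step it adds the column that, together with the columns already chosen, explains the largest share of the total variance of all columns.

```python
import numpy as np

def forward_variance_selection(X, k):
    """Greedily pick k columns of X that explain the most total variance.

    Returns the selected column indices and the cumulative share of
    total variance explained after each pick.
    """
    Xc = X - X.mean(axis=0)
    total = (Xc ** 2).sum()
    selected, explained = [], []
    for _ in range(k):
        best_j, best_share = None, -1.0
        for j in range(X.shape[1]):
            if j in selected:
                continue
            S = Xc[:, selected + [j]]
            # Regress every column on the candidate set and measure
            # how much of the total variance the fit reproduces.
            coef, *_ = np.linalg.lstsq(S, Xc, rcond=None)
            share = ((S @ coef) ** 2).sum() / total
            if share > best_share:
                best_j, best_share = j, share
        selected.append(best_j)
        explained.append(float(best_share))
    return selected, explained

# Synthetic example: columns 0 and 1 are near duplicates,
# column 2 is independent, column 3 is independent with low variance.
rng = np.random.default_rng(2)
n = 2000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
X = np.column_stack([
    z1,
    z1 + 0.1 * rng.normal(size=n),
    z2,
    0.5 * rng.normal(size=n),
])

selected, explained = forward_variance_selection(X, k=2)
print("selected columns:", selected)
print("cumulative share of variance explained:", explained)
```

In this toy data, only one of the two near-duplicate columns is picked, because once one is in the set the other adds almost no new variance. That is the sense in which this kind of selection reduces redundancy among the inputs.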

04-26-2016 03:24 AM

Very cool, the variables selected this way are really different from the ones the IG and Scorecard nodes would choose. Then, using the Interactive Grouping and Scorecard nodes, I get a model with fewer variables, and still one positive effect and three negative effects.

So, still opposite effects, weaker variable coefficients, and the model comparison node will rather choose my previous model.

I am guessing that I have to accept that the data has too much collinearity and that I really should try to find new data or more independent variables?