Hi
I am using SAS Enterprise Miner 13.2 with Credit Scoring to build a prediction model for the usage of credit cards.
I suspect a problem with collinearity in my input data, as I always end up with at least one positive effect while the rest are negative. Depending on which criteria and variables I choose to include, the positive effect might fall on a different variable in each setting, and the same variable might have a positive effect in some settings and a negative one in others.
What is a good strategy to avoid this problem?
It is very difficult to explain each variable on its own when one of them has the opposite effect.
Do I risk losing valuable information by excluding the variable?
When explaining this effect, is it a good approach to identify which of the variables included in the scorecard are related?
Or should I just keep the opposite effects and give the answer "because the statistician said so" when asked?
I know that the data might be related, and I am not too worried about new data being from a different population, as we are looking at our own customer database, and will continue to do so.
Analysis of Maximum Likelihood Estimates
Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq | Standardized Estimate | Exp(Est) |
Intercept | 1 | -2.9574 | 0.0697 | 1798.25 | <.0001 | | 0.052 |
WOE_1 | 1 | -0.7656 | 0.0718 | 113.81 | <.0001 | -1.1490 | 0.465 |
WOE_2 | 1 | -0.3554 | 0.1008 | 12.43 | 0.0004 | -0.3569 | 0.701 |
WOE_3 | 1 | -0.4776 | 0.0592 | 65.10 | <.0001 | -0.2544 | 0.620 |
WOE_4 | 1 | -0.2444 | 0.1340 | 3.33 | 0.0682 | -0.0642 | 0.783 |
WOE_5 | 1 | 0.2427 | 0.1030 | 5.55 | 0.0185 | 0.0562 | 1.275 |
The last effect here is positive, while the rest are negative.
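For readers less familiar with this output, the Exp(Est) column is simply the exponentiated estimate, that is, the multiplicative change in the odds for a one-unit increase in that WOE variable with the other inputs held fixed. A value above 1 (WOE_5) goes with a positive coefficient and a value below 1 with a negative one. A minimal sketch that recomputes the column from the estimates in the table (the dictionary name `estimates` is only an illustration):

```python
import math

# Estimates copied from the "Analysis of Maximum Likelihood Estimates" table.
estimates = {
    "Intercept": -2.9574,
    "WOE_1": -0.7656,
    "WOE_2": -0.3554,
    "WOE_3": -0.4776,
    "WOE_4": -0.2444,
    "WOE_5": 0.2427,
}

# Exp(Est) is exp(estimate): the factor by which the odds are multiplied
# for a one-unit increase in the input, other inputs held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, ratio in odds_ratios.items():
    side = "above 1" if ratio > 1 else "below 1"
    print(f"{name}: exp({estimates[name]:+.4f}) = {ratio:.3f} ({side})")
```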
Fit statistics, just for fun
Fit Statistics | Statistics Label | Train | Validation | Test |
_AIC_ | Akaike's Information Criterion | 3508.10 | . | . |
_ASE_ | Average Squared Error | 0.05 | 0.05 | 0.05 |
_AVERR_ | Average Error Function | 0.17 | 0.17 | 0.17 |
_DFE_ | Degrees of Freedom for Error | 10557.00 | . | . |
_DFM_ | Model Degrees of Freedom | 6.00 | . | . |
_DFT_ | Total Degrees of Freedom | 10563.00 | . | . |
_DIV_ | Divisor for ASE | 21126.00 | 15846.00 | 15850.00 |
_ERR_ | Error Function | 3496.10 | 2655.00 | 2670.64 |
_FPE_ | Final Prediction Error | 0.05 | . | . |
_MAX_ | Maximum Absolute Error | 1.00 | 1.00 | 0.99 |
_MSE_ | Mean Square Error | 0.05 | 0.05 | 0.05 |
_NOBS_ | Sum of Frequencies | 10563.00 | 7923.00 | 7925.00 |
_NW_ | Number of Estimate Weights | 6.00 | . | . |
_RASE_ | Root Average Sum of Squares | 0.21 | 0.21 | 0.21 |
_RFPE_ | Root Final Prediction Error | 0.21 | . | . |
_RMSE_ | Root Mean Squared Error | 0.21 | 0.21 | 0.21 |
_SBC_ | Schwarz's Bayesian Criterion | 3551.69 | . | . |
_SSE_ | Sum of Squared Errors | 963.57 | 724.58 | 731.04 |
_SUMW_ | Sum of Case Weights Times Freq | 21126.00 | 15846.00 | 15850.00 |
_MISC_ | Misclassification Rate | 0.05 | 0.05 | 0.05 |
_AUR_ | Area Under ROC | 0.83 | 0.82 | 0.81 |
_Gini_ | Gini Coefficient | 0.65 | 0.64 | 0.62 |
_KS_ | Kolmogorov-Smirnov Statistic | 0.51 | 0.52 | 0.51 |
_ARATIO_ | Accuracy Ratio | 0.65 | 0.64 | 0.62 |
You're absolutely right - it is likely due to collinearity among your inputs. Are you using a model selection method in the Scorecard node? That might help eliminate the problem.
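To illustrate how collinearity can flip a sign, here is a small sketch on synthetic data. It is not the poster's data, and it uses ordinary least squares with two inputs rather than the logistic regression of the Scorecard node, purely to keep the arithmetic visible. All names (`x1`, `x2`, `y`) and coefficients are made up for the illustration. The input `x2` has a negative relationship with the target when looked at alone, yet receives a positive coefficient once the correlated input `x1` is in the model.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical inputs: x1 and x2 share a common factor z, so they are
# strongly correlated (population correlation about 0.92).
z = [random.gauss(0, 1) for _ in range(n)]
x1 = [zi + 0.3 * random.gauss(0, 1) for zi in z]
x2 = [zi + 0.3 * random.gauss(0, 1) for zi in z]

# Assumed true model: x1 pushes y down, x2 pushes y up by a smaller amount.
y = [-1.0 * a + 0.4 * b + 0.5 * random.gauss(0, 1) for a, b in zip(x1, x2)]


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
s1y, s2y = cov(x1, y), cov(x2, y)

# Slope of y on x2 alone (what a one-variable analysis would show).
marginal_slope_x2 = s2y / s22

# Coefficients of y on x1 and x2 together (2x2 normal equations).
det = s11 * s22 - s12 ** 2
joint_b1 = (s22 * s1y - s12 * s2y) / det
joint_b2 = (s11 * s2y - s12 * s1y) / det

print(f"x2 alone:      slope = {marginal_slope_x2:+.3f}")
print(f"x1 and x2:     b1 = {joint_b1:+.3f}, b2 = {joint_b2:+.3f}")
```

Alone, `x2` mostly acts as a stand-in for `x1`, so it inherits the negative direction. In the joint model `x1` takes over that role and `x2` is left with its own, smaller, positive contribution.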
Yes, I am using stepwise model selection. Multicollinearity is a problem for most model selection methods as well: each variable makes good sense on its own, but together they get coefficients with too high an absolute value and opposite signs.
I have tried adding a variable clustering node and using the cluster variables, but my model statistics drop and I get a poorer model.
Is there a way in Miner to figure out which of the variables are most correlated? Is using the clustering variable the best option?
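One common way to see which inputs are most correlated, outside of any particular Miner node, is to rank the pairwise correlations and compute a variance inflation factor (VIF) for each input. The sketch below assumes the WOE columns have been exported to plain lists. The column names `WOE_A`, `WOE_B`, `WOE_C` and the data are synthetic placeholders, not the variables from this thread. The VIF of each input is read off the diagonal of the inverse correlation matrix.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Synthetic placeholder data: WOE_A and WOE_B share a common factor,
# WOE_C is independent of both.
random.seed(1)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]
data = {
    "WOE_A": [zi + 0.3 * random.gauss(0, 1) for zi in z],
    "WOE_B": [zi + 0.3 * random.gauss(0, 1) for zi in z],
    "WOE_C": [random.gauss(0, 1) for _ in range(n)],
}

names = list(data)
corr = [[pearson(data[a], data[b]) for b in names] for a in names]

# Variable pairs ranked by absolute correlation, strongest first.
pairs = sorted(
    ((names[i], names[j], corr[i][j])
     for i in range(len(names)) for j in range(i + 1, len(names))),
    key=lambda t: abs(t[2]),
    reverse=True,
)

# VIF of each input = matching diagonal entry of the inverse correlation matrix.
inverse = invert(corr)
vif = {name: inverse[i][i] for i, name in enumerate(names)}

for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:+.3f}")
for name, value in vif.items():
    print(f"VIF({name}) = {value:.2f}")
```

Pairwise correlations only catch two-variable relationships, while a high VIF also flags an input that is well explained by a combination of several others. Within SAS the same quantities are available from PROC CORR and from the VIF option on the MODEL statement of PROC REG, for example run from a SAS Code node.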
You could try doing variable selection with the HP Variable Selection node (on the HPDM tab). With unsupervised selection (an option for the Target Model property), it analyzes variance and reduces dimensionality by forward selection of the variables that contribute the most to the overall data variance. Or you can do sequential selection which first performs unsupervised selection, then does supervised selection where the target is taken into account.
Very cool, the selected variables are really different from the ones the IG and Scorecard nodes would choose. Then, using the Interactive Grouping and Scorecard nodes, I get a model with fewer variables, and still one positive effect and three negative effects.
So, still opposite effects, weaker variable coefficients, and the model comparison node will rather choose my previous model.
I am guessing that I have to accept that the data has too much collinearity and that I really should try to find new data or more independent variables?