KristineNavesta
Fluorite | Level 6

Hi

 

I am using SAS Enterprise Miner 13.2 with the Credit Scoring add-on to build a model that predicts credit card usage.

 

I suspect a problem with collinearity in my input data, as I always end up with at least one positive effect while the rest are negative. Depending on which criteria and variables I choose to include, the flipped variable changes, and the same variable can show a positive effect in some settings and a negative one in others.

 

What is a good strategy to avoid this problem?

 

It is very difficult to explain the variables on their own when one of them has an effect with the opposite sign.

Do I risk losing valuable information by excluding the variable?

 

Is identifying which of the variables included in the scorecard are correlated a good way of explaining this effect?

Or should I just keep the opposite effects and answer "because the statistician said so" when asked?

 

I know the variables may well be correlated, and I am not too worried about new data coming from a different population, since we are scoring our own customer database and will continue to do so.

 

                                   Analysis of Maximum Likelihood Estimates
 
                                                Standard          Wald                  Standardized
Parameter                     DF    Estimate       Error    Chi-Square    Pr > ChiSq        Estimate    Exp(Est)
 
Intercept                      1     -2.9574      0.0697       1798.25        <.0001                       0.052
WOE_1                          1     -0.7656      0.0718        113.81        <.0001         -1.1490       0.465
WOE_2                          1     -0.3554      0.1008         12.43        0.0004         -0.3569       0.701
WOE_3                          1     -0.4776      0.0592         65.10        <.0001         -0.2544       0.620
WOE_4                          1     -0.2444      0.1340          3.33        0.0682         -0.0642       0.783
WOE_5                          1      0.2427      0.1030          5.55        0.0185          0.0562       1.275
 

The last effect here is positive, while the rest are negative.
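As a sanity check on the collinearity suspicion, variance inflation factors and condition indices can be computed on the flow's exported training table. A minimal sketch, assuming placeholder names (train_export for the exported data set, good_bad for the binary target, woe_1-woe_5 for the WOE inputs above):

/* VIF and condition-index diagnostics for the WOE inputs.     */
/* PROC REG fits a linear model, but the VIFs depend only on   */
/* the input matrix, so the binary target is harmless here;    */
/* the fit itself is not of interest.                          */
/* train_export and good_bad are assumed, not actual, names.   */
proc reg data=train_export;
   model good_bad = woe_1-woe_5 / vif tol collin;
run;
quit;

VIFs well above 10, or condition indices above 30 with two or more inputs loading on the same component, would point at the pair(s) driving the sign flip.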

 

Fit statistics, just for fun

Fit Statistics

Statistic   Statistics Label                      Train    Validation        Test

_AIC_       Akaike's Information Criterion      3508.10             .           .
_ASE_       Average Squared Error                  0.05          0.05        0.05
_AVERR_     Average Error Function                 0.17          0.17        0.17
_DFE_       Degrees of Freedom for Error       10557.00             .           .
_DFM_       Model Degrees of Freedom               6.00             .           .
_DFT_       Total Degrees of Freedom           10563.00             .           .
_DIV_       Divisor for ASE                    21126.00      15846.00    15850.00
_ERR_       Error Function                      3496.10       2655.00     2670.64
_FPE_       Final Prediction Error                 0.05             .           .
_MAX_       Maximum Absolute Error                 1.00          1.00        0.99
_MSE_       Mean Square Error                      0.05          0.05        0.05
_NOBS_      Sum of Frequencies                 10563.00       7923.00     7925.00
_NW_        Number of Estimate Weights             6.00             .           .
_RASE_      Root Average Sum of Squares            0.21          0.21        0.21
_RFPE_      Root Final Prediction Error            0.21             .           .
_RMSE_      Root Mean Squared Error                0.21          0.21        0.21
_SBC_       Schwarz's Bayesian Criterion        3551.69             .           .
_SSE_       Sum of Squared Errors                963.57        724.58      731.04
_SUMW_      Sum of Case Weights Times Freq     21126.00      15846.00    15850.00
_MISC_      Misclassification Rate                 0.05          0.05        0.05
_AUR_       Area Under ROC                         0.83          0.82        0.81
_Gini_      Gini Coefficient                       0.65          0.64        0.62
_KS_        Kolmogorov-Smirnov Statistic           0.51          0.52        0.51
_ARATIO_    Accuracy Ratio                         0.65          0.64        0.62
4 REPLIES
WendyCzika
SAS Employee

You're absolutely right - it is likely due to collinearity among your inputs.  Are you using a model selection method in the Scorecard node?  That might help eliminate the problem.

 

KristineNavesta
Fluorite | Level 6

Yes, I am using stepwise model selection. Multicollinearity is a problem with most model selection methods as well: the variables are meaningful on their own, but together their coefficients end up with absolute values that are too high, with opposite signs.

 

I have tried adding a Variable Clustering node and using the cluster variables, but my model statistics drop and I get a poorer model.

 

Is there a way in Miner to figure out which of the variables are most correlated? Is using the cluster variables the best option?
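For reference, the pairwise correlations themselves are easy to get outside the node flow, for example in a SAS Code node or in Base SAS on the exported table. A minimal sketch, with the same placeholder names as before:

/* Pairwise Pearson correlations among the WOE-transformed     */
/* inputs; the largest off-diagonal values identify the most   */
/* strongly related pairs.                                     */
/* train_export and woe_1-woe_5 are assumed names.             */
proc corr data=train_export nosimple;
   var woe_1-woe_5;
run;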

WendyCzika
SAS Employee

You could try doing variable selection with the HP Variable Selection node (on the HPDM tab).  With unsupervised selection (an option for the Target Model property), it analyzes variance and reduces dimensionality by forward selection of the variables that contribute the most to the overall data variance.  Or you can do sequential selection which first performs unsupervised selection, then does supervised selection where the target is taken into account.  
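Outside the Miner nodes, PROC VARCLUS offers a related diagnostic: it groups correlated inputs into clusters, and the 1-R**2 ratio in its output suggests a single representative variable per cluster, which may keep more signal than replacing inputs with cluster components. A minimal sketch under the same placeholder names as above:

/* Divisive variable clustering of the WOE inputs. In the      */
/* R-squared table, the variable with the lowest 1-R**2 ratio  */
/* within each cluster is a natural representative to keep.    */
/* train_export and woe_1-woe_5 are assumed names.             */
proc varclus data=train_export maxeigen=0.7 short;
   var woe_1-woe_5;
run;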

KristineNavesta
Fluorite | Level 6

Very cool, I get really different variables selected than the IG and Scorecard nodes would choose. Then, using the Interactive Grouping and Scorecard nodes, I get a model with fewer variables, and still one positive effect and three negative effects.

 

So, still opposite effects, weaker variable coefficients, and the Model Comparison node would rather choose my previous model.

 

I am guessing that I have to accept that the data has too much collinearity, and that I really should try to find new data or more independent variables?

