KristineNavesta
Fluorite | Level 6

Hi

 

I am using SAS Enterprise Miner 13.2 with the Credit Scoring add-on to build a model that predicts credit card usage.

 

I suspect a problem with collinearity in my input data, as I always end up with at least one positive effect while the rest are negative. Depending on which criteria and variables I choose to include, the flipped variable changes, and the same variable can show a positive effect in some settings and a negative one in others.

 

What is a good strategy to avoid this problem?

 

It is very difficult to explain the variables on their own when one of them has an effect with the opposite sign.

Do I risk losing valuable information by excluding the variable?

 

Is identifying which of the variables included in the scorecard are correlated a good way of explaining this effect?

Or should I just keep the opposite effects and answer "because the statistician said so" when asked?

 

I know the variables may well be correlated, and I am not too worried about new data coming from a different population, since we are scoring our own customer database and will continue to do so.

 

                                   Analysis of Maximum Likelihood Estimates
 
                                                Standard          Wald                  Standardized
Parameter                     DF    Estimate       Error    Chi-Square    Pr > ChiSq        Estimate    Exp(Est)
 
Intercept                      1     -2.9574      0.0697       1798.25        <.0001                       0.052
WOE_1                          1     -0.7656      0.0718        113.81        <.0001         -1.1490       0.465
WOE_2                          1     -0.3554      0.1008         12.43        0.0004         -0.3569       0.701
WOE_3                          1     -0.4776      0.0592         65.10        <.0001         -0.2544       0.620
WOE_4                          1     -0.2444      0.1340          3.33        0.0682         -0.0642       0.783
WOE_5                          1      0.2427      0.1030          5.55        0.0185          0.0562       1.275
 

The last effect here is positive, while the rest are negative.
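As a sanity check on the collinearity suspicion, variance inflation factors and condition indices can be computed on the flow's exported training table. A minimal sketch, assuming placeholder names (train_export for the exported data set, good_bad for the binary target, woe_1-woe_5 for the WOE inputs above):

/* VIF and condition-index diagnostics for the WOE inputs.     */
/* PROC REG fits a linear model, but the VIFs depend only on   */
/* the input matrix, so the binary target is harmless here;    */
/* the fit itself is not of interest.                          */
/* train_export and good_bad are assumed, not actual, names.   */
proc reg data=train_export;
   model good_bad = woe_1-woe_5 / vif tol collin;
run;
quit;

VIFs well above 10, or condition indices above 30 with two or more inputs loading on the same component, would point at the pair(s) driving the sign flip.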

 

Fit statistics, just for fun

Fit Statistics

Statistic   Statistics Label                      Train    Validation        Test

_AIC_       Akaike's Information Criterion      3508.10             .           .
_ASE_       Average Squared Error                  0.05          0.05        0.05
_AVERR_     Average Error Function                 0.17          0.17        0.17
_DFE_       Degrees of Freedom for Error       10557.00             .           .
_DFM_       Model Degrees of Freedom               6.00             .           .
_DFT_       Total Degrees of Freedom           10563.00             .           .
_DIV_       Divisor for ASE                    21126.00      15846.00    15850.00
_ERR_       Error Function                      3496.10       2655.00     2670.64
_FPE_       Final Prediction Error                 0.05             .           .
_MAX_       Maximum Absolute Error                 1.00          1.00        0.99
_MSE_       Mean Square Error                      0.05          0.05        0.05
_NOBS_      Sum of Frequencies                 10563.00       7923.00     7925.00
_NW_        Number of Estimate Weights             6.00             .           .
_RASE_      Root Average Sum of Squares            0.21          0.21        0.21
_RFPE_      Root Final Prediction Error            0.21             .           .
_RMSE_      Root Mean Squared Error                0.21          0.21        0.21
_SBC_       Schwarz's Bayesian Criterion        3551.69             .           .
_SSE_       Sum of Squared Errors                963.57        724.58      731.04
_SUMW_      Sum of Case Weights Times Freq     21126.00      15846.00    15850.00
_MISC_      Misclassification Rate                 0.05          0.05        0.05
_AUR_       Area Under ROC                         0.83          0.82        0.81
_Gini_      Gini Coefficient                       0.65          0.64        0.62
_KS_        Kolmogorov-Smirnov Statistic           0.51          0.52        0.51
_ARATIO_    Accuracy Ratio                         0.65          0.64        0.62
4 REPLIES
WendyCzika
SAS Employee

You're absolutely right - it is likely due to collinearity among your inputs.  Are you using a model selection method in the Scorecard node?  That might help eliminate the problem.

 

KristineNavesta
Fluorite | Level 6

Yes, I am using stepwise model selection. Multicollinearity is a problem with most model selection methods as well: the variables are meaningful on their own, but together their coefficients end up with absolute values that are too high, with opposite signs.

 

I have tried adding a Variable Clustering node and using the cluster variables, but my model statistics drop and I get a poorer model.

 

Is there a way in Miner to figure out which of the variables are most correlated? Is using the cluster variables the best option?
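For reference, the pairwise correlations themselves are easy to get outside the node flow, for example in a SAS Code node or in Base SAS on the exported table. A minimal sketch, with the same placeholder names as before:

/* Pairwise Pearson correlations among the WOE-transformed     */
/* inputs; the largest off-diagonal values identify the most   */
/* strongly related pairs.                                     */
/* train_export and woe_1-woe_5 are assumed names.             */
proc corr data=train_export nosimple;
   var woe_1-woe_5;
run;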

WendyCzika
SAS Employee

You could try doing variable selection with the HP Variable Selection node (on the HPDM tab).  With unsupervised selection (an option for the Target Model property), it analyzes variance and reduces dimensionality by forward selection of the variables that contribute the most to the overall data variance.  Or you can do sequential selection which first performs unsupervised selection, then does supervised selection where the target is taken into account.  
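Outside the Miner nodes, PROC VARCLUS offers a related diagnostic: it groups correlated inputs into clusters, and the 1-R**2 ratio in its output suggests a single representative variable per cluster, which may keep more signal than replacing inputs with cluster components. A minimal sketch under the same placeholder names as above:

/* Divisive variable clustering of the WOE inputs. In the      */
/* R-squared table, the variable with the lowest 1-R**2 ratio  */
/* within each cluster is a natural representative to keep.    */
/* train_export and woe_1-woe_5 are assumed names.             */
proc varclus data=train_export maxeigen=0.7 short;
   var woe_1-woe_5;
run;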

KristineNavesta
Fluorite | Level 6

Very cool, I get really different variables selected than the IG and Scorecard nodes would choose. Then, using the Interactive Grouping and Scorecard nodes, I get a model with fewer variables, and still one positive effect and three negative effects.

 

So, still opposite effects, weaker variable coefficients, and the Model Comparison node would rather choose my previous model.

 

I am guessing that I have to accept that the data has too much collinearity, and that I really should try to find new data or more independent variables?

