09-24-2013 03:38 PM
I am trying to build a propensity model to describe demand for general-purpose loans and then predict which individuals are most likely to buy one.
While preparing my data set, however, I have run into some challenges.
First, I coded every loan application in 2012 as 1 and the remaining cases as 0. My aim is simply to assess which factors make someone more likely to buy a credit package.
However, I want to describe account balances at the time of application and after the application, so I have used the account balance three months prior to the loan application as an explanatory variable. For the cases that applied for a credit package the value is easy to find, but for those that did not apply for a loan in 2012 I cannot find any specific account balance, because they have no application date.
Do you have any idea how I can keep this kind of variable in my work?
Thank you very much
09-28-2013 07:18 PM
In credit risk separation models such as yours, there are typically two kinds. One is credit acquisition. The other is behavior risk for accounts already on your book. Your description seems to be the former; if it is the latter, you would have missing balance history on the non-applicants.
By all means, avoid using data that pertain to only one side of the separation. Technically, that is separation or quasi-separation by birth: the variable is populated for only one of the two outcome groups, so when you build your model it will be dominated by one or two such variables.
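To make the separation point concrete, here is a minimal sketch in Python on made-up data. The records and the function names `missingness_crosstab` and `is_separated_by_missingness` are hypothetical, not anything from the original poster's data. It simply cross-tabulates the 0/1 target against whether the balance is missing; if each missingness state maps to a single target value, the variable gives the answer away by construction.

```python
from collections import Counter

# Hypothetical toy records: (applied_in_2012, balance_3m_before_application).
# The balance is only defined when an application date exists.
records = [
    (1, 1200.0),
    (1, 300.0),
    (1, 4500.0),
    (0, None),
    (0, None),
    (0, None),
    (0, None),
]


def missingness_crosstab(rows):
    """Count each (target, balance_is_missing) combination."""
    return Counter((target, balance is None) for target, balance in rows)


def is_separated_by_missingness(rows):
    """True when knowing whether the balance is missing determines the target."""
    table = missingness_crosstab(rows)
    for missing in (True, False):
        # Collect the target values that occur for this missingness state.
        targets_seen = {t for (t, m), n in table.items() if m == missing and n > 0}
        if len(targets_seen) > 1:
            return False
    return True


table = missingness_crosstab(records)
print(table)
print(is_separated_by_missingness(records))
```

If the check comes back true on your real data, any model fed that variable (or a missing-value flag for it) is likely to lean on it almost entirely.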
If there is just one such variable you are 'crazy' about (I suspect you have quite a few; possibly your boss simply dictated this one to you), you can bring in other variables that have observations available for both the 0 and 1 groups. Run a clustering or KNN on the 1 and 0 groups combined, hoping to see a good mix of 1s and 0s in the resulting clusters or 'neighborhoods'. Depending on how the non-missing account balance is distributed inside each cluster, you can decide on a voting mechanism to impute the missing value for the 0 group. If you are comfortable building a large number of clusters, you can get fairly differentiated imputed values for the missing cases. But don't push it too far.
A rather primitive alternative is subgroup regression: pick some other variables that are common to both groups and use them to predict the balance, fitting on the 1 group only. Then use that model to score the group whose balance is missing (the 0 group). This method creates a lot of complications down the road for your model.
This practice is, in essence, the same as 'reject inference', where the focus is to infer the 1 and 0 assignment for the rejected applicant group, whose charge-off (bad) status is unknown because of the rejection. Overall, this practice should not be applied to many variables that act as model drivers in the same model universe.
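The two imputation ideas above (nearest-neighbor voting and subgroup regression) could be sketched roughly as follows in plain Python. Everything here is an assumption for illustration: the toy customers, the choice of `age` and `income` as the variables common to both groups, the median as the "vote", `k=2`, and a one-predictor least-squares line for the regression. Treat it as a starting point, not a recommended specification.

```python
import math
from statistics import mean, median, pstdev

# Hypothetical toy data: age and income are known for everyone; balance is
# known only for applicants (applied == 1).
customers = [
    {"applied": 1, "age": 25, "income": 2.0, "balance": 400.0},
    {"applied": 1, "age": 30, "income": 3.0, "balance": 600.0},
    {"applied": 1, "age": 45, "income": 6.0, "balance": 1200.0},
    {"applied": 1, "age": 50, "income": 7.0, "balance": 1400.0},
    {"applied": 0, "age": 27, "income": 2.5, "balance": None},
    {"applied": 0, "age": 48, "income": 6.5, "balance": None},
]
FEATURES = ("age", "income")


def standardize(rows, features):
    """Z-score each feature using all rows (1 and 0 groups combined)."""
    stats = {}
    for f in features:
        values = [r[f] for r in rows]
        stats[f] = (mean(values), pstdev(values) or 1.0)  # guard against zero spread
    return [[(r[f] - stats[f][0]) / stats[f][1] for f in features] for r in rows]


def knn_impute(rows, features, k=2):
    """Fill each missing balance with the median balance of its k nearest donors."""
    vectors = standardize(rows, features)
    donors = [(v, r["balance"]) for v, r in zip(vectors, rows) if r["balance"] is not None]
    imputed = []
    for v, r in zip(vectors, rows):
        if r["balance"] is not None:
            imputed.append(r["balance"])
            continue
        # Rank donors by Euclidean distance in standardized feature space.
        nearest = sorted(donors, key=lambda d: math.dist(v, d[0]))[:k]
        imputed.append(median([balance for _, balance in nearest]))
    return imputed


def subgroup_regression_impute(rows, predictor="income"):
    """Fit balance ~ predictor on rows with a balance, then score the rows without one."""
    fit_rows = [r for r in rows if r["balance"] is not None]
    x_bar = mean([r[predictor] for r in fit_rows])
    y_bar = mean([r["balance"] for r in fit_rows])
    sxy = sum((r[predictor] - x_bar) * (r["balance"] - y_bar) for r in fit_rows)
    sxx = sum((r[predictor] - x_bar) ** 2 for r in fit_rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [
        r["balance"] if r["balance"] is not None else intercept + slope * r[predictor]
        for r in rows
    ]


knn_balances = knn_impute(customers, FEATURES, k=2)
reg_balances = subgroup_regression_impute(customers, predictor="income")
print(knn_balances)
print(reg_balances)
```

Note that in both versions every imputed value for the 0 group is derived entirely from the 1 group's balances, which is exactly why this should be limited to a single variable and treated with caution.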