a month ago
I am building a churn prediction model using logistic regression. My dataset is an unbalanced panel that records the behavior over time of a retail bank's 350,000 customers. My question is how SAS treats unbalanced panel data when fitting a logistic regression.
Can unbalanced panel data cause problems when running PROC LOGISTIC?
So far, I have removed or imputed missing values, detected outliers, and removed multicollinearity, so I am ready to start building the model. However, I am concerned that the unbalanced panel will cause problems when SAS analyzes it with PROC LOGISTIC.
Could you please explain how PROC LOGISTIC handles unbalanced panel data?
Thank you in advance.
a month ago - last edited a month ago
Since you say it is panel data, I assume that you have repeated binary responses over time from each customer. Observations from the same customer are likely to be correlated, and that correlation should be taken into account in the analysis.
Probably the most common approach is a Generalized Estimating Equations (GEE) logistic model, which can be fit with PROC GEE or PROC GENMOD. In either procedure, use the REPEATED statement and specify in the SUBJECT= option a variable that has a unique value for each customer. The data set should have multiple observations per customer. Specify the DIST=BINOMIAL option in the MODEL statement to fit a GEE logistic model.
Another approach is a conditional logistic model, which can be fit in PROC LOGISTIC by specifying your customer variable in the STRATA statement. Either approach allows unequal numbers of responses per customer. Some discussion of these models can be found in:
"Categorical Data Analysis Using SAS, Third Edition" (Stokes, M. et. al., SAS Institute, 2012)
"Logistic Regression Using SAS: Theory and Application, Second Edition," (Allison, P., SAS Institute, 2012)
"Fixed Effects Regression Methods for Longitudinal Data Using SAS" (Allison, P., SAS Institute, 2005)