Hi all,
I need some of your insights for my master's thesis, and I hope you can help.
I am trying to build a churn prediction model for a retail bank, and I would like to use regression analysis to do it. In particular, I would like to use a logit (logistic regression) model.
My dataset is panel data covering 135000 customers whose behavior was tracked over an observation period of T=17 months. It has 50 variables. 22 of them are dummies that describe whether or not the client has a given financial product or account, and whether or not the client has made any debit or credit transaction on it. The rest are demographic variables plus continuous and discrete variables that describe how the client's earnings, financial position and credit risk have changed over time.
I have now finished the preprocessing phase. I have formatted the variables, imputed missing values, restricted the dataset by age in order to eliminate involuntary churners, and created a target variable called "churn".
My questions now are:
1) How can I detect outliers? Does the following code work also for longitudinal data?
proc reg data=have;
title "ABC Outliers";
model churn=Customer_id_Ano
Month
family_id_Ano
Earning_Cust12M
Earning_Fam12M
customer_status_Numeric
account_basic_sum
csi
customer_trans_count
logons
meeting_n
overdraft_facility_sum
volume_investments
volume_loans
volume_pension_lifeins
volume_savings
LTV_current
Teller_trans_3m_last
customer_age
first_account_open
account_save_debit_transaction
account_save_credit_transaction
account_save
account_savings_volume
basic_bank_debit_transaction
basic_bank_credit_transaction
credit_card_debit_transaction
credit_card_credit_transaction
loan_home
loan_home_volume
meeting_with_client
atm_transaction
dialogue_with_advisor
secure_message_sent
teller_dialogue
basic_banking
day_to_day_finance
ebank_0
home_finance
insurance
investments
pension
personal_lending
savings
num_adults
num_kids
customer_gender
;
output out=want (keep= churn Customer_id_Ano
Month
family_id_Ano
Earning_Cust12M
Earning_Fam12M
customer_status_Numeric
account_basic_sum
csi
customer_trans_count
logons
meeting_n
overdraft_facility_sum
volume_investments
volume_loans
volume_pension_lifeins
volume_savings
LTV_current
Teller_trans_3m_last
customer_age
first_account_open
account_save_debit_transaction
account_save_credit_transaction
account_save
account_savings_volume
basic_bank_debit_transaction
basic_bank_credit_transaction
credit_card_debit_transaction
credit_card_credit_transaction
loan_home
loan_home_volume
meeting_with_client
atm_transaction
dialogue_with_advisor
secure_message_sent
teller_dialogue
basic_banking
day_to_day_finance
ebank_0
home_finance
insurance
investments
pension
personal_lending
savings
num_adults
num_kids
customer_gender r) rstudent=r;
run;
quit;
data new;
set want;
if abs(r)>2 then delete;
run;
proc univariate data=want plots plotsize=30;
var r;
run;
2) How can I check for multicollinearity among the variables with longitudinal data and logistic regression? Can I use the VIF (variance inflation factor)?
3) Should I split the dataset into training and validation sets? If yes, what code can I use when my dataset is panel data? I must be sure that when SAS splits the dataset into training and validation sets it keeps the panel structure intact.
4) What procedure should I use in SAS to run the logistic regression? I do not think that PROC LOGISTIC is the right choice, as it does not take into account the correlation between the 17 observations within a subject. Is PROC GLIMMIX the one I need?
I know I've asked a lot of questions, but hopefully you'll be able to help.
Thank you in advance!
Hello,
You can use PROC GLIMMIX indeed.
But a good alternative is using PROC LOGISTIC to construct a “multinomial discrete-time logistic hazard regression” (in your case binary instead of multinomial).
This model also allows for time-dependent and time-varying covariates.
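As a rough sketch of what that could look like (not a definitive implementation): assuming the data are already in person-period format, i.e. one row per customer per month, with churn=1 only in the month the customer leaves and no rows for that customer afterwards, the binary discrete-time hazard model is an ordinary logistic regression with Month as a predictor. The covariates below are just an illustrative subset of the variables in the question, and treating Month as a CLASS variable is one possible choice for the baseline hazard.

```sas
/* Sketch only: assumes person-period data (one row per customer-month),  */
/* churn = 1 in the month of churn, and no rows after the churn month.    */
proc logistic data=have;
   /* Month dummies act as the baseline hazard (an assumption; a smooth   */
   /* function of Month would be another option).                        */
   class Month / param=ref;
   model churn(event='1') = Month
                            customer_age
                            logons
                            volume_savings
                            volume_loans;   /* illustrative subset */
run;
```

Because each row is conditional on the customer having survived up to that month, time-varying covariates can enter the MODEL statement with their monthly values.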
See this Enterprise Miner tip:
Tip: Getting Started with Survival Data Mining in SAS® Enterprise Miner™
See also this video:
New Features in the SAS® Enterprise Miner™ 12.3 Survival Node
https://www.youtube.com/watch?v=56X8MxVRrKk
This video makes use of a churn example!
Survival Data Mining in Enterprise Miner relies on this parametric model: multinomial discrete-time logistic hazard regression.
You should definitely use data partitioning (data splitting) to avoid overfitting.
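For panel data, one way to do this (a sketch under assumptions, not the only approach) is to split on the customer rather than on the row, so that all monthly records of a customer end up in the same partition. The 70/30 ratio, the seed, and the dataset names ids, ids_split, train and valid are arbitrary choices for illustration; Customer_id_Ano is taken from the question as the customer key.

```sas
/* One row per customer */
proc sort data=have(keep=Customer_id_Ano) out=ids nodupkey;
   by Customer_id_Ano;
run;

/* Randomly flag about 70% of customers; OUTALL keeps every customer */
/* and adds Selected = 1 (sampled) or 0 (not sampled).               */
proc surveyselect data=ids out=ids_split
                  samprate=0.7 seed=12345 outall;
run;

/* Attach the flag to all monthly rows and write the two partitions */
proc sql;
   create table train as
   select h.*
   from have as h
        inner join ids_split as s
        on h.Customer_id_Ano = s.Customer_id_Ano
   where s.Selected = 1;

   create table valid as
   select h.*
   from have as h
        inner join ids_split as s
        on h.Customer_id_Ano = s.Customer_id_Ano
   where s.Selected = 0;
quit;
```

Since the sampling is done on customer IDs, each customer's full sequence of months stays together in either train or valid.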
Kind regards,
Koen
Brussels