Requesting tips to deal with missing values?

novinosrin · Posted 08-08-2018 01:31 PM

Hi SAS Stat folks et al,

I have a sample customer dataset, drawn by simple random sampling (SRS), that has a variable named EDUCATION among several other variables.

EDUCATION (categorical): code 1 to 6 depending on education levels

Example:

GENDER	EDUCATION
M	2
M	.
F	1
M	1
F	1
M	2
M	4
M	.
M	1
M	2
M	.
M	1
F	1
F	.
M	1

I am working on a project to identify/predict retention and churn of customers, and I'm afraid I am encountering many missing values for the variable EDUCATION, as shown above.

I ran PROC FREQ to count the missing values: 205 out of 983 obs, i.e. about 21%. To me this looks much too high to discard. Is my notion right? If yes, what's the best way to impute the missing values, and how would you suggest I proceed from here?

Btw, it's a university summer project, you can regard this as a HW.
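
For readers following along, the missing-value count described above can be reproduced with a sketch like the following. The dataset name `customers` is an assumption (the thread never names it); the MISSING option on the TABLES statement makes PROC FREQ list missing as its own row and include it in the percentages.

```sas
/* "customers" is a hypothetical dataset name, not given in the thread */
proc freq data=customers;
   /* MISSING: show the missing level as a row and count it in the percentages */
   tables education / missing;
run;
```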

PaigeMiller · Posted 08-08-2018 01:56 PM

Only the people who know your data can say if it is too high or not.

In this simple example, probably the best thing to do is to leave the missing as missing, its own distinct category. There really is no way to impute a value here, nor would I recommend it; imputing usually requires one or more variables that can predict education level. I suppose you could use hot-deck or cold-deck imputation (look them up), but it really seems forced and out of place in this example.

--
Paige Miller

ballardw · Posted 08-08-2018 02:10 PM

Depending on your data source(s), there may be some metadata available to indicate why values are missing. For modeling, there might be a difference to consider between values missing because they were not collected for some records and values missing because the respondent refused. If that could be identified, you may have 2 additional response categories to consider (or more IF the reasons are available).

I do have one project that uses "education" levels and have to derive some way to map other country education systems to the report categories. Perhaps you have a similar situation where a coder took an easy way out of setting to missing because they did not have a good protocol for such values. Perhaps there is a text field with other education related information in the file or a related data set?

novinosrin · Posted 08-08-2018 02:16 PM

Thank you both, @PaigeMiller and @ballardw. Your input is leading me to go back and ask more questions of the people who provided me the following in the first place:

Given the large number of competitors, mobile service providers are very interested in analyzing and predicting customer retention and churn. The primary goal of churn analysis is to identify those customers that are most likely to discontinue using your service or product. The dataset contains information about a random sample of customers of a mobile service company. For each customer, the company recorded the following variables:

CHURN: 1 if customer switched provider, 0 if customer did not switch
GENDER: M, F
EDUCATION (categorical): code 1 to 6 depending on education levels
LAST_PRICE_PLAN_CHNG_DAY_CNT: No. of days since last price plan change
TOT_ACTV_SRV_CNT: Total no. of active services
AGE: customer age
PCT_CHNG_IB_SMS_CNT: Percent change of latest 2 months incoming SMS wrt previous 4 months incoming SMS
PCT_CHNG_BILL_AMT: Percent change of latest 2 months bill amount wrt previous 4 months bill amount
COMPLAINT: 1 if there was at least one customer complaint in the two months, 0 if no complaint

The company is interested in a churn predictive model that identifies the most important predictors affecting the probability of switching to a different mobile phone company.

Ksharp · Posted 08-09-2018 09:05 AM

What do you want? You could impute it by mean or lag(var).

If you are doing regression, I suggest using

PROC PLS with the MISSING=EM option

or PROC MI.
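
As a rough illustration of the PROC MI route mentioned above, here is a minimal sketch under stated assumptions: the dataset name `customers`, the output name `customers_mi`, the seed, the number of imputations, and the choice of predictors are all illustrative and not taken from the thread. Because EDUCATION is categorical, the sketch uses the FCS statement with a logistic method rather than imputing a mean. Whether the available variables predict EDUCATION well enough for this to be sensible is a separate question.

```sas
/* Hypothetical names: customers (input), customers_mi (output) */
proc mi data=customers nimpute=5 seed=20180809 out=customers_mi;
   class education gender;
   /* impute the categorical EDUCATION with a logistic model */
   fcs logistic(education);
   /* variables used in the imputation model; this selection is an assumption */
   var churn gender age tot_actv_srv_cnt education;
run;
```

The output dataset stacks the completed copies and indexes them with the `_Imputation_` variable; the usual next step is to fit the churn model BY `_Imputation_` and combine the results with PROC MIANALYZE.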

novinosrin · Posted 08-09-2018 09:44 AM

Thank you, @Ksharp. I'm just not sure mean imputation would make sense for the EDUCATION variable, although lag(var) sounds like a good idea. Yes, I am doing regression analysis and will look into PROC PLS and PROC MI. Thanks again.

PaigeMiller · Posted 08-09-2018 10:15 AM

I too have used PROC PLS to do imputation of missing values, but in this specific case posted by @novinosrin, I don't see how any of the other variables are good predictors of EDUCATION. Maybe I'm wrong, but just by reading the variable names and their explanations, I don't see it. EDUCATION might be predicted by demographics, but there isn't really demographic information in the variables described.

Similarly, in this specific case, I don't see how PROC MI would produce reasonable imputations of EDUCATION. I don't recommend the cold-deck or hot-deck approach either. I would not do any imputation; I would treat the missing EDUCATION cases as a valid separate level of EDUCATION and work with it that way, with EDUCATION being a categorical variable with 7 levels: missing and 1 through 6.

--
Paige Miller
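
The "missing as its own level" approach recommended above could look like the following sketch. It assumes a dataset named `customers` (hypothetical) and a logistic regression for CHURN; the thread only says "regression analysis", so the choice of PROC LOGISTIC is an assumption. The MISSING option on the CLASS statement tells the procedure to treat missing values of the CLASS variables as a valid level instead of dropping those observations.

```sas
/* "customers" is a hypothetical dataset name */
proc logistic data=customers;
   /* MISSING keeps missing EDUCATION as a 7th level rather than excluding the row */
   class education gender / missing param=ref;
   model churn(event='1') = education gender age tot_actv_srv_cnt
                            last_price_plan_chng_day_cnt
                            pct_chng_ib_sms_cnt pct_chng_bill_amt complaint;
run;
```

Note that observations with missing values in the continuous predictors are still excluded; the MISSING option only affects the CLASS variables.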