novinosrin
Tourmaline | Level 20

Hi SAS Stat folks et al,

I'm requesting tips on dealing with missing values.

I have a sample customer dataset, drawn by simple random sampling (SRS), that has a variable named EDUCATION among several other variables.

EDUCATION (categorical): code 1 to 6 depending on education levels

For example:

GENDER EDUCATION
M 2
M .
F 1
M 1
F 1
M 2
M 4
M .
M 1
M 2
M .
M 1
F 1
F .
M 1

 

I am working on a project to identify/predict retention and churn of customers, and I'm afraid I am encountering many missing values for the variable EDUCATION, as shown above.

I ran PROC FREQ to count the missing values: 205 out of 983 observations, i.e. about 21%. To me this looks much too high to discard. Is my notion right? If so, what is the best way to impute the missing values, and how should I proceed from here?
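
The arithmetic behind that figure can be checked with a minimal sketch. It is written in Python rather than SAS purely for illustration, and it uses only the 15 example rows listed above plus the reported full-data counts (205 of 983); `None` is an assumed stand-in for the SAS missing value `.`.

```python
from collections import Counter

# EDUCATION values from the 15 example rows above; None stands for SAS missing (.)
education = [2, None, 1, 1, 1, 2, 4, None, 1, 2, None, 1, 1, None, 1]

# Tally every level, counting missing like any other value
counts = Counter(education)
n_missing = counts[None]
sample_rate = n_missing / len(education)

# Rate reported for the full dataset: 205 missing out of 983 observations
full_rate = 205 / 983

print(counts)
print(f"example rows: {n_missing}/{len(education)} missing ({sample_rate:.1%})")
print(f"full data: 205/983 missing ({full_rate:.1%})")
```

The full-data rate works out to roughly 20.9%, which matches the "about 21%" quoted above.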

 

Btw, it's a university summer project, so you can regard this as homework.
6 REPLIES
PaigeMiller
Diamond | Level 26

Only the people who know your data can say if it is too high or not.

 

In this simple example, probably the best thing to do is to leave the missing values as missing, treated as their own distinct category. There really is no way to impute a value here, nor would I recommend it; imputing usually requires one or more variables that can predict education level. I suppose you could use hot-deck or cold-deck imputation (look them up), but that seems forced and out of place in this example.

--
Paige Miller
ballardw
Super User

Depending on your data source(s), there may be some metadata available to indicate why values are missing. For modeling, there may be a difference to consider between values that are missing because they were not collected for some records and values that are missing because the respondent refused to answer. If that can be identified, you may have 2 additional response categories to consider (or more, if the reasons are available).

I have one project that uses "education" levels, and I have to derive some way to map other countries' education systems to the report categories. Perhaps you have a similar situation, where a coder took the easy way out and set values to missing because they did not have a good protocol for them. Perhaps there is a text field with other education-related information in the file or in a related data set?

novinosrin
Tourmaline | Level 20

Thank you both, @PaigeMiller and @ballardw. Your input is leading me to go back and ask more questions of the people who provided me with the following in the first place:

 

Given the large number of competitors, mobile service providers are very interested in analyzing and predicting customer retention and churn. The primary goal of churn analysis is to identify those customers who are most likely to discontinue using your service or product. The dataset contains information about a random sample of customers of a mobile service company. For each customer, the company recorded the following variables:

  1. CHURN: 1 if customer switched provider, 0 if customer did not switch
  2. GENDER: M, F
  3. EDUCATION (categorical): code 1 to 6 depending on education levels
  4. LAST_PRICE_PLAN_CHNG_DAY_CNT: No. of days since last price plan change
  5. TOT_ACTV_SRV_CNT: Total no. of active services
  6. AGE: customer age
  7. PCT_CHNG_IB_SMS_CNT: Percent change of latest 2 months incoming SMS wrt previous 4 months incoming SMS
  8. PCT_CHNG_BILL_AMT: Percent change of latest 2 months bill amount wrt previous 4 months bill amount
  9. COMPLAINT: 1 if there was at least one customer complaint in the two months, 0 if no complaint

 

The company is interested in a churn predictive model that identifies the most important predictors affecting the probability of switching to a different mobile phone company.

Ksharp
Super User

What do you want? You could impute it by the mean or by lag(var).

If you are doing regression, I suggest using

  PROC PLS with the MISSING=EM option

or PROC MI.

novinosrin
Tourmaline | Level 20

Thank you, @Ksharp. I'm just not sure whether mean imputation would make sense for the EDUCATION variable, although lag(var) sounds like a good idea. Yes, I am doing regression analysis; I will look into PROC PLS and PROC MI. Thanks again.
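
To make the two simple options concrete, here is a small sketch in Python (not SAS, and only an illustration) on the 15 example rows from the original post. It assumes `None` stands for the SAS missing value and that "lag(var)" imputation means filling a missing value with the previous row's raw value.

```python
# EDUCATION values from the 15 example rows; None stands for SAS missing (.)
education = [2, None, 1, 1, 1, 2, 4, None, 1, 2, None, 1, 1, None, 1]

# Mean imputation: average the observed codes
observed = [v for v in education if v is not None]
mean_code = sum(observed) / len(observed)

# Lag-style imputation: replace a missing value with the previous row's raw value
# (the first row has no predecessor and is left unchanged)
lag_filled = [
    education[i - 1] if v is None and i > 0 else v
    for i, v in enumerate(education)
]

print(f"mean of observed codes: {mean_code:.3f}")
print("lag-filled:", lag_filled)
```

The mean of the observed codes is 17/11, which is not one of the category codes 1 to 6, so it has no direct interpretation as an education level. The lag-style fill always yields a valid code, but whether the previous row is a meaningful donor depends on whether the row order carries any information.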

PaigeMiller
Diamond | Level 26

I too have used PROC PLS to impute missing values, but in this specific case posted by @novinosrin, I don't see how any of the other variables are good predictors of EDUCATION. Maybe I'm wrong, but just from reading the variable names and their explanations, I don't see it. EDUCATION might be predicted by demographics, but there isn't really any demographic information in the variables described.

 

Similarly, in this specific case, I don't see how PROC MI would produce reasonable imputations of EDUCATION. I don't recommend the cold-deck or hot-deck approach either. I would not do any imputation; I would treat the missing EDUCATION cases as a valid separate level of EDUCATION and work with it that way, with EDUCATION being a categorical variable with 7 levels: missing and 1 through 6.
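
As a rough illustration of what "missing as a seventh level" means for the design matrix, here is a minimal Python sketch (not SAS). The helper names `education_level` and `one_hot`, the `EDU_` column prefix, and the choice of level 1 as the reference are all assumptions made for this example. In SAS itself, the MISSING option on the CLASS statement of a modeling procedure such as PROC LOGISTIC treats missing values as a valid level, so no manual recoding should be needed there.

```python
# Hypothetical helpers: code EDUCATION with missing kept as its own level.
LEVELS = ["missing", 1, 2, 3, 4, 5, 6]  # 7 levels: missing and 1 through 6


def education_level(value):
    """Map a raw EDUCATION value to one of the 7 levels (None = SAS missing)."""
    if value is None:
        return "missing"
    if value not in (1, 2, 3, 4, 5, 6):
        raise ValueError(f"unexpected EDUCATION code: {value!r}")
    return value


def one_hot(value, reference=1):
    """Return 0/1 indicator columns for every level except the reference level."""
    level = education_level(value)
    return {f"EDU_{lv}": int(level == lv) for lv in LEVELS if lv != reference}


# A few values taken from the example rows in the original post
for raw in [2, None, 1, 4]:
    print(raw, one_hot(raw))
```

With a reference level, the 7 levels become 6 indicator columns, and a missing EDUCATION simply switches on its own indicator instead of dropping the row from the model.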

--
Paige Miller
