Solved: Clarifications on handling missing values

pvareschi · Posted 06-08-2020 07:57 AM

Re: Predictive Modeling Using Logistic Regression (page 3.6-3.7 of course text)

Q1. Given that Missing Indicator variables are created with the purpose of "capturing the relationship of missingness with the Target", would it not make sense, in order to keep things manageable, to check immediately whether they are correlated with the target and, if not, just drop them from the dataset, even before screening the inputs for redundancy and irrelevancy?
Q2. How does the percentage of missingness in a variable affect the choice of imputation method (i.e. synthetic vs distributional)?
Q3. When would it be appropriate to use "random imputation" (i.e. imputation based on the observed distribution of cases; this is the value "Distribution" of the Impute node in Enterprise Miner)? Would it be when missing values do not have a relationship with the target?

Q4. Should imputation of missing values for a categorical variable be done before or after collapsing the levels?
Q5. How should ordinal variables be handled in terms of missing values? The point is that ordinal variables sit between categorical and interval variables, so replacing missing values with the mode may not always be appropriate; likewise, treating missing as a separate category would essentially turn the variable into a categorical one.

gcjfernandez · Posted 06-10-2020 02:14 AM

Re: Predictive Modeling Using Logistic Regression (page 3.6-3.7 of course text)

Q1. Given that Missing Indicator variables are created with the purpose of "capturing the relationship of missingness with the Target", would it not make sense, in order to keep things manageable, to check immediately whether they are correlated with the target and, if not, just drop them from the dataset, even before screening the inputs for redundancy and irrelevancy?

My answer:

This PMLR course is based on writing SAS code, so we can perform any relevant analysis in any sequence by writing our own code. It therefore makes more sense to drop any redundant or irrelevant missing value indicator variables by writing custom code.
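As a rough illustration of the kind of custom check meant here, below is a minimal sketch in Python rather than SAS. It is not taken from the course text. The function names `missing_indicator` and `chi_square_2x2`, the toy `income` and `target` values, and the choice of a Pearson chi-square statistic as the screening measure are all assumptions made for the example.

```python
from typing import List, Optional, Sequence


def missing_indicator(values: Sequence[Optional[float]]) -> List[int]:
    """Return 1 where the input is missing (None) and 0 elsewhere."""
    return [1 if v is None else 0 for v in values]


def chi_square_2x2(indicator: Sequence[int], target: Sequence[int]) -> float:
    """Pearson chi-square statistic for a 0/1 indicator against a 0/1 target."""
    counts = [[0, 0], [0, 0]]
    for m, t in zip(indicator, target):
        counts[m][t] += 1
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [counts[0][j] + counts[1][j] for j in range(2)]
    # With an empty row or column there is nothing to compare.
    if 0 in row_totals or 0 in col_totals:
        return 0.0
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (counts[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: None marks a missing value.
income = [None, None, None, None, 50.0, 60.0, 70.0, 80.0]
target = [1, 1, 1, 0, 0, 0, 0, 1]

income_missing = missing_indicator(income)
statistic = chi_square_2x2(income_missing, target)

# 3.841 is the 5% critical value of chi-square with 1 degree of freedom.
keep_indicator = statistic > 3.841
```

On this toy data the statistic falls below the 5% critical value, so the indicator would be a candidate for dropping. With samples this small an exact test would be the safer choice; the numbers are only there to make the steps concrete.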

Q2. How does the percentage of missingness in a variable affect the choice of imputation method (i.e. synthetic vs distributional)?

My answer:

The percentage of missingness in a variable does not affect the choice of imputation method. It is used to decide whether to keep the variable or drop it from the analysis. I think that if more than 50% of the values are missing, the variable is dropped from the analysis.
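A minimal Python sketch of that keep-or-drop check follows. It is only an illustration: the helper names `missing_fraction` and `flag_high_missing` and the toy columns are hypothetical, and the 0.5 default simply mirrors the tentative 50% figure above rather than any fixed rule.

```python
from typing import Dict, List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (None); 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def flag_high_missing(
    columns: Dict[str, Sequence[Optional[float]]], threshold: float = 0.5
) -> List[str]:
    """Names of the columns whose missing fraction exceeds the threshold."""
    return [
        name
        for name, values in columns.items()
        if missing_fraction(values) > threshold
    ]


# Hypothetical toy data: None marks a missing value.
columns = {
    "age": [30.0, None, 45.0, 50.0],         # 25% missing
    "income": [None, None, None, 40000.0],   # 75% missing
}
to_drop = flag_high_missing(columns)
```

Here only `income` exceeds the threshold, so it is the one variable flagged for dropping.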

Q3. When would it be appropriate to use "random imputation" (i.e. imputation based on the observed distribution of cases; this is the value "Distribution" of the Impute node in Enterprise Miner)? Would it be when missing values do not have a relationship with the target?

My answer:

If the missing values are considered MCAR (missing completely at random), random imputation methods can be appropriate.
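To make the idea of random imputation concrete, here is a minimal Python sketch that fills each missing value with a random draw from the observed values of the same variable. It illustrates the general technique only and is not a description of how the Enterprise Miner Impute node implements its "Distribution" option. The function name `impute_from_distribution`, the `seed` parameter and the toy data are assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence


def impute_from_distribution(
    values: Sequence[Optional[float]], seed: int = 0
) -> List[float]:
    """Replace each missing value (None) with a random draw from the observed values."""
    rng = random.Random(seed)  # fixed seed so that the result is reproducible
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    return [v if v is not None else rng.choice(observed) for v in values]


# Hypothetical toy data: None marks a missing value.
balance = [100.0, None, 250.0, None, 400.0]
imputed = impute_from_distribution(balance, seed=42)
```

Because the draws come from the observed values, the imputed variable keeps roughly the same distribution as the non-missing cases, which only makes sense when missingness is unrelated to the values themselves, as under MCAR.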

Q4. Should imputation of missing values for a categorical variable be done before or after collapsing the levels?

My answer:

Missing value imputation (for both interval and categorical variables) should be the last step before running the model step. Therefore, missing value imputation for a categorical variable should be performed after recoding its levels.

Q5. How should ordinal variables be handled in terms of missing values? The point is that ordinal variables sit between categorical and interval variables, so replacing missing values with the mode may not always be appropriate; likewise, treating missing as a separate category would essentially turn the variable into a categorical one.

My answer:

Ordinal variables are usually treated as categorical, and the missing value imputation methods for categorical variables are applied. If you want to try more advanced methods, you can use a decision-tree-based imputation method, or treat the ordinal scale as interval, apply optimal binning, and assign the records with missing values to the appropriate bins.

