pvareschi
Quartz | Level 8

Re: Predictive Modeling Using Logistic Regression (page 3.6-3.7 of course text)

Q1. Given that missing indicator variables are created with the purpose of "capturing the relationship of missingness with the Target", in order to keep things manageable, would it not make sense to check immediately whether they are correlated with the target and, if not, just drop them from the dataset, even before starting to screen the inputs for redundancy and irrelevancy?
Q2. How does the percentage of missingness in a variable impact the choice of imputation method (i.e. synthetic vs. distributional)?
Q3. When would it be appropriate to use "random imputation" (i.e. based on the observed distribution of cases - this is value "Distribution" of Impute Node in Enterprise Miner)? Would it be when missing values do not have a relationship with the target?

Q4. Should imputation of missing values for a categorical variable be done before or after collapsing the levels?
Q5. How should ordinal variables be handled in terms of missing values? The point is that ordinal variables sit between categorical and interval, so replacing missing values with the mode may not always be appropriate; likewise, treating missing as a separate category would essentially make the variable a categorical one.

Accepted Solutions
gcjfernandez
SAS Employee

Re: Predictive Modeling Using Logistic Regression (page 3.6-3.7 of course text)

Q1. Given that missing indicator variables are created with the purpose of "capturing the relationship of missingness with the Target", in order to keep things manageable, would it not make sense to check immediately whether they are correlated with the target and, if not, just drop them from the dataset, even before starting to screen the inputs for redundancy and irrelevancy?

My Answer:

This PMLR course is based on writing SAS code, so we can run any relevant analysis in any order by writing our own code. It therefore makes sense to drop any redundant or irrelevant missing value indicator variables with custom code.
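
As a rough illustration of such a custom screening step, here is a minimal sketch in Python rather than SAS. It is not taken from the course text. It assumes a binary 0/1 target and missing values coded as `None`; the function names and the 3.841 cutoff (the chi-square critical value for 1 degree of freedom at the 0.05 level) are choices made for this example only.

```python
def missing_indicator(values):
    """Return 1 where the value is missing (None), else 0."""
    return [1 if v is None else 0 for v in values]


def chi_square_2x2(indicator, target):
    """Pearson chi-square statistic for two 0/1 lists (no continuity correction)."""
    n = len(indicator)
    # counts[i][j] = number of cases with indicator == i and target == j
    counts = [[0, 0], [0, 0]]
    for i, t in zip(indicator, target):
        counts[i][t] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    if 0 in row or 0 in col:
        # The indicator or the target is constant, so no association can be measured.
        return 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (counts[i][j] - expected) ** 2 / expected
    return stat


CRITICAL_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05 (assumed cutoff)


def keep_indicator(values, target, threshold=CRITICAL_05):
    """Keep the missing indicator only if it is associated with the target."""
    return chi_square_2x2(missing_indicator(values), target) > threshold
```

With a large sample even a very weak association will pass a significance cutoff, so the threshold is something to tune rather than a fixed rule.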


Q2. How does the percentage of missingness in a variable impact the choice of imputation method (i.e. synthetic vs. distributional)?

My answer:

The percentage of missingness in a variable does not affect the choice of imputation method. It is used to decide whether to keep the variable or drop it from the analysis. I think the variable is dropped if more than 50% of its values are missing.
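
A keep-or-drop rule of this kind could be sketched as follows. This is Python for illustration only, with missing values coded as `None`; the helper names are hypothetical, and the 0.5 default simply mirrors the 50% figure mentioned above rather than any documented standard.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None); 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_high_missing(columns, max_missing=0.5):
    """Split a dict of name -> values into kept columns and dropped names.

    A column is dropped when its missing fraction is strictly greater
    than max_missing (assumed cutoff).
    """
    kept, dropped = {}, []
    for name, values in columns.items():
        if missing_fraction(values) > max_missing:
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


# Hypothetical example data: "income" is 75% missing, "age" is 25% missing.
columns = {
    "income": [52000, None, None, None],
    "age": [30, None, 40, 50],
}
kept, dropped = drop_high_missing(columns)
```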


Q3. When would it be appropriate to use "random imputation" (i.e. based on the observed distribution of cases - this is value "Distribution" of Impute Node in Enterprise Miner)? Would it be when missing values do not have a relationship with the target?

My Answer:

If the missing values are considered MCAR (missing completely at random), random imputation methods can be appropriate.
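
To make "random imputation based on the observed distribution" concrete, here is a minimal Python sketch of the idea: each missing value is replaced by a random draw from the non-missing values of the same variable. It is an illustration of the general technique, not a description of how the Enterprise Miner Impute node implements its "Distribution" option; the function name and the `seed` parameter are assumptions of this example.

```python
import random


def impute_from_observed(values, seed=0):
    """Replace each None with a random draw from the observed (non-missing) values.

    Sampling with replacement from the observed values roughly preserves
    the variable's empirical distribution, unlike mean or mode imputation.
    """
    rng = random.Random(seed)  # fixed seed so the imputation is reproducible
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    return [rng.choice(observed) if v is None else v for v in values]


# Hypothetical example data.
balance = [120.0, None, 80.0, 300.0, None, 95.0]
imputed = impute_from_observed(balance, seed=42)
```

Because the draws ignore every other variable, this only makes sense when missingness is unrelated to both the inputs and the target, which is the MCAR condition above.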

 

Q4. Should imputation of missing values for a categorical variable be done before or after collapsing the levels?

My answer:

Missing value imputation (for both interval and categorical inputs) should be the last step before fitting the model. Imputation for a categorical variable should therefore be performed after recoding (collapsing) its levels.
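
A small Python sketch shows why the order can matter. It uses mode imputation and a simple frequency rule for collapsing (levels seen fewer than `min_count` times are merged into one level); both are stand-ins chosen for this example, not the collapsing method from the course, and all names are hypothetical.

```python
from collections import Counter


def collapse_rare_levels(values, min_count=2, other_label="OTHER"):
    """Merge levels that occur fewer than min_count times; leave None untouched."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in values
    ]


def impute_mode(values):
    """Replace None with the most frequent observed level."""
    counts = Counter(v for v in values if v is not None)
    mode = counts.most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical example data: one common level, four rare levels, one missing value.
branch = ["A", "A", "A", "B", "C", "D", "E", None]

collapse_then_impute = impute_mode(collapse_rare_levels(branch))
impute_then_collapse = collapse_rare_levels(impute_mode(branch))
```

In this toy data the collapsed level holds four cases against three for "A", so collapsing first imputes the missing case into the collapsed level, while imputing first puts it into "A".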


Q5. How should ordinal variables be handled in terms of missing values? The point is that ordinal variables sit between categorical and interval, so replacing missing values with the mode may not always be appropriate; likewise, treating missing as a separate category would essentially make the variable a categorical one.

My answer:

Ordinal variables are usually treated as categorical, and the missing value imputation methods for categorical variables are applied. If you want to try more advanced methods, you can use decision tree based imputation, or treat the ordinal scale as interval, apply optimal binning, and assign the records with missing values to the appropriate bins.
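
One possible reading of "assign the records with missing values to the appropriate bins" is to place them in the ordinal level whose target event rate is closest to the event rate of the missing group. The Python sketch below implements only that reading, on the assumption of a binary 0/1 target and missing values coded as `None`. It is not an optimal binning algorithm, and the function names are hypothetical.

```python
def event_rate(targets):
    """Proportion of events (1s) in a list of 0/1 target values."""
    return sum(targets) / len(targets)


def assign_missing_to_closest_level(levels, target):
    """Replace None with the ordinal level whose event rate is closest
    to the event rate observed among the missing cases."""
    by_level = {}
    missing_targets = []
    for lv, t in zip(levels, target):
        if lv is None:
            missing_targets.append(t)
        else:
            by_level.setdefault(lv, []).append(t)
    if not missing_targets or not by_level:
        return list(levels)  # nothing to impute, or nothing to impute from
    miss_rate = event_rate(missing_targets)
    # Iterating over sorted levels makes ties resolve to the lowest level.
    best = min(
        sorted(by_level),
        key=lambda lv: abs(event_rate(by_level[lv]) - miss_rate),
    )
    return [best if lv is None else lv for lv in levels]


# Hypothetical example data: event rates are 0.0, 0.5 and 1.0 for levels 1, 2, 3,
# and 1.0 for the missing group.
rating = [1, 1, 2, 2, 3, 3, None, None]
target = [0, 0, 0, 1, 1, 1, 1, 1]
filled = assign_missing_to_closest_level(rating, target)
```

This keeps the variable ordinal, at the cost of using the target during imputation, so it would need to be fitted on training data only and then applied unchanged to validation data.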
