pvareschi
Quartz | Level 8

Re: Applied Analytics Using SAS Enterprise Miner

(this is a follow-up to a previous post)

I would be grateful if someone could clarify the following points about the solution presented in the final Case Study for this module (references/times refer to the videos presenting the template solution):

Q1. (Video "Exploring the Data", time 1:56) In the proposed solution, class variables with a mode percentage above 97% are excluded: how was 97% chosen? Is it based on a rule of thumb?

Would it not make sense to use an even lower threshold, such as 95% or 90%? Highly skewed class predictors are likely to be tied to very specific sub-populations, so they may not help much in building a predictive model that generalises across the whole population (in my experience, this is even more important when working with a binary target).
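To make sure I understand the screening rule, here is how I would sketch it in plain Python (the data, names, and thresholds are my own illustration, not from the course material):

```python
from collections import Counter

def mode_percentage(values):
    """Share of rows taken by the most common level of a class variable."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

def screen_class_inputs(columns, threshold=0.97):
    """Return the names of class variables whose mode share exceeds the
    threshold; these would be set to Rejected before modelling."""
    return [name for name, values in columns.items()
            if mode_percentage(values) > threshold]

# Toy data: 'flag' is dominated by a single level (98% 'N')
columns = {
    "flag":   ["N"] * 98 + ["Y"] * 2,
    "region": ["A"] * 40 + ["B"] * 35 + ["C"] * 25,
}
print(screen_class_inputs(columns, threshold=0.97))  # ['flag']
```

With a 90% threshold the same `flag` variable would still be rejected, which is why I am wondering what makes 97% the right cut-off.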
Q2. (Video "Modifying the Data", time 0:24) The data is partitioned by taking a random sample stratified on all categorical variables: is that recommended and/or standard practice, even when there are many variables with potentially several levels? Would it not be enough to take a random sample with stratification based on the levels of the target variable only?
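What I had in mind as the simpler alternative is something like the following sketch: a generic target-only stratified partition in Python (my own illustration, not how the Data Partition node is implemented):

```python
import random

def stratified_partition(rows, target_index, train_frac=0.7, seed=12345):
    """Split rows into train/validation, preserving the proportion of each
    target level in both partitions (stratification on the target only)."""
    by_level = {}
    for row in rows:
        by_level.setdefault(row[target_index], []).append(row)
    rng = random.Random(seed)
    train, valid = [], []
    for level_rows in by_level.values():
        rng.shuffle(level_rows)
        cut = round(len(level_rows) * train_frac)
        train.extend(level_rows[:cut])
        valid.extend(level_rows[cut:])
    return train, valid

# Toy data: binary target with a 10% event rate
rows = [("x", 1)] * 10 + [("x", 0)] * 90
train, valid = stratified_partition(rows, target_index=1)
event_rate = lambda part: sum(r[1] for r in part) / len(part)
print(round(event_rate(train), 2), round(event_rate(valid), 2))  # 0.1 0.1
```

This keeps the event rate identical in both partitions without needing to stratify on every categorical input.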
Q3. (Video "Modifying the Data") Imputation of missing values:

(a) (time 2:40) What is the purpose of defining the missing indicator variables as the number of variables with missing values? How are they used later in the process/modelling?

(b) (time 3:10) A threshold of "60% missing values" is used to decide whether an estimation or synthetic distribution approach should be used. In general, is it correct to say that an estimation approach is appropriate only when the percentage of missing values is not too large?
(c) (time 3:25) What is the target variable used to assess the performance of the Decision Trees which are part of the data imputation process? Is it the original target variable?
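On (a) and (b), my current understanding of the mechanics is roughly the following Python sketch (entirely my own illustration: the random draw from the observed values is just a stand-in for whatever synthetic-distribution method the node actually uses):

```python
import random

def missing_rate(values):
    """Fraction of observations that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def missing_indicator(values):
    """Companion binary flag marking which rows were originally missing,
    so the model can still use 'missingness' as information."""
    return [int(v is None) for v in values]

def impute_column(values, threshold=0.60, seed=0):
    """Mean-impute when missingness is at or below the threshold; above it,
    estimation is unreliable, so fall back to draws from the observed
    distribution (stand-in for a synthetic-distribution approach)."""
    observed = [v for v in values if v is not None]
    if missing_rate(values) <= threshold:
        fill = sum(observed) / len(observed)          # estimation
        return [fill if v is None else v for v in values]
    rng = random.Random(seed)
    return [rng.choice(observed) if v is None else v for v in values]

col = [1.0, None, 3.0, None, 5.0]   # 40% missing -> estimation branch
print(missing_indicator(col))        # [0, 1, 0, 1, 0]
print(impute_column(col))            # [1.0, 3.0, 3.0, 3.0, 5.0]
```

If that is the right picture, my remaining doubt is exactly question (c): what the tree-based imputation is optimising against.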
Q4. (Video "Building a Set of Predictive Models")

(a) (time 0:39) Just to clarify, is the Variable Selection node set up to affect only interval variables (i.e. all categorical variables are left unchanged and kept at Role=Input)? And are the interval variables that are found to be non-significant set to Rejected (and thus not used by the subsequent Regression nodes)?
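To state what I mean by "only affect interval variables", here is a toy sketch in Python, using a simple correlation filter purely as a stand-in for whatever criterion the Variable Selection node actually applies (all names and the cut-off are hypothetical):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def screen_interval_inputs(interval_cols, target, min_abs_r=0.05):
    """Reject interval inputs weakly associated with the target;
    class variables are not passed in, so they are left untouched."""
    return {name: ("Input" if abs(pearson_r(values, target)) >= min_abs_r
                   else "Rejected")
            for name, values in interval_cols.items()}

target   = [0, 0, 1, 1]
x_strong = [1.0, 2.0, 8.0, 9.0]   # moves with the target
x_weak   = [5.0, 5.1, 5.0, 5.1]   # unrelated to the target
print(screen_interval_inputs({"x1": x_strong, "x2": x_weak}, target))
# {'x1': 'Input', 'x2': 'Rejected'}
```

So only the interval inputs get a chance to be rejected, and the Regression nodes downstream would then ignore anything marked Rejected.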
(b) (time 1:25) The two Decision Tree models are fitted with no input screening preceding them: in general, should we be worried about fitting Decision Tree models when there are very many inputs, or can they handle any number of input variables?

In other words, would a Decision Tree model benefit from input screening applied before fitting the model?