Re: Applied Analytics Using SAS Enterprise Miner
I just want to check I have understood correctly what problems are caused by an excessive number of inputs and/or levels of categorical variables (page 3-15 of course text):
1. input space becomes sparse, making it difficult to obtain accurate estimates of parameters
2. increased difficulty in distinguishing "true relationships" from "spurious relationships" due to excessive noise in the data; moreover, the more inputs we have, the more likely it is that some of them will seem "significant" by pure chance (type I error)
3. it may become more difficult to screen inputs because of increased collinearity among inputs
4. the risk of overfitting increases, especially when using categorical inputs with many levels
5. quasi-separation is also likely to occur, particularly when levels with a low count of cases (i.e. rare categories) are present
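
To make points 2 and 5 concrete, here is a minimal Python sketch (not Enterprise Miner code, and not taken from the course text). The function names, sample sizes, and toy data are all hypothetical. The first function counts how many pure-noise inputs pass a rough two-sample z-test at about the 5% level against a random binary target; the second lists categorical levels whose cases all share one target value, which is the pattern that tends to produce quasi-separation in a logistic regression.

```python
import math
import random
from collections import defaultdict


def count_chance_significant(n_rows=500, n_inputs=200, z_crit=1.96, seed=0):
    """Count pure-noise inputs that look 'significant' against a random target.

    Every input is independent of the target by construction, so each hit
    is a type I error. Assumes both target groups have at least 2 cases.
    """
    rng = random.Random(seed)
    target = [rng.random() < 0.5 for _ in range(n_rows)]
    hits = 0
    for _ in range(n_inputs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        g1 = [v for v, t in zip(x, target) if t]
        g0 = [v for v, t in zip(x, target) if not t]
        m1 = sum(g1) / len(g1)
        m0 = sum(g0) / len(g0)
        v1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
        v0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
        # Two-sample z statistic for the difference in group means.
        z = (m1 - m0) / math.sqrt(v1 / len(g1) + v0 / len(g0))
        if abs(z) > z_crit:
            hits += 1
    return hits


def pure_levels(levels, target):
    """Return {level: case count} for levels whose cases all share one target value."""
    counts = defaultdict(lambda: [0, 0])  # level -> [non-events, events]
    for lvl, y in zip(levels, target):
        counts[lvl][1 if y else 0] += 1
    return {lvl: n0 + n1 for lvl, (n0, n1) in counts.items() if n0 == 0 or n1 == 0}


# Roughly 5% of the 200 noise inputs are expected to pass the test by chance.
print(count_chance_significant())

# Toy data: level "C" is rare and all of its cases are events.
levels = ["A"] * 6 + ["B"] * 5 + ["C"] * 2
target = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1]
print(pure_levels(levels, target))  # {'C': 2}
```

With 200 irrelevant inputs, around ten would be expected to clear the 5% threshold purely by chance, which is the type I error issue in point 2. A level such as "C" above, with only a couple of cases that all have the same outcome, would push the maximum likelihood estimate for that level's coefficient toward infinity, which is the rare-category issue in point 5.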