pvareschi
Quartz | Level 8

Re: Applied Analytics Using SAS Enterprise Miner

I just want to check that I have understood correctly what problems are caused by an excessive number of inputs and/or levels of categorical variables (page 3-15 of the course text):
1. the input space becomes sparse, making it difficult to obtain accurate estimates of the parameters
2. increased difficulty in identifying "true relationships" vs "spurious relationships" due to excessive noise in the data; moreover, the more inputs we have, the more likely it is that some of them will seem "significant" by pure chance (type I error)
3. it may become more difficult to screen inputs because of increased collinearity among inputs
4. the risk of overfitting is likely to increase, especially when using categorical inputs with many levels
5. quasi-separation is also likely to occur, in particular when levels with a low count of cases (i.e. rare categories) are present

Accepted Solution
PaigeMiller
Diamond | Level 26

My opinions

  1. Sparse data does not necessarily cause difficulty in getting accurate estimates. Collinearity causes more problems than sparsity when estimating parameters, and the usual problem is not accuracy but lack of precision, that is, high variability of the estimates.
  2. Excessive noise in the data is not necessarily a function of dimensionality. You can have low-dimensional data with excessive noise and high-dimensional data with relatively low noise. I do agree with this part: "the more likely it is that some of them will seem 'significant' by pure chance (type I error)". A quick simulation after this list makes that point concrete.
  3. "it may become more difficult to screen inputs because of increase collinearity among inputs" is not how I would phrase it — determining which inputs are good predictors is not difficult in high dimensional space with collinearity; developing a good predictive model is the difficult part in the presence of collinearity, and many techniques handle the collinearity poorly. However, in that situation, you can use Partial Least Squares regression which mitigates the problem of collinearity. According to SAS's Randall Tobias: "Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear." And there are hundreds (thousands? tens of thousands?) of published articles where PLS is used successfully in such situations of high collinearity. Unfortunately, much of SAS training for data analytics, Enterprise Miner and Viya tends to ignore this method of Partial Least Squares, despite its proven effectiveness, and in my mind is a major deficiency of such training.
  4. Categorical variables with many levels can be a problem and can lead to overfitting, but they are not the only cause of overfitting, which is indeed a problem in high-dimensional spaces where the inputs are collinear.
  5. But again, this is not necessarily due to high dimensionality; there are examples of quasi-separation in low-dimensional space as well. And I must say that I have run many logistic regressions in my life with 10-20 input variables, and I don't ever remember getting the quasi-separation warning from SAS. The last sketch below shows the typical trigger: a rare category whose cases all share the same outcome.
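
To make the type I error point in item 2 concrete, here is a minimal simulation sketch (all data and names are made up): a pure-noise response regressed on 50 pure-noise inputs. At alpha = 0.05 you should expect roughly two or three of the 50 to come out "significant" by chance alone.

data noise;
   call streaminit(12345);
   array x{50};
   do i = 1 to 200;
      y = rand('normal');        /* response unrelated to any input */
      do j = 1 to 50;
         x{j} = rand('normal');  /* 50 independent noise inputs */
      end;
      output;
   end;
   drop i j;
run;

proc reg data=noise;
   model y = x1-x50;  /* scan the p-values: a few will fall below 0.05 */
run;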
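For item 3, a minimal PROC PLS sketch, assuming a training table TRAIN with a response Y and collinear numeric inputs X1-X50 (hypothetical names). Cross-validation chooses the number of latent factors.

proc pls data=train method=pls cv=split cvtest;
   model y = x1-x50;  /* latent factors absorb the collinearity among the inputs */
run;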
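Finally, for item 5, a toy example (invented data) of the usual trigger: a rare level whose cases all have the same outcome. PROC LOGISTIC reports possible quasi-complete separation, and the estimate for that level is unreliable.

data rare;
   input grp $ y @@;
   datalines;
A 0 A 1 A 0 A 1 B 0 B 1 B 0 B 1 C 1 C 1
;
run;

proc logistic data=rare;
   class grp / param=ref;
   model y(event='1') = grp;  /* level C has only y=1: quasi-complete separation */
run;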
--
Paige Miller

