Njoshi
Calcite | Level 5

I am trying to check for multicollinearity while building a predictive model, but I am having difficulty figuring out how to do that in SAS Enterprise Miner.

Thanks in advance.

Accepted Solutions
DougWielenga
SAS Employee

These metrics are not generated by Enterprise Miner. Enterprise Miner is designed for processing large data sets with a large number of variables, for which it would be impractical to evaluate these and other typical regression diagnostics. Additionally, data sets with that many variables and observations often come with a nontrivial number of missing values. Note that when an observation has even one missing value for a variable, a method that requires complete data cannot use that observation at all. I was once given a data set with 25,000 observations (small by data mining standards) for which there were only 25 complete observations.
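To illustrate how quickly complete observations disappear as the number of variables grows, here is a minimal Python sketch on simulated data. The helper names (`count_complete_rows`, `simulate`), the 5% missing rate, and the data set dimensions are assumptions chosen for illustration; none of them come from Enterprise Miner or from the data set described above.

```python
import random


def count_complete_rows(rows):
    """Count rows in which no value is missing (missing is encoded as None)."""
    return sum(1 for row in rows if all(v is not None for v in row))


def simulate(n_rows, n_vars, miss_rate, seed=0):
    """Hypothetical data: each cell is independently missing with probability miss_rate."""
    rng = random.Random(seed)
    return [
        [None if rng.random() < miss_rate else rng.gauss(0.0, 1.0)
         for _ in range(n_vars)]
        for _ in range(n_rows)
    ]


# Assumed scenario: 2,000 observations, 100 variables, 5% of cells missing.
rows = simulate(n_rows=2000, n_vars=100, miss_rate=0.05)
complete = count_complete_rows(rows)

# Under independent missingness the expected share of complete rows is (1 - rate) ** n_vars.
expected_share = (1 - 0.05) ** 100

print(f"complete rows: {complete} of {len(rows)}")
print(f"expected share of complete rows: {expected_share:.4f}")
```

Even with only a small share of cells missing per variable, very few rows survive a complete-case analysis once there are many variables.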

 

For this reason, data mining methods such as regression that require complete data need to have missing values imputed before modeling. Imputation artificially creates data (for 24,975 of the 25,000 observations in my example above), which calls the error estimates into question and, with them, the statistical tests, confidence limits, and most of the classical regression diagnostics. That is why many of these classical statistics are not produced by SAS Enterprise Miner.
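A small Python sketch of why imputed data undermines error estimates: filling missing values with the observed mean (one simple imputation method, assumed here purely for illustration) adds observations without adding any spread, so the sample variance shrinks. The values and the `mean_impute` helper are hypothetical.

```python
import statistics


def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable: four observed values and four missing ones.
values = [2.0, 4.0, 6.0, 8.0, None, None, None, None]
observed = [v for v in values if v is not None]
imputed = mean_impute(values)

var_observed = statistics.variance(observed)  # sample variance, observed values only
var_imputed = statistics.variance(imputed)    # sample variance after imputation

print(f"variance of observed values: {var_observed:.3f}")
print(f"variance after mean imputation: {var_imputed:.3f}")
```

Any standard error computed from the imputed column inherits this understated variance, which is the sense in which the tests and confidence limits become questionable.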

 

Thankfully, the presence of a large number of observations means that you will typically have holdout data to validate the model empirically. Rather than relying on statistical assumptions, you can break your data into two or three representative samples. A model that performs comparably on the training data and on the validation and/or test data can be trusted for prediction even when multicollinearity is present.
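The holdout idea can be sketched in Python as follows. This is an illustration under stated assumptions, not how Enterprise Miner partitions or fits data: the data are simulated, the two predictors are deliberately almost collinear, and the helpers (`solve`, `fit_ols`, `rmse`) are hypothetical names written for this example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(xs, ys):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(xs[0])
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


def rmse(beta, xs, ys):
    """Root mean squared prediction error of the fitted coefficients."""
    sq = [(y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))


# Simulated data: x2 is almost a copy of x1, so the predictors are highly collinear.
rng = random.Random(42)
xs, ys = [], []
for _ in range(1000):
    x1 = rng.gauss(0, 1)
    x2 = x1 + rng.gauss(0, 0.05)
    xs.append([1.0, x1, x2])  # intercept, x1, x2
    ys.append(3.0 * x1 + rng.gauss(0, 1))

# Rows are generated independently, so taking the first 600 is a random 60/40 split.
train_x, train_y = xs[:600], ys[:600]
valid_x, valid_y = xs[600:], ys[600:]

beta = fit_ols(train_x, train_y)
rmse_train = rmse(beta, train_x, train_y)
rmse_valid = rmse(beta, valid_x, valid_y)

print(f"coefficients: {[round(b, 2) for b in beta]}")
print(f"training RMSE: {rmse_train:.3f}")
print(f"validation RMSE: {rmse_valid:.3f}")
```

The individual coefficients on x1 and x2 are poorly determined because of the collinearity, but the comparison that matters for prediction is whether the validation error stays close to the training error.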


