Hi,
I have a dataset with ~170 binary dummy variables (clinical indicators) and close to 1,000,000 rows. I'm interested in modeling the 170 dummy variables as well as the second-order interactions between them.
With n = 170, I believe there should be n*(n-1)/2 = 14,365 second-order interactions, which makes the dataset very wide (as well as long).
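As a quick check of that arithmetic, here is a minimal Python sketch (not SAS) that counts the pairwise terms and gives a rough size for the expanded design matrix. The 8-bytes-per-value dense storage figure is an assumption for illustration only; it is not a statement about how HPGENSELECT actually holds the data in memory.

```python
from math import comb

n_main = 170          # binary indicators, from the post
n_rows = 1_000_000    # approximate row count, from the post

n_pairs = comb(n_main, 2)       # n*(n-1)/2 unordered pairs
n_columns = n_main + n_pairs    # main effects plus pairwise interactions

# Assumption: every value is stored densely as an 8-byte number.
dense_gb = n_rows * n_columns * 8 / 1e9

print(n_pairs, n_columns, round(dense_gb, 1))
```

Under that assumption, the full set of main effects and interactions would take on the order of a hundred gigabytes if materialized, which is one way to see why the problem is hard before any model fitting starts.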
I've tried writing an HPGENSELECT model call that lists all of these variables out to a text file, but even with a small subset of observations (1,000), the code ran for a very long time before I finally killed it.
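For anyone wanting to reproduce the setup, the term list could be generated along these lines. This is only a sketch in Python: the names x1 to x170 and the helper pairwise_terms are hypothetical (the real column names are not given in the post), and a*b is the usual SAS notation for an interaction term.

```python
from itertools import combinations

def pairwise_terms(names):
    """Return the main effects followed by every two-way product term."""
    main = list(names)
    pairs = [f"{a}*{b}" for a, b in combinations(main, 2)]
    return main + pairs

# Hypothetical variable names; substitute the actual indicator columns.
names = [f"x{i}" for i in range(1, 171)]
terms = pairwise_terms(names)

# Space-separated right-hand side, ready to paste into a MODEL statement.
model_rhs = " ".join(terms)
```

The resulting string holds 14,535 terms, so the generated statement itself is very long even before the procedure starts fitting.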
Do you have any suggestions for getting something like this to run? A valid answer is "that's a bad idea, why would you do that" 🙂
SAS Version: 9.04M5
Thanks in advance!
Well, I think it's a bad idea, but that's just my opinion. What you are describing is a "throw everything at the data, look at what shows up, and try to make sense of it" approach. However, I will wager a fair amount that you (or the literature) have some expert knowledge about the variables and their relative importance. I would start there. Then, rather than regression, I would consider approaches like decision trees and variable clustering. Check out the SAS Data Mining and Machine Learning community for info on these approaches.
SteveDenham
Thanks, appreciate the feedback! I was planning on using the lasso as an exploratory tool, but I suppose something like a decision tree might do the same job better 🙂