dschmidt
Fluorite | Level 6

Hi,

 

I have a dataset with ~170 binary dummy variables (clinical indicators) and close to 1,000,000 rows of data. I'm interested in modeling the 170 dummy variables as well as the second-order (pairwise) interactions between them.

 

I believe there should be n*(n-1)/2 = 14,365 second-order interactions for n = 170, which makes the dataset very wide (as well as long).
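
For anyone who wants to check that count, here is a minimal Python sketch of the arithmetic. The storage estimate is only a rough illustration: it assumes every main effect and interaction is held as a dense 8-byte numeric column and rounds the row count to 1,000,000. Neither assumption comes from the original post, and a procedure may never materialize the design matrix this way.

```python
from math import comb

n_dummies = 170          # binary indicators described in the post
n_rows = 1_000_000       # "close to 1,000,000" rows, rounded for the estimate

# Number of distinct pairwise (second-order) interactions: n*(n-1)/2
n_pairs = comb(n_dummies, 2)

# Main effects plus all pairwise interactions
n_columns = n_dummies + n_pairs

# Assumption: each value stored as a dense 8-byte number
BYTES_PER_VALUE = 8
dense_gb = n_rows * n_columns * BYTES_PER_VALUE / 1e9

print(n_pairs)             # 14365
print(n_columns)           # 14535
print(round(dense_gb, 1))  # 116.3
```

Under those assumptions the fully expanded design would be roughly 116 GB, which gives a sense of why the width matters as much as the length.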

 

I tried writing the HPGENSELECT model call, with all of the variables, into a text file, but even with a small subset of observations (1,000) the code ran for a very long time before I finally killed it.

 

Do you have any suggestions for getting something like this to run? A valid answer is "that's a bad idea, why would you do that" 🙂

 

SAS Version: 9.04M5

 

Thanks in advance!

Accepted Solution
SteveDenham
Jade | Level 19

Well, I think it's a bad idea, but that's just my opinion.  What you are describing is a "throw everything at the data, look at what shows up, and try to make sense of it" approach.  However, I will wager a fair amount that you (or the literature) have some expert knowledge about the variables and their relative importance.  I would start there.  Then, rather than regression, I would consider approaches like decision trees and variable clustering.  Check out the SAS Data Mining and Machine Learning community for info on these approaches.

 

SteveDenham


dschmidt
Fluorite | Level 6

Thanks, appreciate the feedback! I was planning on using the lasso as an exploratory tool, but I suppose something like a decision tree might do the same job better 🙂
