dschmidt
Fluorite | Level 6

Hi,

 

I have a dataset with ~170 binary dummy variables (clinical indicators) and close to 1,000,000 rows of data. I'm interested in modeling the 170 dummy variables as well as the second order interactions between them.

 

With n = 170, I believe there should be n*(n-1)/2 = 170*169/2 = 14,365 second-order interactions, which makes the dataset very wide (as well as long).
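As a quick sanity check on that count, here is a minimal Python sketch (not SAS) that recomputes the number of pairwise interactions and gives a rough size for the fully expanded design matrix. The 8 bytes per value and the dense, fully materialized layout are assumptions for illustration only; HPGENSELECT may store or build its design matrix differently.

```python
from math import comb

n_main = 170          # binary dummy variables, from the post
n_rows = 1_000_000    # approximate row count, from the post
bytes_per_value = 8   # assumption: each cell stored as an 8-byte double

# Number of distinct pairs of dummies, i.e. n*(n-1)/2
n_pairs = comb(n_main, 2)

# Main effects plus all second-order interactions
n_columns = n_main + n_pairs

# Size of a dense matrix holding every column for every row (assumed layout)
dense_bytes = n_rows * n_columns * bytes_per_value

print(n_pairs)                       # 14365
print(n_columns)                     # 14535
print(round(dense_bytes / 1e9, 1))   # 116.3 (GB)
```

Under those assumptions the expanded matrix would be on the order of 116 GB, which gives a feel for how wide the problem is before any model fitting starts.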

 

I've tried writing a model call to HPGENSELECT into a text file with all variables, but even with a small subset of observations (1000), the code ran for a very long time before I finally killed it. 

 

Do you have any suggestions for getting something like this to run? A valid answer is "that's a bad idea, why would you do that" 🙂

 

SAS Version: 9.04M5

 

Thanks in advance!

1 ACCEPTED SOLUTION

SteveDenham
Jade | Level 19

Well, I think it's a bad idea, but that's just my opinion.  What you are describing is a "throw everything at the data, look at what shows up, and try to make sense of it" approach.  However, I will wager a fair amount that you (or the literature) have some expert knowledge about the variables and their relative importance.  I would start there.  Then, rather than regression, I would consider approaches like decision trees and variable clustering.  Check out the SAS Data Mining and Machine Learning community for info on these approaches.

 

SteveDenham


dschmidt
Fluorite | Level 6

Thanks, appreciate the feedback! I was planning on using the lasso as an exploratory tool, but I suppose something like a decision tree might do the same job better 🙂
