lichee
Quartz | Level 8

Hi all,

I'm trying to use PROC HPGENSELECT with METHOD=LASSO to select covariates. I have about 370,000 observations. When I tested the code with only 8 covariates, it finished in less than twenty minutes. When I included over 300 binary covariates (&add300cov.), it ran overnight and still had not finished. Can anyone see anything I should modify in my code below? Thanks a lot!

proc hpgenselect data=population;
    class female(ref='0') race_ethncty(ref='1') &add300cov_ref.;
    model success(event="1") = age female race_ethncty var4 var5 var6 var7 var8 &add300cov.
          / dist=binary include=(age female race_ethncty var4 var5 var6 var7 var8);
    selection method=lasso;
run;


Accepted Solutions
PaigeMiller
Diamond | Level 26

Fitting a model with 300 binary covariates will most likely take an extremely long time, and I don't think there's any way around that, even with an HP PROC. But PROC HPGENSELECT has optimization options and tolerance options; you could try those and see if anything helps (I'm guessing they would).
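
A rough sketch of the optimization and tolerance options on the original call might look like the following. The option names (LASSOSTEPS=, LASSOTOL=, MAXTIME=) are given as recalled from the PROC HPGENSELECT statement documentation, and their placement on the PROC statement is an assumption to verify against your SAS release. The values are purely illustrative, not tuned recommendations, and it is worth checking whether MAXTIME= is honored along the LASSO path.

```sas
/* Sketch only: values are illustrative; verify option names and placement in your release. */
proc hpgenselect data=population
        lassosteps=10     /* assumed: cap on the number of LASSO steps             */
        lassotol=1e-4     /* assumed: looser LASSO convergence tolerance           */
        maxtime=7200;     /* assumed: upper limit, in seconds, on optimization time */
    class female(ref='0') race_ethncty(ref='1') &add300cov_ref.;
    model success(event="1") = age female race_ethncty var4 var5 var6 var7 var8 &add300cov.
          / dist=binary include=(age female race_ethncty var4 var5 var6 var7 var8);
    selection method=lasso;
run;
```

Loosening the tolerance or capping the number of steps trades some precision in the selected model for run time, so it makes sense to compare the result against a stricter run on a smaller covariate set.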

 

As for what else to try, you could check which of the binary predictors are most strongly associated with the binary response using a chi-squared test (a two-way table in PROC FREQ with the CHISQ option will get you there), and then pick just a few of the best binary predictors to use in the model.
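
The screening step could be sketched roughly as below. It assumes that &add300cov. expands to a space-separated list of binary variable names; the output data set names work.chisq_all and work.chisq_rank are hypothetical. With about 370,000 observations nearly every p-value is likely to be tiny, so the sketch ranks on the chi-square value itself, which is comparable across candidates because every table here is 2x2 with one degree of freedom.

```sas
/* Sketch only: one two-way table per candidate predictor against the response. */
ods exclude all;                        /* suppress hundreds of printed tables  */
ods output ChiSq=work.chisq_all;        /* capture the chi-square statistics    */

proc freq data=population;
    tables (&add300cov.) * success / chisq;
run;

ods exclude none;

/* Keep only the Pearson chi-square row from each table. */
data work.chisq_rank;
    set work.chisq_all;
    where Statistic = 'Chi-Square';
run;

/* Strongest associations first. */
proc sort data=work.chisq_rank;
    by descending Value;
run;
```

The Table column in work.chisq_rank identifies which predictor each row belongs to, so the top few rows give a shortlist to add to the eight forced-in covariates.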

 

Thinking outside the box, another approach is to use logistic Partial Least Squares with all 300 binary predictors. I have no doubt that even with 300 binary predictors it would finish much more quickly, and I would be surprised if it took even an hour. However, the only software I know of that does this is in R (https://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf).

--
Paige Miller

ballardw
Super User

By any chance did you turn on any system performance monitor while that code was running?

 

I suspect you might see a lot of disk activity and possibly very high memory usage. When a calculation involves this many effects, you may find that SAS is spending more of its time writing and reading data to and from temporary storage than actually computing.
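
If an operating-system monitor is not available, a rough first check from inside SAS could look like the sketch below. It reruns only the small 8-covariate model as a timing baseline, so the step actually finishes and writes its notes. The FULLSTIMER system option and the DETAILS option of the PERFORMANCE statement are standard, but how much detail they report depends on your platform and release.

```sas
/* Sketch only: timing baseline on the small model, not the full 300-covariate run. */
options fullstimer;        /* log real time, CPU time, and memory for each step */

proc hpgenselect data=population;
    class female(ref='0') race_ethncty(ref='1');
    model success(event="1") = age female race_ethncty var4 var5 var6 var7 var8
          / dist=binary;
    selection method=lasso;
    performance details;   /* timing breakdown reported by the procedure itself */
run;

options nofullstimer;      /* restore the default, shorter log notes */
```

In the log, real time that is far larger than CPU time usually points to the step waiting on disk or other I/O rather than computing, and the memory figures give an idea of how the requirement might grow as covariates are added.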
