AmitKB
Fluorite | Level 6
Hi everyone,
I am trying to fit a logistic regression model with PROC LOGISTIC on around 1.3 million rows. I have been able to reduce the candidate variables to 30.
If I run an all-subsets model with no interactions, that is around 2^30 = 1,073,741,824 runs.

1> Is there a way to find out how much time it will take?
2> Is there any technique to perform this many runs more quickly?

Currently I am using stepwise, backward, etc. to run the model. Also, I have PC SAS 9.2 (TS2M0) on the X64_ESRV platform.


Thanks,

Amit
Doc_Duke
Rhodochrosite | Level 12
There is a section on computational resources in the reference manual for LOGISTIC.
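
For a rough sense of scale, the 2^30 figure from the question can be turned into a time estimate by multiplying it by the time one fit takes. The sketch below is plain Python arithmetic, not SAS, and the one-second-per-fit figure is a hypothetical placeholder rather than a measurement; time a single PROC LOGISTIC run on your own data and substitute that value.

```python
# Back-of-envelope estimate of all-subsets run time.
# ASSUMPTION: seconds_per_fit is a hypothetical placeholder; measure it by
# timing one model fit on your own data and hardware.

SECONDS_PER_YEAR = 365 * 24 * 3600


def all_subsets_count(n_vars: int) -> int:
    """Number of subsets of n_vars candidate predictors (empty model included)."""
    return 2 ** n_vars


def estimated_years(n_vars: int, seconds_per_fit: float) -> float:
    """Total time to fit every subset one after another, in years."""
    return all_subsets_count(n_vars) * seconds_per_fit / SECONDS_PER_YEAR


n_models = all_subsets_count(30)
years = estimated_years(30, seconds_per_fit=1.0)
print(f"{n_models:,} models")
print(f"about {years:.1f} years at 1 second per fit")
```

Even if a single fit took a small fraction of a second, the total would still run to months or years, which is why a full enumeration is rarely practical at this size.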

I think that you are on thin ice with this all-possible-regressions approach. You will get the estimates that fit your data best, but they are unlikely to be reproducible. With 1.3M observations, you are also likely to find predictors that are statistically significant without having any business value.
AmitKB
Fluorite | Level 6
Hi,
Thanks for the reply. I will look into the reference manual.

From your comment regarding 1.3M observations, I gather that logistic regression will fail to provide good predictive models. (I am using train and validate datasets for model building.)

1> Is logistic regression not suited for large datasets?
2> What methods would provide better predictive power for large datasets?

I appreciate all your help.

Regards,

Amit
Doc_Duke
Rhodochrosite | Level 12
"I gather that logistic regression will fail to provide good predictive models." That is not what I said at all. A good predictive model may have variables in it that predict well in a statistical sense (e.g., a significant p-value), but are not useful for business decision making.

My other comment ("thin ice") referred to the classic problem in statistics of "multiple comparisons". If you do a billion analyses on 1.3 million observations, as you described in your initial post, you are going to get some models that predict well but are wrong in a business sense. It is not a problem with logistic regression; it is a problem with misapplication.
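
The multiple-comparisons effect can be illustrated with a small simulation. This is a simplified sketch in Python, not SAS and not a logistic regression: it assumes a continuous outcome and uses the absolute Pearson correlation as a stand-in for "how well a candidate predicts". Every candidate is pure noise, so any apparent relationship is due to chance, yet the best of many candidates still looks far stronger than a typical one. The function names, sample size, and candidate count are all hypothetical choices for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_noise_correlation(n_obs, n_candidates, seed=0):
    """Correlate a random outcome with many unrelated random predictors.

    Returns (|r| of the first candidate, largest |r| over all candidates).
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    abs_corrs = []
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to y by construction
        abs_corrs.append(abs(pearson(x, y)))
    return abs_corrs[0], max(abs_corrs)


first, best = best_noise_correlation(n_obs=200, n_candidates=1000)
print(f"|r| of one arbitrary noise predictor:  {first:.3f}")
print(f"largest |r| among 1000 noise predictors: {best:.3f}")
```

Searching a billion candidate models amplifies the same effect, which is one reason to judge the selected model on held-out data rather than on the data used to choose it.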
AmitKB
Fluorite | Level 6
It makes sense. I will take it into consideration.

Thanks for all your help.

Regards,

Amit
StatDave
SAS Super FREQ
See the "Using SELECTION= with many variables" and "Large input data set" sections of this usage note for some ideas:

http://support.sas.com/kb/22607
