Programming the statistical procedures from SAS

Performance Question

Frequent Contributor
Posts: 99

Hi everyone,
I am trying to fit a PROC LOGISTIC regression model on around 1.3 million rows. I have been able to reduce the candidate variables to 30.
If I run an all-subsets model with no interactions, that is around 2^30 = 1,073,741,824 runs.

1> Is there a way to find out how much time it will take?
2> Is there any technique to perform this many runs more quickly?

Currently I am using stepwise, backward, etc. to run the model. I have PC SAS 9.2 (TS2M0) on the X64_ESRV platform.
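
For a rough sense of scale on question 1, the arithmetic can be sketched outside SAS. This is only a back-of-the-envelope Python sketch: the function name `estimate_total_hours` and the per-fit timings are placeholders (assumptions, not measurements), and the real per-fit time would have to come from timing a few actual PROC LOGISTIC fits on the 1.3 million rows.

```python
def estimate_total_hours(n_models: int, seconds_per_fit: float, n_workers: int = 1) -> float:
    """Rough wall-clock estimate in hours.

    Assumes every fit takes about the same time and that work splits
    evenly across n_workers; both are simplifications.
    """
    return n_models * seconds_per_fit / n_workers / 3600.0


n_predictors = 30
n_models = 2 ** n_predictors  # every subset of 30 predictors, no interactions

# seconds_per_fit values are hypothetical; replace them with a measured timing.
for seconds_per_fit in (0.1, 1.0, 10.0):
    hours = estimate_total_hours(n_models, seconds_per_fit)
    years = hours / 24 / 365
    print(f"{seconds_per_fit:>5} s/fit -> {hours:,.0f} hours (about {years:,.1f} years)")
```

Even at one second per fit, a single process would need decades to work through all 2^30 subsets, so an exhaustive run is not practical without drastically cutting the number of candidate models.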


Thanks,

Amit
Trusted Advisor
Posts: 2,114

Re: Performance Question

There is a section on computational resources in the reference manual for LOGISTIC.

I think that you are on thin ice with this all-possible-regressions approach. You will get estimates that fit your data best, but they are unlikely to be reproducible. With 1.3M observations, you are also likely to find predictors that are statistically significant without having any business value.
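
To illustrate the point about significance versus business value, here is a small Python sketch. It uses a plain two-proportion z-test rather than logistic regression, and the response rates (10.0% vs 10.2%) and group sizes are made-up numbers chosen only to show the effect of sample size.

```python
import math


def two_proportion_p_value(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical response rates that differ by only 0.2 percentage points.
p_small_sample = two_proportion_p_value(0.100, 0.102, 1_000, 1_000)
p_large_sample = two_proportion_p_value(0.100, 0.102, 650_000, 650_000)

print(f"n = 1,000 per group:   p = {p_small_sample:.3f}")
print(f"n = 650,000 per group: p = {p_large_sample:.6f}")
```

The same tiny difference is nowhere near significant with a thousand rows per group, yet it is clearly significant with 650,000 per group. Whether a 0.2-point difference matters is a business question that the p-value cannot answer.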
Frequent Contributor
Posts: 99

Re: Performance Question

Hi,
Thanks for the reply. I will look into the reference manual.

From your comment regarding 1.3M observations, I gather that logistic regression will fail to provide good predictive models. (I am using training and validation datasets for model building.)

1> Is logistic regression not suited for large datasets?
2> What methods would provide better predictive power for large datasets?

I appreciate all your help.

Regards,

Amit
Trusted Advisor
Posts: 2,114

Re: Performance Question

"I gather that logistics regression will fail to provide good predictive models." <-- that is not what I said at all. A good predictive model may have variables in it that predict well in a statistical sense (e.g. significant p-value), but are not useful for business decision making.

My other comment ("thin ice") referred to the classic problem in statistics of "multiple comparisons". If you do a billion analyses on 1.3 million observations, as you described in your initial post, you are going to get some models that predict well but are wrong in a business sense. It is not a problem with logistic regression; it is a problem with misapplication.
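
The scale of the multiple-comparisons problem can be sketched with two textbook formulas. This Python sketch assumes the tests are independent and all null hypotheses are true. Neither holds for all-subsets regression, where the models overlap heavily, so treat the numbers as an illustration of scale only. The function names are just for this example.

```python
import math


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests that are 'significant' by chance alone."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """P(at least one false positive) = 1 - (1 - alpha)^n_tests.

    Computed via logs so that very large n_tests does not cause trouble.
    """
    return 1.0 - math.exp(n_tests * math.log1p(-alpha))


alpha = 0.05
for n_tests in (1, 10, 100, 2 ** 30):
    print(
        f"{n_tests:>13,} tests: "
        f"expected false positives = {expected_false_positives(n_tests, alpha):,.2f}, "
        f"P(at least one) = {familywise_error_rate(n_tests, alpha):.4f}"
    )
```

By a hundred tests a chance finding is already close to certain, and at a billion analyses there is no meaningful way to tell the chance winners from the real ones without a separate validation step.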
Frequent Contributor
Posts: 99

Re: Performance Question

It makes sense. I will take it into consideration.

Thanks for all your help.

Regards,

Amit
SAS Employee
Posts: 245

Re: Performance Question

See the "Using SELECTION= with many variables" and "Large input data set" sections of this usage note for some ideas:

http://support.sas.com/kb/22607