Performance Question

Posted 01-07-2010 11:13 AM
(1309 views)

Hi everyone,

I am trying to fit a logistic regression model with PROC LOGISTIC on about 1.3 million rows. I have been able to reduce the number of candidate variables to 30.

If I run an all-subsets search with no interactions, that is about 2^30 = 1,073,741,824 models.

1> Is there a way to find out how much time it will take?

2> Is there any technique to perform this many runs more quickly?

Currently I am using stepwise, backward, etc. to build the model. I am running PC SAS 9.2 (TS2M0) on the X64_ESRV platform.

Thanks,

Amit
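
For a rough sense of scale, the arithmetic in the question can be sketched in Python. This is only a back-of-the-envelope estimate, not a SAS feature: the function names are made up for illustration, and the one-second-per-fit figure is an assumption, not a measurement (timing a single fit on the real data would give a better number).

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def all_subsets_count(n_predictors: int) -> int:
    """Number of main-effects models when each predictor is either in or out."""
    return 2 ** n_predictors


def estimated_runtime_years(n_models: int, seconds_per_fit: float) -> float:
    """Total sequential run time in years, assuming every fit takes the same time."""
    return n_models * seconds_per_fit / SECONDS_PER_YEAR


if __name__ == "__main__":
    n_models = all_subsets_count(30)
    # Assumed (hypothetical) cost of one fit on 1.3 million rows: 1 second.
    years = estimated_runtime_years(n_models, seconds_per_fit=1.0)
    print(f"{n_models:,} models, roughly {years:.1f} years at 1 second per fit")
```

Even under that optimistic assumption the total comes to roughly 34 years of sequential computing, which is why an exhaustive refit of every subset is not practical here.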

5 REPLIES

There is a section on computational resources in the reference manual for LOGISTIC.

I think that you are on thin ice with this all-possible-regressions approach. You will get estimates that fit the best for your data, but that are unlikely to be reproducible. With 1.3M observations, you are also likely to find predictors that are statistically significant without having any business value.

Hi,

Thanks for the reply. I will look into the reference manual.

From your comment regarding 1.3M observations, I gather that logistic regression will fail to provide good predictive models. (I am using train and validate datasets for model building.)

1> Is logistic regression not suited for large datasets?

2> What methods would provide better predictive power for large datasets?

I appreciate all your help.

Regards,

Amit

"I gather that logistic regression will fail to provide good predictive models." That is not what I said at all. A good predictive model may have variables in it that predict well in a statistical sense (e.g., a significant p-value) but are not useful for business decision making.

My other comment ("thin ice") referred to the classic problem in statistics of "multiple comparisons". If you do a billion analyses on 1.3 million observations, as you described in your initial post, you are going to get some models that predict well but are wrong in a business sense. It is not a problem with logistic regression; it is a problem with misapplication.
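
The multiple-comparisons point can be illustrated with a small Python sketch. It uses a deliberately simplified setting that is an assumption, not a description of the all-subsets search itself: independent tests of true null hypotheses, each at the same significance level `alpha`. The function names are made up for illustration.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of null tests that come out 'significant' by chance."""
    return n_tests * alpha


if __name__ == "__main__":
    for n in (1, 30, 1000):
        rate = familywise_error_rate(n)
        count = expected_false_positives(n)
        print(f"{n} tests: P(any false positive) = {rate:.3f}, expected = {count:.1f}")
```

With 30 independent tests at the 0.05 level, the chance of at least one spurious "significant" result is already close to 79%. The models in an all-subsets search are not independent, so these figures do not carry over directly, but the direction of the effect is the same: the more candidates examined, the more likely some look good by chance.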

It makes sense. I will take it into consideration.

Thanks for all your help.

Regards,

Amit

See the "Using SELECTION= with many variables" and "Large input data set" sections of this usage note for some ideas:

http://support.sas.com/kb/22607
