Performance Question

01-07-2010 11:13 AM

Hi everyone,

I am trying to run a PROC LOGISTIC regression model on around 1.3 million rows. I have been able to reduce the variables to 30.

If I run an all-subsets search with no interactions, that is around 2^30 = 1,073,741,824 runs.

1> Is there a way to find out how much time it will take?

2> Is there any technique to perform this many runs more quickly?

Currently I am using stepwise, backward, etc. to run the model. Also, I have PC SAS 9.2 (TS2M0) on the X64_ESRV platform.

Thanks,

Amit

01-07-2010 02:17 PM

There is a section on computational resources in the reference manual for LOGISTIC.

I think that you are on thin ice with this all-possible-regressions approach. You will get estimates that fit the best for your data, but that are unlikely to be reproducible. With 1.3M observations, you are also likely to find predictors that are statistically significant without having any business value.


01-07-2010 02:48 PM

Hi ,

Thanks for the reply. I will look into the reference manual.

From your comment regarding 1.3M observations, I gather that logistic regression will fail to provide good predictive models. (I am using train and validate datasets for model building.)

1> Is logistic regression not suited for large datasets?

2> What methods would provide better predictive power for large datasets?

I appreciate all your help.

Regards,

Amit


01-07-2010 04:42 PM

"I gather that logistic regression will fail to provide good predictive models." <-- that is not what I said at all. A good predictive model may have variables in it that predict well in a statistical sense (e.g. significant p-value), but are not useful for business decision making.

My other comment ("thin ice") referred to the classic problem in statistics of "multiple comparisons". If you do a billion analyses on 1.3 million observations, as you described in your initial post, you are going to get some models that predict well but are wrong in a business sense. It is not a problem with logistic regression; it is a problem with misapplication.
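
The train and validate datasets you mentioned are the standard guard against this. As a sketch of one way to do the split and check in SAS (data set and variable names are placeholders):

```sas
/* Flag a random 70% of rows for training (SELECTED=1) */
proc surveyselect data=modeldata out=split samprate=0.7
                  outall seed=12345;
run;

/* Select the model on the training rows only, saving the fit */
proc logistic data=split(where=(selected=1)) descending
              outmodel=trainfit;
   model resp = x1-x30 / selection=stepwise;
run;

/* Score the held-out 30% to see whether the model generalizes */
proc logistic inmodel=trainfit;
   score data=split(where=(selected=0)) out=validscored;
run;
```

A model whose fit degrades badly on the held-out 30% is exactly the kind of multiple-comparisons artifact described above.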


01-08-2010 10:40 AM

It makes sense. I will take it into consideration.

Thanks for all your help.

Regards,

Amit


01-15-2010 02:11 PM

See the "Using SELECTION= with many variables" and "Large input data set" sections of this usage note for some ideas:

http://support.sas.com/kb/22607
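
For instance, the "Large input data set" suggestion there amounts to doing the expensive variable selection on a random sample, then refitting the chosen model once on all 1.3M rows. A sketch with placeholder names:

```sas
/* Draw a simple random sample for the selection step */
proc surveyselect data=modeldata out=sample method=srs
                  sampsize=100000 seed=2010;
run;

/* Run the expensive selection on the sample only */
proc logistic data=sample descending;
   model resp = x1-x30 / selection=backward slstay=0.01;
run;

/* Then refit the surviving predictors once on the full data,
   e.g. model resp = x2 x7 x15; (hypothetical survivors) */
```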
