A data set with over 5 million records and 30 variables


Posted 06-27-2011 04:39 AM (3157 views)

Hi,

I have a very large data set with over 5 million records and 30 variables, and I plan to run a logistic regression analysis. Given the size of the data set, I am wondering whether I still need to split the data into an estimation sample and a validation sample.

Maybe a validation sample is not needed: because the sample size is so large, the sampling error should be very small.

4 REPLIES


Have you looked at

You can probably get this to run using the MULTIPASS option, though it may take days. We routinely do logistic regressions on multi-million observation data sets.
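A minimal sketch of what that call might look like; the data set and variable names here (big.records, bad, x1-x30) are placeholders for your own:

```sas
/* Sketch: logistic regression on a large data set.
   MULTIPASS rereads the input data as needed instead of
   storing a utility copy, trading run time for disk space.
   All data set and variable names are placeholders. */
proc logistic data=big.records multipass;
   model bad(event='1') = x1-x30;
run;
```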

Take a simple random sample, run it to make sure the model and results look the way they should, and then run the whole thing. Depending on the capacity of your computer, it could take days to weeks to run.
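The sampling step could look something like this, using PROC SURVEYSELECT and the same placeholder names:

```sas
/* Sketch: draw a ~1% simple random sample to check the model
   specification cheaply before committing to the full run. */
proc surveyselect data=big.records out=work.pilot
                  method=srs samprate=0.01 seed=12345;
run;

proc logistic data=work.pilot;
   model bad(event='1') = x1-x30;
run;
```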

We are moving to a SAS Grid Computing environment to ease the bottlenecks, but have taken the above approach for a number of years.

Another approach would be to use the SEMMA methodology of SAS Enterprise Miner. You don't have to have the EM product to use its principles. The thing I would worry about here is whether you have categorical predictors that are both rare and highly influential.
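One way to screen for that up front is a frequency check of each categorical predictor against the response; a sketch with a hypothetical predictor named region:

```sas
/* Sketch: look for sparse cells. A level with very few
   observations (or very few events) can destabilize the fit
   or cause quasi-complete separation. Names are placeholders. */
proc freq data=big.records;
   tables region*bad / nopercent norow nocol;
run;
```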

Doc Muhlbaier

Duke


Ruth,

Let me see if I can turn your thinking around here. You have a very large data set. Because it is large, you should have tight confidence limits on predicted probabilities obtained from a logistic regression model fitted to these data. Now, suppose that you had just 4 million records instead of 5 million records. Would you still say that you would expect tight confidence limits on predicted probabilities? Probably so. (If not, what are the limits on a large sample size?)

Now, if 4 million records are going to produce tight confidence limits on predicted probabilities, then what would it hurt to hold out a million records from the estimation set and use those 1 million observations for subsequent evaluation? It would seem that you have every opportunity in the world to fit and subsequently test a model. What benefit would there be in using 5 million records for model estimation without validation, versus using 4 million records for estimation followed by evaluation on the held-out million?
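A split along those lines could be done with a seeded random draw; a sketch, again with placeholder names:

```sas
/* Sketch: hold out ~20% (about 1 million of 5 million records)
   for validation and keep the rest for estimation. */
data work.estimation work.validation;
   set big.records;
   if ranuni(20110627) < 0.8 then output work.estimation;
   else output work.validation;
run;
```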


Hi Dale, thanks for the reply.

I am trying to interpret your answer. Do you mean that validation is not necessary in my case, since the sample size is extremely large? Because the sample is large, the confidence intervals on the predicted probabilities should be narrow, which would make validation unnecessary. Am I right?

