06-27-2011 04:39 AM
I have a very large data set with over 5 million records and 30 variables. I plan to do a logistic regression analysis. Given the size of the data set, I am wondering whether I still need to split the data into an estimation sample and a validation sample.
Maybe the validation sample is not needed. Because the sample size is very large, the sampling error should be very small.
06-27-2011 09:55 AM
Have you looked at
You can probably get this to run using the MULTIPASS option, though it may take days. We routinely do logistic regressions on multi-million observation data sets.
Take a simple random sample, run it to make sure the model and result look like they should and then run the whole thing. Depending on the capacity of your computer, it could take days to weeks to run.
We are moving to a SAS Grid Computing environment to ease the bottlenecks, but have taken the above approach for a number of years.
Another approach would be to use the SEMMA approach of SAS Enterprise Miner. You don't have to have the EM product to use the principles. The thing I would worry about here is whether you have categorical predictors that are both rare and highly influential.
06-27-2011 12:40 PM
Let me see if I can turn your thinking around here. You have a very large data set. Because it is large, you should have tight confidence limits on predicted probabilities obtained from a logistic regression model fitted to these data. Now, suppose that you had just 4 million records instead of 5 million records. Would you still say that you would expect tight confidence limits on predicted probabilities? Probably so. (If not, what are the limits on a large sample size?)
Now, if 4 million records are going to produce tight confidence limits on predicted probabilities, then what would it hurt you to hold out a million records from the estimation set and use those 1 million observations for subsequent evaluation? It would seem that you have every opportunity in the world to obtain and subsequently test a model. What benefit would there be to you in using 5 million records for model estimation without model validation vs. using 4 million records for model estimation followed by model evaluation?
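To make the hold-out idea concrete, here is a minimal Python sketch of a random split into estimation and validation rows. It is only an illustration, not a SAS procedure: the function name `split_estimation_validation`, the seed, and the scaled-down sizes (50,000 records standing in for 5 million, with the same 4:1 ratio) are all assumptions made for the example.

```python
import random

def split_estimation_validation(n_records, n_validation, seed=2011):
    """Randomly partition row indices 0..n_records-1 into two disjoint sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    # Draw the validation rows without replacement.
    validation = set(rng.sample(range(n_records), n_validation))
    # Every row that was not drawn goes to the estimation set.
    estimation = [i for i in range(n_records) if i not in validation]
    return estimation, sorted(validation)

# Scaled-down stand-in for 5 million records with 1 million held out.
estimation_rows, validation_rows = split_estimation_validation(50_000, 10_000)
print(len(estimation_rows), len(validation_rows))  # 40000 10000
```

The model would be fitted on the estimation rows only, and the validation rows would be touched just once, to check how the fitted model performs on data it has not seen.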
06-30-2011 05:17 AM
Hi Dale, thanks for the reply.
I am trying to interpret your answer. Do you mean that a validation sample is not necessary in my case because the sample size is extremely large? Because the sample is large, the confidence intervals on the predicted probabilities should be narrow, which makes validation unnecessary. Am I right?
06-30-2011 12:15 PM
No! Quite the opposite! I am saying that with the volume of data that you have, there is every reason to hold out a validation sample of, say, 1 million records. You would still have a very large sample (4 million records) for estimation. Confidence intervals of predicted probabilities will not be much larger for a model constructed from 4 million observations compared to a model constructed from 5 million observations. Thus, you don't lose much in the way of model estimation and you gain much by having a validation sample where you can test your model.
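As a rough check on how little is lost, the arithmetic can be sketched in a few lines of Python. This assumes the usual large-sample behavior in which standard errors, and therefore confidence interval widths, shrink in proportion to 1/sqrt(n); the function name `relative_ci_width` is made up for the illustration.

```python
import math

def relative_ci_width(n_full, n_reduced):
    """Ratio of CI widths (reduced sample / full sample), assuming width ~ 1/sqrt(n)."""
    return math.sqrt(n_full / n_reduced)

ratio = relative_ci_width(5_000_000, 4_000_000)
print(f"{ratio:.3f}")  # 1.118
```

Under that assumption, intervals from the 4 million record model are only about 12% wider than those from the 5 million record model, and both are already very narrow at this sample size.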