Statistical Procedures

Ruth · Posted 06-27-2011 04:39 AM

Hi,

I have a very large data set with over 5 million records and 30 variables, and I plan to run a logistic regression. Given the size of the data set, I am wondering whether I still need to split the data into a calculation (estimation) sample and a validation sample.

Maybe the validation sample is not needed. Because the sample size is very large, the sampling error should be very small.

Doc_Duke · Posted 06-27-2011 09:55 AM

Have you looked at

http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_logistic_sec... ?

You can probably get this to run using the MULTIPASS option, though it may take days. We routinely run logistic regressions on data sets with millions of observations.

Take a simple random sample, fit the model on it to make sure the model and results look as they should, and then run the whole thing. Depending on the capacity of your computer, the full run could take days to weeks.

We are moving to a SAS Grid Computing environment to ease the bottlenecks, but have taken the above approach for a number of years.

Another option would be to follow the SEMMA approach (Sample, Explore, Modify, Model, Assess) of SAS Enterprise Miner. You don't have to have the EM product to use the principles. The thing that I would worry about here is whether you have categorical predictors that are both rare and highly influential.

Doc Muhlbaier

Duke

Dale · Posted 06-27-2011 12:40 PM

Ruth,

Let me see if I can turn your thinking around here. You have a very large data set. Because it is large, you should have tight confidence limits on predicted probabilities obtained from a logistic regression model fitted to these data. Now, suppose that you had just 4 million records instead of 5 million records. Would you still say that you would expect tight confidence limits on predicted probabilities? Probably so. (If not, what are the limits on a large sample size?)

Now, if 4 million records are going to produce tight confidence limits on predicted probabilities, then what would it hurt you to hold out a million records from the estimation set and use those 1 million observations for subsequent evaluation? It would seem that you have every opportunity in the world to obtain and subsequently test a model. What benefit would there be to you in using 5 million records for model estimation without model validation vs. using 4 million records for model estimation followed by model evaluation?
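
As a rough back-of-the-envelope check of this argument, assume the usual large-sample behaviour in which standard errors shrink in proportion to $1/\sqrt{n}$. That scaling is an assumption added here for illustration; it is not stated in the thread. Under it, the standard error (and so the confidence interval width) at 4 million records relative to 5 million records is approximately:

```latex
\[
\frac{\mathrm{SE}_{4\,\mathrm{million}}}{\mathrm{SE}_{5\,\mathrm{million}}}
\approx \sqrt{\frac{5{,}000{,}000}{4{,}000{,}000}}
= \sqrt{1.25}
\approx 1.118
\]
```

In other words, under that assumption the intervals would be only about 12% wider with 4 million records than with 5 million.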

Ruth · Posted 06-30-2011 05:17 AM

Hi Dale, thanks for the reply.

I am trying to interpret your answer. Do you mean that validation is not necessary in my case, since the sample size is extremely large? Because the sample is large, the confidence intervals for the predicted probabilities should be narrow, which makes validation unnecessary. Am I right?

Dale · Posted 06-30-2011 12:15 PM

No! Quite the opposite! I am saying that with the volume of data that you have, there is every reason to hold out a validation sample of, say, 1 million records. You would still have a very large sample (4 million records) for estimation. Confidence intervals of predicted probabilities will not be much larger for a model constructed from 4 million observations compared to a model constructed from 5 million observations. Thus, you don't lose much in the way of model estimation and you gain much by having a validation sample where you can test your model.
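
A minimal sketch of the kind of holdout split described above, written in Python rather than SAS purely for illustration. The function name `split_holdout`, the seed, and the small record counts in the usage lines are hypothetical. In the thread's setting the counts would be 5 million records with 1 million held out.

```python
import random


def split_holdout(n_records, n_holdout, seed=2011):
    """Randomly partition record indices into estimation and validation sets.

    Hypothetical helper for illustration; not part of any SAS procedure.
    """
    if not 0 < n_holdout < n_records:
        raise ValueError("n_holdout must be between 1 and n_records - 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    # Simple random sample without replacement for the validation set.
    validation = set(rng.sample(range(n_records), n_holdout))
    # Every record not drawn for validation goes to the estimation set.
    estimation = [i for i in range(n_records) if i not in validation]
    return estimation, sorted(validation)


# Small stand-in for the thread's 5 million records with 1 million held out.
estimation_idx, validation_idx = split_holdout(n_records=50, n_holdout=10)
print(len(estimation_idx), len(validation_idx))
```

The two index lists are disjoint and together cover every record. The model would be fitted on the estimation records only and then evaluated on the validation records, which it has never seen.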
