Re: When are there too many observations?

sasnewbie12 · Posted 11-18-2017 09:07 AM

Dear community,

I am running a study over five years with survey data, and the weighted number of patients included is about 30 million in total.

I will be assessing the relationship between two variables. At what point does the number of observations become so large that everything starts to show significance (as I have heard)? I am subsetting the observations by a classification variable that produces about 20 subsets, so the numbers will ultimately be smaller, but I would still like to know how to interpret results from data in the millions. Or does this issue only arise once we get into the billions?

Thanks

art297 · Posted 11-18-2017 11:18 AM

I'm not a statistician, so I can't provide a definitive answer to your question. In fact, I'm only responding for two reasons: (1) no one has responded yet after two hours and (2) this will ensure that I get to see the other responses you will get.

A nice, easy to read blog provides part of your answer: https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiW4qfbwMjXA...

However, when is big too big? I think it depends on a number of factors, including the type of analysis, practical significance, and the power of the test. In answer to your question about millions vs billions, I think the number is typically much, much smaller (e.g., hundreds vs thousands), but it depends on the combination of the variability in your data and how large a difference has to be before you would consider it practically (not just statistically) significant.

Regardless, I completed my statistical studies too, too many years ago, so I am more interested in current thinking (my second reason for responding).

Art, CEO, AnalystFinder.com

PaigeMiller · Posted 11-18-2017 11:26 AM

As Art said, the answer depends on many things. It's impossible to give a definitive answer. However, if you have 30 million records, there's a good chance you are in the situation where nearly everything shows up as statistically significant.

Just as a piece of information, people who do national political surveys (in the United States) will poll about 1500 people to get the desired results.
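
To put the 1500 figure in perspective, here is a minimal Python sketch of the usual margin-of-error approximation for a proportion. It assumes a simple random sample and the worst case p = 0.5, and it ignores survey weights and design effects, so treat the numbers as rough; the function name `margin_of_error` is just an illustrative choice.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare a typical poll with a sample in the tens of millions.
for n in (1_500, 30_000_000):
    moe = margin_of_error(n)
    print(f"n = {n:>10,}: about +/- {100 * moe:.3f} percentage points")
```

With about 1500 respondents the margin is roughly two and a half percentage points, which is usually precise enough for a poll; going to millions of respondents buys precision far beyond what anyone would act on.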

You probably want to come up with some estimate of practical significance if possible (for example, deciding that a difference of less than 5 units is of no importance to anyone). If the widths of your confidence intervals are an order of magnitude smaller than the smallest difference of practical significance, then you have reached the point where your sample is larger than you need.
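
As a rough illustration of that rule of thumb, the following Python sketch compares the half-width of a 95% confidence interval for a mean against a practical-significance threshold. The standard deviation of 20 is a made-up value, the threshold of 5 units comes from the example above, and the helper names (`ci_half_width`, `sample_is_oversized`) are hypothetical.

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * sd / math.sqrt(n)

def sample_is_oversized(sd, n, practical_diff, factor=10):
    """True if the CI half-width is at least `factor` times smaller
    than the smallest difference that matters in practice."""
    return ci_half_width(sd, n) * factor <= practical_diff

sd = 20.0              # assumed standard deviation (illustrative only)
practical_diff = 5.0   # smallest difference anyone would care about

for n in (1_500, 30_000_000):
    hw = ci_half_width(sd, n)
    flag = sample_is_oversized(sd, n, practical_diff)
    print(f"n = {n:>10,}: half-width = {hw:.4f}, oversized = {flag}")
```

The factor of 10 encodes "an order of magnitude"; it is a judgment call, not a formal criterion.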

--
Paige Miller

PGStats · Posted 11-19-2017 12:10 AM

Having lots of data is a blessing.

There are not many reasons I can think of for subsampling your data AFTER it is collected:

1) To reduce computing cost
2) To assess model performance (data is divided into training and test subsets)
3) To avoid pseudoreplication by avoiding units that might be correlated (in time, space, or otherwise)
4) To evaluate the small sample performance of an estimator

Otherwise, having more data simply shrinks your confidence intervals, eventually to the point where zero is not included (unless the true effect is exactly zero).
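
A small Python sketch of that shrinking effect, using a two-sample z-test with a known common standard deviation. The true difference of 0.01 standard deviations and the group sizes are assumptions chosen only for illustration, and `two_sample_z_pvalue` is a hypothetical helper, not a library function.

```python
import math

def two_sample_z_pvalue(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two
    equal-sized groups with a known common standard deviation."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))

diff, sd = 0.01, 1.0   # a tiny, practically negligible difference (assumed)

for n in (1_000, 100_000, 15_000_000):
    p = two_sample_z_pvalue(diff, sd, n)
    print(f"n per group = {n:>10,}: p = {p:.3g}")
```

The difference itself never changes; only the standard error does. That is why the effect size and its confidence interval are more informative than the p-value once n is very large.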

PG

Ksharp · Posted 11-19-2017 05:02 AM

If there are too many observations, then everything looks good (significant) for a statistical model. I would suggest drawing a random sample from the data and using cross-validation to assess the model.
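
A minimal Python sketch of that idea: draw a random subsample, then estimate out-of-sample error with k-fold cross-validation. Everything here is an assumption for illustration: the data are synthetic, the model is a simple least-squares line, and the names `fit_line` and `cv_mse` are hypothetical. In SAS the sampling step would more likely be done with a procedure such as PROC SURVEYSELECT.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def cv_mse(xs, ys, k=5, seed=0):
    """Mean squared prediction error averaged over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
        fold_errors.append(sq_err / len(fold))
    return sum(fold_errors) / k

# Synthetic "large" data set with a very weak true relationship.
rng = random.Random(42)
big_x = [rng.gauss(0, 1) for _ in range(50_000)]
big_y = [0.01 * x + rng.gauss(0, 1) for x in big_x]

# Random subsample without replacement, then cross-validate on it.
sample_idx = rng.sample(range(len(big_x)), 2_000)
sub_x = [big_x[i] for i in sample_idx]
sub_y = [big_y[i] for i in sample_idx]
mse = cv_mse(sub_x, sub_y)
print(f"5-fold CV mean squared error on the subsample: {mse:.3f}")
```

Cross-validated error answers a different question than a p-value: it shows whether the model predicts held-out data any better than a trivial baseline, which does not become easier just because n is large.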