Dear community,
I am running a five-year study using survey data, and the weighted number of patients included is about 30 million in total.
I will be assessing the relationship between two variables. At what point does the number of observations become so large that everything starts to show statistical significance (as I have heard can happen)? I am subsetting the observations by a classification variable with about 20 levels, so the counts will ultimately be smaller, but I would still like to know how to interpret results when the data run into the millions, or whether this issue only appears once you get into the billions.
Thanks
I'm not a statistician, so I can't provide a definitive answer to your question. In fact, I'm only responding for two reasons: (1) no one has responded yet after two hours, and (2) this will ensure that I get to see the other responses you will get.
A nice, easy-to-read blog provides part of your answer: https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiW4qfbwMjXA...
However, when is big too big? I think it depends on a number of factors, including the type of analysis, practical significance, and the power of the test. In answer to your question about millions vs. billions, I think the threshold is typically much, much smaller (hundreds or thousands, say), but it depends on the combination of the variability in your data and how large a difference would have to be before you would consider it practically (not just statistically) significant.
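To illustrate that point, here is a minimal simulation sketch in Python (not SAS, and with made-up numbers): a fixed difference of 0.01 standard deviations between two groups is not detectable at modest sample sizes, but it becomes "highly significant" once n reaches the millions, even though nobody would consider it practically important.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A deliberately tiny, practically meaningless difference between two groups.
true_difference = 0.01  # in standard-deviation units

for n in [100, 10_000, 1_000_000, 10_000_000]:
    group_a = rng.normal(loc=0.0, scale=1.0, size=n)
    group_b = rng.normal(loc=true_difference, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n per group = {n:>10,}   p-value = {p_value:.4g}")

# The p-value collapses toward zero as n grows, even though the underlying
# difference (0.01 SD) stays the same, trivially small effect throughout.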
Regardless, I completed my statistical studies too, too many years ago, so I am more interested in current thinking (my second reason for responding).
Art, CEO, AnalystFinder.com
As Art said, the answer depends on many things. It's impossible to give a definitive answer. However, if you have 30 million records, there's a good chance you are in that situation.
Just as a point of reference, people who run national political surveys (in the United States) poll about 1,500 people to get the precision they need.
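As a rough back-of-the-envelope calculation (assuming a simple random sample and a proportion near 50%), the 95% margin of error at n = 1,500 is about 1.96 × sqrt(0.5 × 0.5 / 1500) ≈ 0.025, i.e. roughly ±2.5 percentage points, and cutting that margin in half would require roughly four times as many respondents.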
You probably want to come up with some estimate of practical significance if possible (for example, deciding that a difference of 5 units is of no importance to anyone). If the confidence intervals are an order of magnitude narrower than the practically significant difference, then you have reached the point where the sample is larger than you need.
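A minimal sketch of that check, in Python and with entirely hypothetical numbers (the 5-unit threshold, the estimated difference, and the standard error are all made up for illustration):

# Hypothetical numbers purely for illustration.
practical_threshold = 5.0     # smallest difference anyone would care about
observed_difference = 0.8     # estimated difference between the groups
standard_error = 0.02         # standard error of that estimate at very large n

half_width = 1.96 * standard_error   # 95% confidence interval half-width
ci_low = observed_difference - half_width
ci_high = observed_difference + half_width

print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
if half_width < practical_threshold / 10:
    print("The interval is an order of magnitude narrower than the practical")
    print("threshold, so statistical significance alone tells you very little.")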
Having lots of data is a blessing.
There are not many reasons I can think of for subsampling your data AFTER it is collected.
1) To reduce computing cost
2) To assess model performance (data is divided into training and test subsets)
3) To avoid pseudoreplication by excluding units that might be correlated (in time, space, or otherwise)
4) To evaluate the small sample performance of an estimator
Otherwise, having more data simply shrinks your confidence intervals, often to the point where zero is no longer included even when the effect is too small to matter.
If there are too many observations, then everything looks significant in a statistical model. I would suggest randomly sampling from the data and using cross-validation to assess the statistical model.
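A minimal sketch of that suggestion, in Python with scikit-learn rather than SAS, and with simulated data standing in for the survey extract (the model, sample sizes, and variables are all made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulated stand-in for a very large survey extract.
n_total = 1_000_000
X_full = rng.normal(size=(n_total, 5))
y_full = (X_full[:, 0] + 0.01 * X_full[:, 1] + rng.normal(size=n_total) > 0).astype(int)

# Randomly subsample a manageable number of records.
keep = rng.choice(n_total, size=10_000, replace=False)
X, y = X_full[keep], y_full[keep]

# 5-fold cross-validation judges the model on held-out data rather than
# relying on p-values computed from all of the records at once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")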