I suppose the answer depends on how fast your computer is and how long you want to wait for the answer. I would go ahead and do the regression on the entire population; in other words, the decision has nothing to do with statistics.
10,000 records doesn't sound big to me at all; I'm sure I have done bigger.
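For reference, fitting the regression on the entire table in SAS is a single PROC REG step. A minimal sketch, assuming a data set WORK.POPULATION with response Y and predictors X1-X3 (hypothetical names, not from this thread):

```sas
/* Regression on the full set of records; data set and variable
   names are placeholders for your own. */
proc reg data=work.population;
   model y = x1 x2 x3;
run;
quit;
```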
Paige Miller
One thing to note is that most of the regression procedures will, by default, exclude from the analysis any record that has a missing value for any of the variables on the MODEL statement. So the regression may actually use far fewer records than you expect if you have many missing values.
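A quick way to see how much that listwise deletion will cost is to count missing values before fitting. A sketch, reusing the same hypothetical variable names as above:

```sas
/* NMISS shows how many records are missing for each variable;
   PROC REG itself also reports "Number of Observations Read" versus
   "Number of Observations Used" in its output. */
proc means data=work.population n nmiss;
   var y x1 x2 x3;
run;
```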
@Ps8813 wrote:
If we can process the full population, then why do we take a sample?
1. Because historically that wasn't possible.
2. Because when you have a lot of data you're working on different principles; at that point everything will be statistically significant, even when the effect size is too small to matter.
3. Because having full-population data is rare. If I measure the hoof length of all the zebras in my zoo, that isn't the full population of zebras, it's a sample, but it is the full population for my zoo. Terminology is important here.
If you're in the rare case where you do have the full population, and it's a manageable size, e.g. 10,000 records, then going ahead and using the full population makes sense.
Hi,
It depends on the objective of the study. If you are building predictive models, then it is usually suggested to split the data into training and test data sets: the training data is used to fit the model, while the test data is used to check how stable the model is on records it has not seen. On the other hand, if the goal is to draw inferences, then the full data set can be used.
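For example, a simple 70/30 training/test split can be drawn with PROC SURVEYSELECT. A rough sketch; the data set names and split ratio here are assumptions, not anything from this thread:

```sas
/* Flag 70% of the records as the training sample. */
proc surveyselect data=work.population out=work.split
                  method=srs samprate=0.70 outall seed=12345;
run;

/* OUTALL keeps every record and adds a SELECTED flag
   (1 = training, 0 = test), so the split is just a DATA step. */
data work.train work.test;
   set work.split;
   if selected = 1 then output work.train;
   else output work.test;
run;
```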
Computation speed and available memory are not an issue anymore. But remember that sampling is usually performed before measurement, and measurement often has a high cost per unit. That's why it is often preferable to measure only a sample of the population.
Once we have the data (sample or full population), sampling is mostly useful for validating the structure or the performance of statistical models. Also, some statistical estimation methods (e.g. bootstrap) rely entirely on repeated sampling.
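As an illustration of that last point, bootstrap resamples are commonly drawn in SAS with PROC SURVEYSELECT. A sketch with assumed data set names and an arbitrary number of replicates:

```sas
/* METHOD=URS samples with replacement, SAMPRATE=1 makes each resample
   the same size as the original data, OUTHITS writes one record per
   selection, and REPS= creates the replicate samples (identified by
   the automatic Replicate variable). */
proc surveyselect data=work.population out=work.boot
                  method=urs samprate=1 outhits
                  reps=500 seed=98765;
run;

/* The model of interest is then refit per resample, e.g. with
   "by replicate;" in the modeling procedure, to build the
   bootstrap distribution of the estimates. */
```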