Ps8813
Fluorite | Level 6
If I have population data of 10,000 observations and I want to perform linear regression, do I still need to take a sample, or can I run it on the full population? If we can process the full population, why do we take samples?
5 REPLIES
PaigeMiller
Diamond | Level 26

I suppose the answer depends on how fast your computer is and how long you want to wait for the answer. I would go ahead and do the regression on the entire population; in other words, the decision has nothing to do with statistics.

 

10,000 records doesn't sound big to me at all, I'm sure I have done bigger.

--
Paige Miller
ballardw
Super User

One thing to note is that most of the regression procedures will, by default, exclude from the analysis any record that has a missing value for any of the variables on the MODEL statement. So the regression may actually use many fewer records than you expect if you have many missing values.
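To make the point about missing values concrete, here is a minimal Python sketch of complete-case filtering. It only mimics the default behaviour described above; it is not how any SAS procedure is implemented, and the records and variable names (`y`, `x1`, `x2`) are made up for illustration.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"y": 3.1, "x1": 1.0, "x2": 2.0},
    {"y": None, "x1": 2.0, "x2": 1.5},   # response missing
    {"y": 4.8, "x1": 3.0, "x2": None},   # one predictor missing
    {"y": 6.2, "x1": 4.0, "x2": 3.5},
    {"y": 7.9, "x1": 5.0, "x2": 4.0},
]
model_vars = ["y", "x1", "x2"]  # assumed response and predictors


def complete_cases(records, variables):
    """Keep only records with no missing value in any model variable."""
    return [r for r in records if all(r[v] is not None for v in variables)]


used = complete_cases(rows, model_vars)
n_read, n_used = len(rows), len(used)
print(f"read {n_read} records, {n_used} usable for the fit")
```

Comparing the number of records read with the number actually used is a quick check worth doing before interpreting any fit.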

Reeza
Super User

@Ps8813 wrote:
If we can process the full population, why do we take samples?

1. Because historically that wasn't possible.

2. Because when you have a lot of data you're working on different principles: almost everything will be statistically significant at that point, even when the effect size is too small to matter in practice.

3. Because having full population data is rare. So if I measure the hoof length of all the zebras in my zoo, that isn't the full population, it's a sample, but it's my full population. Terminology is important here.

 

If you're in the rare case where you do have the full population, and it's a manageable size, i.e., 10,000, then going ahead and using the full population makes sense.
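Point 2 in the list above can be illustrated with a small sketch. For simple linear regression, the t statistic for the slope can be written in terms of the sample correlation r and the sample size n as t = r·sqrt(n − 2) / sqrt(1 − r²). The value r = 0.03 below is an assumption chosen to represent a negligible effect, and 1.96 is only the approximate large-sample 5% cutoff.

```python
import math


def t_statistic(r, n):
    """t statistic for a zero-slope test in simple linear regression,
    given the sample correlation r and the sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = 0.03  # assumed tiny correlation: explains 0.09% of the variance
results = {n: t_statistic(r, n) for n in (100, 1_000, 10_000, 1_000_000)}

for n, t in results.items():
    flag = "significant" if abs(t) > 1.96 else "not significant"
    print(f"n={n:>9,}  t={t:6.2f}  {flag} at roughly the 5% level")
```

The effect size never changes here, only n does, which is why looking at the estimate itself (and not just the p-value) matters with large data sets.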

stat_sas
Ammonite | Level 13

Hi,

 

It depends on the objective of the study. If you are building predictive models, then it is usually suggested to split the data into training and test data sets. Training data is used to train the model, while test data is used to see how stable the model is. On the other hand, if the goal is to draw inferences, then the full data set can be used.
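As a rough illustration of the training/test idea, here is a Python sketch on synthetic data. The 10,000 generated points, the 70/30 split, and the helper names `fit_line` and `rmse` are all assumptions for the example, not something taken from the original question.

```python
import random

rng = random.Random(42)  # fixed seed so the split is reproducible

# Synthetic stand-in for 10,000 observations: y = 2 + 0.5*x + noise.
data = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 + 0.5 * x + rng.gauss(0, 1)))

rng.shuffle(data)
cut = int(0.7 * len(data))  # assumed 70/30 split
train, test = data[:cut], data[cut:]


def fit_line(pairs):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(pairs, intercept, slope):
    """Root mean squared prediction error on the given pairs."""
    sq = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    return (sq / len(pairs)) ** 0.5


b0, b1 = fit_line(train)  # fit on the training part only
train_rmse = rmse(train, b0, b1)
test_rmse = rmse(test, b0, b1)  # evaluate on data the fit never saw
print(f"slope={b1:.3f}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```

If the test error is much worse than the training error, the model is probably not stable; if the two are close, that is some reassurance.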

PGStats
Opal | Level 21

Computation speed or available memory is not an issue anymore. But remember that sampling is usually performed before measurement and measurement often has a high cost per unit. That's why it is often preferable to measure only a sample of the population.
Once we have the data (sample or full population), sampling is mostly useful for validating the structure or the performance of statistical models. Also, some statistical estimation methods (e.g. bootstrap) rely entirely on repeated sampling.
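For the bootstrap mentioned above, a minimal Python sketch might look like the following. It resamples observations with replacement and refits the slope each time; the synthetic sample, the 500 replicates, and the helper name `slope` are assumptions for illustration only.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed for reproducibility

# Synthetic sample of 200 points: y = 1 + 2*x + noise.
sample = []
for _ in range(200):
    x = rng.uniform(0, 5)
    sample.append((x, 1.0 + 2.0 * x + rng.gauss(0, 1)))


def slope(pairs):
    """Least-squares slope for one predictor."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


# Bootstrap: draw n observations with replacement, refit, repeat.
boot_slopes = []
for _ in range(500):
    resample = rng.choices(sample, k=len(sample))
    boot_slopes.append(slope(resample))

estimate = slope(sample)
boot_se = statistics.stdev(boot_slopes)  # bootstrap standard error of the slope
print(f"slope={estimate:.3f}  bootstrap SE={boot_se:.3f}")
```

The spread of the refitted slopes serves as an estimate of the sampling variability, with no formula for the standard error needed.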

PG

