I suppose the answer depends on how fast your computer is and how long you want to wait for the answer. I would go ahead and do the regression on the entire population; in other words, the decision has nothing to do with statistics.
10,000 records doesn't sound big to me at all; I'm sure I have done bigger.
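For reference, fitting the regression on the entire table in SAS is a single PROC REG step. A minimal sketch, assuming a data set WORK.POPULATION with response Y and predictors X1-X3 (hypothetical names, not from this thread):

```sas
/* Regression on the full set of records; data set and variable
   names are placeholders for your own. */
proc reg data=work.population;
   model y = x1 x2 x3;
run;
quit;
```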
Paige Miller
One thing to note is that most of the regression procedures will, by default, exclude from the analysis any record that has a missing value for any of the variables on the MODEL statement. So the regression may actually use far fewer records than you expect if you have many missing values.
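A quick way to see how much that listwise deletion will cost is to count missing values before fitting. A sketch, reusing the same hypothetical variable names as above:

```sas
/* NMISS shows how many records are missing for each variable;
   PROC REG itself also reports "Number of Observations Read" versus
   "Number of Observations Used" in its output. */
proc means data=work.population n nmiss;
   var y x1 x2 x3;
run;
```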
@Ps8813 wrote:
If we can process the full population, then why do we take a sample?
1. Because historically that wasn't possible.
2. Because when you have a lot of data you're working on different principles; at that point everything will be statistically significant, even when the effect size is too small to matter.
3. Because having full-population data is rare. If I measure the hoof length of all the zebras in my zoo, that isn't the full population of zebras, it's a sample, but it is the full population for my zoo. Terminology is important here.
If you're in the rare case where you do have the full population, and it's a manageable size, e.g. 10,000 records, then going ahead and using the full population makes sense.
Hi,
It depends on the objective of the study. If you are building predictive models, then it is usually suggested to split the data into training and test data sets: the training data is used to fit the model, while the test data is used to check how stable the model is on records it has not seen. On the other hand, if the goal is to draw inferences, then the full data set can be used.
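For example, a simple 70/30 training/test split can be drawn with PROC SURVEYSELECT. A rough sketch; the data set names and split ratio here are assumptions, not anything from this thread:

```sas
/* Flag 70% of the records as the training sample. */
proc surveyselect data=work.population out=work.split
                  method=srs samprate=0.70 outall seed=12345;
run;

/* OUTALL keeps every record and adds a SELECTED flag
   (1 = training, 0 = test), so the split is just a DATA step. */
data work.train work.test;
   set work.split;
   if selected = 1 then output work.train;
   else output work.test;
run;
```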
Computation speed and available memory are not an issue anymore. But remember that sampling is usually performed before measurement, and measurement often has a high cost per unit. That's why it is often preferable to measure only a sample of the population.
Once we have the data (sample or full population), sampling is mostly useful for validating the structure or the performance of statistical models. Also, some statistical estimation methods (e.g. bootstrap) rely entirely on repeated sampling.
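As an illustration of that last point, bootstrap resamples are commonly drawn in SAS with PROC SURVEYSELECT. A sketch with assumed data set names and an arbitrary number of replicates:

```sas
/* METHOD=URS samples with replacement, SAMPRATE=1 makes each resample
   the same size as the original data, OUTHITS writes one record per
   selection, and REPS= creates the replicate samples (identified by
   the automatic Replicate variable). */
proc surveyselect data=work.population out=work.boot
                  method=urs samprate=1 outhits
                  reps=500 seed=98765;
run;

/* The model of interest is then refit per resample, e.g. with
   "by replicate;" in the modeling procedure, to build the
   bootstrap distribution of the estimates. */
```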