Solved: Re: Inferring data about a large population from a small sample

righcoastmike · Posted 06-27-2018 09:34 PM

Hi all,

I have a question that might be as much about stats as it is about SAS programming. I'm hoping that you folks can help.

I have a sample of 35 records from a population of 9635 that looks like this:

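Data have;
infile datalines expandtabs;        /* fields are separated by tabs */
Input Unit Repair_cost :comma12.;   /* comma informat reads values such as 10,277.00 */
datalines;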
1	10,277.00
2	33,615.00
3	23,442.00
4	11,220.00
5	41,321.00
6	40,801.00
7	20,896.00
8	44,753.00
9	28,659.00
10	19,753.00
11	28,760.00
12	24,537.00
13	20,536.00
14	20,959.00
15	5,693.00
16	8,290.00
17	28,715.00
18	41,550.00
19	18,459.00
20	49,197.00
21	28,955.00
22	46,149.00
23	25,273.00
24	45,867.00
25	24,716.00
26	43,519.00
27	27,884.00
28	37,714.00
29	8,001.00
30	42,151.00
31	43,197.00
32	27,245.00
33	31,736.00
34	9,503.00
35	14,946.00
;
run;

I figure I can calculate the SD and 95% confidence limits for the sample by using:

ods select BasicIntervals;
proc univariate data=have cibasic;
   var Repair_cost;
run;

That should give me the mean repair cost and 95% confidence interval for an individual unit. My question is: can I then multiply the mean and the upper and lower limits by the total population (9635) to get an expected total repair cost and associated confidence limits? It makes intuitive sense to me, but I've found that in stats my intuition isn't always correct.

If I can't do it this way, can someone suggest the best way to get a predicted total repair cost and associated confidence interval for the entire population of 9635 based on the sample of 35 I have above?

Any help is much appreciated.

Thanks so much

Mike

Reeza · Posted 06-27-2018 09:48 PM

Is there a guarantee that all units need to be repaired at some point? This is what I would call a back of the napkin type estimate....

righcoastmike · Posted 06-27-2018 10:08 PM

These numbers are the expected costs for each unit in one year, so yes, I suppose they could be considered "guaranteed". We have this data projected out for 10 years (so 10 tables like the one I posted, one for every year from 2018 to 2027). Basically, assuming that the estimates are correct, we are looking for an estimated total repair cost in each year, as well as the total over 10 years, with a 95% confidence interval.

Not sure if that helps or confuses, but thanks for having a think about this with me.

Mike

Ksharp · Posted 06-27-2018 10:16 PM

No, you can't. It depends on how the sample was drawn from the population.

Calling @Rick_SAS. Maybe he can shed some light.

righcoastmike · Posted 06-27-2018 10:23 PM

If it helps, my sample should be considered as a simple random sample.

Ksharp · Posted 06-27-2018 10:37 PM

If I'm right, then your sample estimator is BLUE (best linear unbiased estimator), i.e. the sample mean should be almost the same as the population mean. The same goes for the mean's confidence limits.

righcoastmike · Posted 06-27-2018 10:40 PM

******UPDATE********* I think proc surveymeans might be what I am looking for, but I'm still not sure how to get an expected total repair cost with 95% confidence intervals for the entire population (9635 units), based on the data in the sample (35 units).

Reeza · Posted 06-27-2018 11:33 PM

If it's a simple random sample you can use the method you initially suggested.

If it was a sample where the machines do not reflect your population of machines and each one has a specific weight attached to it to match the total population then that would be weighted analysis.
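
A minimal sketch of how this could look in PROC SURVEYMEANS, assuming the 35 units are a simple random sample drawn without replacement from the 9635 units. The dataset name have_w and the variable SamplingWeight are illustrative names, not from the original posts. Each sampled unit gets the weight N/n so that the weighted sum estimates the population total, and TOTAL= supplies the population size for the finite population correction.

```sas
/* Assumption: simple random sample of n = 35 from N = 9635 units. */
data have_w;
   set have;
   SamplingWeight = 9635 / 35;   /* each sampled unit represents N/n units */
run;

/* MEAN/CLM: per-unit mean and its 95% limits.          */
/* SUM/STD/CLSUM: estimated total, its standard error,  */
/* and its 95% confidence limits.                       */
proc surveymeans data=have_w total=9635 mean clm sum std clsum;
   var Repair_cost;
   weight SamplingWeight;
run;
```

If the units were not sampled with equal probability, the weights would have to reflect the actual design, which is the weighted analysis described above.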

righcoastmike · Posted 06-28-2018 07:28 AM

Thanks Reeza, much appreciated.

Mike

Rick_SAS · Posted 06-28-2018 07:48 AM

This is an interesting question. I think the confidence interval will depend on the assumed distribution of the prices. For example, the sum of IID exponential random variables has a gamma distribution. The sum of IID normal variables is normal.

Assuming a simple random sample, the expected sum is N*XBar, where XBar is the sample mean and N=9635. However, I don't think multiplying the lower/upper limits by N gives the correct CI. I think that interval is too conservative (that is, wider than it needs to be). If you want a ballpark figure, you can use it.
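
For comparison, the textbook formulas for a simple random sample drawn without replacement can be written out as below. This is a sketch under that assumption only; the symbols n (sample size, 35), s (sample standard deviation), and T-hat (estimated total) are notation introduced here, not taken from the posts.

```latex
\hat{T} = N\,\bar{X}, \qquad
\widehat{\mathrm{SE}}(\hat{T}) = N\,\frac{s}{\sqrt{n}}\,\sqrt{1-\frac{n}{N}}, \qquad
\hat{T} \;\pm\; t_{0.975,\,n-1}\,\widehat{\mathrm{SE}}(\hat{T})
```

Multiplying the confidence limits for the mean by N gives the same interval with the factor sqrt(1 - n/N) left out. With n=35 and N=9635 that factor is about 0.998, so under these assumptions the two intervals differ by roughly 0.2%.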

righcoastmike · Posted 06-28-2018 07:52 AM

Thanks Rick,

I would rather be too conservative as opposed to not, and for now I think a ballpark would work. At this point though, I'm just curious about how one would go about calculating the CI for the total properly. I'll keep looking and post a response here if I figure anything out.
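
One way this could be computed by hand, sketched under the assumption of a simple random sample without replacement: take the sample size, mean, and standard error from PROC MEANS, scale by the population size, and apply the finite population correction. The dataset names stats and total_ci and the variable names below are illustrative, and the code assumes the dataset have has been read in with numeric Repair_cost values.

```sas
/* Sample size, mean, and standard error of the mean */
proc means data=have noprint;
   var Repair_cost;
   output out=stats n=n mean=xbar stderr=se_mean;
run;

data total_ci;
   set stats;
   PopSize  = 9635;                      /* population size from the post  */
   fpc      = sqrt(1 - n / PopSize);     /* finite population correction   */
   t_crit   = tinv(0.975, n - 1);        /* two-sided 95% t quantile       */
   total    = PopSize * xbar;            /* estimated total repair cost    */
   se_total = PopSize * se_mean * fpc;   /* standard error of the total    */
   lower    = total - t_crit * se_total; /* lower 95% limit for the total  */
   upper    = total + t_crit * se_total; /* upper 95% limit for the total  */
   format total se_total lower upper comma18.2;
run;

proc print data=total_ci noobs;
   var n xbar total se_total lower upper;
run;
```

Setting fpc to 1 reproduces the ballpark interval obtained by multiplying the limits for the mean by 9635.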

Reeza · Posted 06-28-2018 11:10 AM

I think it becomes a prediction interval, not a confidence interval, and that would be wider than the confidence interval.
