righcoastmike
Quartz | Level 8

Hi all, this is an update on an earlier thread I started here.

 

I'm working on the following problem: 

 

A real estate management company has 9,500 units (apartments) that they are responsible for. Because the company only has one inspector, and the apartments are spread out, they only have the resources to physically check 35 apartments a year (the inspector is pretty slow, apparently). They are hoping to use this sample to estimate how much $$ they should budget for maintenance across all 9,500 apartments, with a 95% confidence interval. The sample of 35 units can be considered a simple random sample.

 

The dataset looks like this: 

/* Repair costs (in dollars) for the 35 inspected units */
data repaircost;
input Unit Repair_cost;
datalines;
1 10277.00
2 33615.00
3 23442.00
4 11220.00
5 41321.00
6 40801.00
7 20896.00
8 44753.00
9 28659.00
10 19753.00
11 28760.00
12 24537.00
13 20536.00
14 20959.00
15 5693.00
16 8290.00
17 28715.00
18 41550.00
19 18459.00
20 49197.00
21 28955.00
22 46149.00
23 25273.00
24 45867.00
25 24716.00
26 43519.00
27 27884.00
28 37714.00
29 8001.00
30 42151.00
31 43197.00
32 27245.00
33 31736.00
34 9503.00
35 14946.00
;
run;

A number of different solutions have been suggested, such as calculating the mean cost and upper/lower CI for one apartment and multiplying those numbers by 9,500. That gives me a ballpark, but it feels a little too "back of the napkin" to me (feel free to correct me on that; I would love it if that were the solution).
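For reference, here is my rough understanding of what that simple approach would look like in SAS (just a sketch, untested; the 95% limits are the default confidence limits from PROC MEANS, and the dataset names unit_ci/total_ci are my own):

/* Sketch: 95% CI for the mean cost of one unit, then scaled up to 9,500 units */
proc means data=repaircost n mean lclm uclm alpha=0.05;
   var Repair_cost;
   output out=unit_ci mean=mean_cost lclm=lcl_cost uclm=ucl_cost;
run;

data total_ci;
   set unit_ci;
   total_mean = 9500 * mean_cost;   /* point estimate of the total */
   total_lcl  = 9500 * lcl_cost;    /* lower 95% limit, scaled up */
   total_ucl  = 9500 * ucl_cost;    /* upper 95% limit, scaled up */
run;

proc print data=total_ci noobs;
   var total_mean total_lcl total_ucl;
run;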

 

The most recent suggestion I got was to bootstrap the sample of 35 with replacement to create a new sample of 9,500 and sum the costs. I would do this, say, 10,000 times, then order the sums in ascending order; the 250th and 9,750th values would represent a 95% percentile confidence interval.

 

Any help on how to expand my sample from 35 to 9,500, get the total sum of the costs, and then do it another 9,999 times would be much appreciated!
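In case it helps, here is my rough, completely untested stab at it, using PROC SURVEYSELECT so I'm not writing a 10,000-iteration macro loop (the dataset names boot/totals/boot_ci and the seed are just my own placeholders; please correct anything I've misunderstood):

/* Sketch: 10,000 bootstrap resamples of size 9,500, drawn with replacement */
proc surveyselect data=repaircost out=boot
      method=urs        /* unrestricted random sampling = with replacement */
      sampsize=9500     /* each resample stands in for all 9,500 units */
      reps=10000        /* number of bootstrap replicates */
      seed=12345;
run;

/* Total cost per replicate: each sampled unit counts NumberHits times */
data boot;
   set boot;
   cost_hit = Repair_cost * NumberHits;
run;

proc means data=boot noprint;
   by Replicate;
   var cost_hit;
   output out=totals sum=total_cost;
run;

/* The 2.5th and 97.5th percentiles of the 10,000 totals give the 95% interval */
proc univariate data=totals noprint;
   var total_cost;
   output out=boot_ci mean=mean_total pctlpts=2.5 97.5 pctlpre=pct;
run;

proc print data=boot_ci noobs;
run;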

 

Thanks so much, as always I'm amazed at how supportive this community is. 

 

*other potential solutions are also more than welcome*

 

Mike

6 REPLIES
FreelanceReinh
Jade | Level 19

Hi @righcoastmike,

 

Regarding bootstrap methods with SAS, I found this paper (by David Cassell) interesting.

 

I had followed the other thread a bit and was actually quite confident about the quality of the suggested CI (having the classic reference "Sampling Techniques" by W. G. Cochran at hand). I think the major risk would be having some extreme outliers in the population, which may or may not show up in the sample. So another approach would be to simulate such populations.
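Just to illustrate what I mean (the distribution and all parameter values below are invented for illustration, not estimated from your data), one could build an artificial population of 9,500 units containing a few very expensive repairs and check how often the scaled-up 95% CI from samples of 35 actually covers the true total:

/* Sketch: artificial population with rare, very expensive repairs */
data simpop;
   call streaminit(2023);
   do Unit = 1 to 9500;
      if rand('uniform') < 0.01 then
         Repair_cost = 200000 + 100000*rand('uniform'); /* rare extreme outliers */
      else
         Repair_cost = exp(10 + 0.5*rand('normal'));    /* "typical" repairs */
      output;
   end;
run;

/* 1,000 independent simple random samples of size 35 */
proc surveyselect data=simpop out=srs method=srs sampsize=35 reps=1000 seed=2023;
run;

/* 95% CI for the mean cost in each sample */
proc means data=srs noprint;
   by Replicate;
   var Repair_cost;
   output out=cis lclm=lcl uclm=ucl;
run;

/* How often does the scaled-up CI cover the true population total? */
proc sql noprint;
   select sum(Repair_cost) into :true_total from simpop;
quit;

proc sql;
   select mean(case when &true_total between 9500*lcl and 9500*ucl
                    then 1 else 0 end) as coverage
      from cis;
quit;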

 

If I had to examine the project, I would also scrutinize the assumption that the sample "can be considered a simple random sample" and check whether, for example, easily accessible apartments had a higher probability of being included in the sample.

 

 

righcoastmike
Quartz | Level 8

Thanks, Freelance. I'll take a look at that paper; I'm sure it will be helpful.

 

After some investigation and help from a friendly stats person, I managed to get the bootstrapping code working in R (I know, not SAS, but beggars can't be choosers), and it boosted my confidence in the numbers.

 

Here's how they compare: 

 

The "calculate the mean for 1 unit and multiply by total number of units" method gave me: 

Mean total repair cost: 260,392,580.66
Lower CL: 221,214,943
Upper CL: 299,570,217

 

While the bootstrapping method came up with this: 

 

Mean total repair cost: 260,400,000
Lower CL: 258,033,216
Upper CL: 262,853,966

 

So the bootstrapping method gives a much tighter CI, but in this case conservative is OK.

 

Thanks again for thinking through this with me everyone. It's been really interesting. 

 

Mike 

 

 

 

Reeza
Super User

Do you have any data on which apartments were renovated and when? There could be a survival-type model to determine whether and when an event will happen, and then a second stage to estimate how much cost would be involved. Or a logistic regression to predict the probability of an event.

 

This is referred to as two-stage regression.
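Purely as an illustration of the idea (the dataset unit_history and the variables had_repair, building_age, and yrs_since_reno are hypothetical; nothing like them exists in this thread), the two stages might look like:

/* Stage 1: probability that a unit needs a repair at all (hypothetical data) */
proc logistic data=unit_history;
   model had_repair(event='1') = building_age yrs_since_reno;
run;

/* Stage 2: expected cost, modelled only on units that actually had a repair */
proc genmod data=unit_history;
   where had_repair = 1;
   model Repair_cost = building_age yrs_since_reno / dist=gamma link=log;
run;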

 

The other option would be what you suggested, which is basically a simulation. If you want to simulate data, I would strongly suggest reading the 'Don't Be Loopy' paper by David Cassell.

 

 

Reeza
Super User
Age of buildings is probably a big factor as well.
righcoastmike
Quartz | Level 8

Hi Reeza, 

 

I'm not sure how the exact numbers were calculated; I've just got the totals. I agree, though, there are definitely a bunch of different variables that need to be taken into account. I think the bootstrapping method is good enough for now. Thanks so much for all your help!

 

Mike 

 

Reeza
Super User
I think the bootstrap only tells you that the data is normally distributed; I don't think it gets rid of any of the initial concerns with the methodology. I would wait for PGStats or Rick to comment, though.
