righcoastmike
Quartz | Level 8

Hi all, this is an update on an earlier thread I started here.

 

I'm working on the following problem: 

 

A real estate management company has 9,500 units (apartments) that they are responsible for. Because the company only has one inspector, and the apartments are spread out, they only have the resources to physically check 35 apartments a year (the inspector is pretty slow, apparently). They are hoping to use this sample to estimate how much $$ they should budget for maintenance across all 9,500 apartments, with a 95% confidence interval. The sample of 35 units can be considered a simple random sample.

 

The dataset looks like this: 

/* Repair costs (in dollars) for the 35 inspected units */
data repaircost;
input Unit Repair_cost;
datalines;
1 10277.00
2 33615.00
3 23442.00
4 11220.00
5 41321.00
6 40801.00
7 20896.00
8 44753.00
9 28659.00
10 19753.00
11 28760.00
12 24537.00
13 20536.00
14 20959.00
15 5693.00
16 8290.00
17 28715.00
18 41550.00
19 18459.00
20 49197.00
21 28955.00
22 46149.00
23 25273.00
24 45867.00
25 24716.00
26 43519.00
27 27884.00
28 37714.00
29 8001.00
30 42151.00
31 43197.00
32 27245.00
33 31736.00
34 9503.00
35 14946.00
;
run;

A number of different solutions have been suggested, such as calculating the mean cost and upper/lower CI for one apartment and multiplying those numbers by 9,500. That gives me a ballpark, but it feels a little too "back of the napkin" to me (feel free to correct me on that; I would love it if that were the solution).
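For reference, here is my rough understanding of what that simple approach would look like in SAS (just a sketch, untested; the 95% limits are the default confidence limits from PROC MEANS, and the dataset names unit_ci/total_ci are my own):

/* Sketch: 95% CI for the mean cost of one unit, then scaled up to 9,500 units */
proc means data=repaircost n mean lclm uclm alpha=0.05;
   var Repair_cost;
   output out=unit_ci mean=mean_cost lclm=lcl_cost uclm=ucl_cost;
run;

data total_ci;
   set unit_ci;
   total_mean = 9500 * mean_cost;   /* point estimate of the total */
   total_lcl  = 9500 * lcl_cost;    /* lower 95% limit, scaled up */
   total_ucl  = 9500 * ucl_cost;    /* upper 95% limit, scaled up */
run;

proc print data=total_ci noobs;
   var total_mean total_lcl total_ucl;
run;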

 

The most recent suggestion I got was to bootstrap the sample of 35 with replacement to create a new sample of 9,500 and sum the costs. I would do this, say, 10,000 times, then order the sums in ascending order; the 250th and 9,750th values would represent a 95% percentile confidence interval.

 

Any help on how to expand my sample from 35 to 9,500, get the total sum of the costs, and then do it another 9,999 times would be much appreciated!
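In case it helps, here is my rough, completely untested stab at it, using PROC SURVEYSELECT so I'm not writing a 10,000-iteration macro loop (the dataset names boot/totals/boot_ci and the seed are just my own placeholders; please correct anything I've misunderstood):

/* Sketch: 10,000 bootstrap resamples of size 9,500, drawn with replacement */
proc surveyselect data=repaircost out=boot
      method=urs        /* unrestricted random sampling = with replacement */
      sampsize=9500     /* each resample stands in for all 9,500 units */
      reps=10000        /* number of bootstrap replicates */
      seed=12345;
run;

/* Total cost per replicate: each sampled unit counts NumberHits times */
data boot;
   set boot;
   cost_hit = Repair_cost * NumberHits;
run;

proc means data=boot noprint;
   by Replicate;
   var cost_hit;
   output out=totals sum=total_cost;
run;

/* The 2.5th and 97.5th percentiles of the 10,000 totals give the 95% interval */
proc univariate data=totals noprint;
   var total_cost;
   output out=boot_ci mean=mean_total pctlpts=2.5 97.5 pctlpre=pct;
run;

proc print data=boot_ci noobs;
run;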

 

Thanks so much, as always I'm amazed at how supportive this community is. 

 

*other potential solutions are also more than welcome*

 

Mike

6 REPLIES
FreelanceReinh
Jade | Level 19

Hi @righcoastmike,

 

Regarding bootstrap methods with SAS, I found this paper (by David Cassell) interesting.

 

I had followed the other thread a bit and was actually quite confident about the quality of the suggested CI (having the classic reference "Sampling Techniques" by W. G. Cochran at hand). I think the major risk would be having some extreme outliers in the population, which may or may not show up in the sample. So another approach would be to simulate such populations.
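Just to illustrate what I mean (the distribution and all parameter values below are invented for illustration, not estimated from your data), one could build an artificial population of 9,500 units containing a few very expensive repairs and check how often the scaled-up 95% CI from samples of 35 actually covers the true total:

/* Sketch: artificial population with rare, very expensive repairs */
data simpop;
   call streaminit(2023);
   do Unit = 1 to 9500;
      if rand('uniform') < 0.01 then
         Repair_cost = 200000 + 100000*rand('uniform'); /* rare extreme outliers */
      else
         Repair_cost = exp(10 + 0.5*rand('normal'));    /* "typical" repairs */
      output;
   end;
run;

/* 1,000 independent simple random samples of size 35 */
proc surveyselect data=simpop out=srs method=srs sampsize=35 reps=1000 seed=2023;
run;

/* 95% CI for the mean cost in each sample */
proc means data=srs noprint;
   by Replicate;
   var Repair_cost;
   output out=cis lclm=lcl uclm=ucl;
run;

/* How often does the scaled-up CI cover the true population total? */
proc sql noprint;
   select sum(Repair_cost) into :true_total from simpop;
quit;

proc sql;
   select mean(case when &true_total between 9500*lcl and 9500*ucl
                    then 1 else 0 end) as coverage
      from cis;
quit;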

 

If I had to examine the project, I would also scrutinize the assumption that the sample "can be considered a simple random sample" and check whether, for example, easily accessible apartments had a higher probability of being included in the sample.

 

 

righcoastmike
Quartz | Level 8

Thanks, Freelance. I'll take a look at that paper; I'm sure it will be helpful.

 

After some investigation and help from a friendly stats person, I managed to get the bootstrapping code working in R (I know, not SAS, but beggars can't be choosers), and it boosted my confidence in the numbers.

 

Here's how they compare: 

 

The "calculate the mean for 1 unit and multiply by total number of units" method gave me: 

Mean total repair cost: 260,392,580.66
Lower CL: 221,214,943
Upper CL: 299,570,217

 

While the bootstrapping method came up with this: 

 

Mean total repair cost: 260,400,000
Lower CL: 258,033,216
Upper CL: 262,853,966

 

So the bootstrapping method gives a much tighter CI, but in this case conservative is OK.

 

Thanks again for thinking through this with me everyone. It's been really interesting. 

 

Mike 

 

 

 

Reeza
Super User

Do you have any data on which apartments were renovated and when? There could be a survival-type model to determine whether and when an event will happen, and then a second stage to estimate how much cost would be involved. Or a logistic regression to predict the probability of an event.

 

This is referred to as two-stage regression.
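Purely as an illustration of the idea (the dataset unit_history and the variables had_repair, building_age, and yrs_since_reno are hypothetical; nothing like them exists in this thread), the two stages might look like:

/* Stage 1: probability that a unit needs a repair at all (hypothetical data) */
proc logistic data=unit_history;
   model had_repair(event='1') = building_age yrs_since_reno;
run;

/* Stage 2: expected cost, modelled only on units that actually had a repair */
proc genmod data=unit_history;
   where had_repair = 1;
   model Repair_cost = building_age yrs_since_reno / dist=gamma link=log;
run;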

 

The other option would be what you suggested, which is basically a simulation. If you want to simulate data, I would strongly suggest reading the 'Don't Be Loopy' paper by David Cassell.

 

 

Reeza
Super User
Age of buildings is probably a big factor as well.
righcoastmike
Quartz | Level 8

Hi Reeza, 

 

I'm not sure how the exact numbers were calculated; I've just got the totals. I agree, though, there are definitely a bunch of different variables that need to be taken into account. I think the bootstrapping method is good enough for now. Thanks so much for all your help!

 

Mike 

 

Reeza
Super User
I think the bootstrap only tells you that the data is normally distributed; I don't think it gets rid of any of the initial concerns with the methodology. I would wait for PGStats or Rick to comment, though.
