03-19-2013 10:56 AM
Hi, does anybody know a way to calculate the average of many confidence intervals in SAS... for example I have 1000 accounts each with an Estimate and a 95%-Upper and 95%-Lower Confidence Interval value... I want to calculate the average Estimate , the average 95%-Upper and average 95%-Lower (as well as the standard deviation) for that 1000 sample. I don't suspect this is something easy, but hoping... Thanks
03-19-2013 11:00 AM
Assuming your data has variables that contain what you are calling estimate and the upper and lower bounds then I think
proc means data=yourdataset;
var estimate (and the names of your existing bounds variables);
should provide mean and standard deviation for each of those.
03-19-2013 11:12 AM
While that's mathematically correct, I'm not sure its statistically correct.
Podarum, you'd probably need to explain more about what you're trying to do. For example, if it was a simulation, and you have 1000 estimates, the average could be your estimate and the 5 and 95th percentile of the estimates could be your confidence interval. It depends on what you're doing.
03-19-2013 11:22 AM
Reeza, I have 1000 home estimated prices in region=A .. and each of these homes I also have the 95%-U, 95%-L Confidence Interval for each home's estimate. I am trying to get a general 95%-U and 95%-L for those 1000 homes in region=A. So that I can calculate the Type I error of being between my comfort level of Estimate +/- 10% and the Confidence Interval for region=A.
So my final result would be a number (perhaps a risk rating of some sort) of being between my Comfort Level (+/-10% from the Estimate) and the +/-95% Confidence Level.. Thanks
03-19-2013 11:35 AM
I think Reeza is on to something important here. Confidence bounds are built on standard error of the mean estimates and a t value dependent on the degrees of freedom. The confidence bounds aren't additive. Neither are standard deviations. Nor are variances. But the sums of squares that these are all derived from ARE additive. So it is a matter of calculating the sums of squares for error for each account, by back transforming. This will require knowing the degrees of freedom associated with each estimate. If you don't have that, then this method will not work. I haven't checked the code, so bear with me.
This assumes that ucb is the upper bound, estimate is the mean, and you have the df for each. Then calculate the pooled estimate of the mean. The pooled variance is the sum of all the sse's, with the sum of the df's as the degrees of freedom. From the pooled variance, the pooled confidence bounds can be calculated.
03-19-2013 12:58 PM
Does all this change if we know that the estimated prices are dependent on eachother ?..they are home prices ina certain region
03-19-2013 01:03 PM
I think so, since you don't have independence between houses.
Where does your 95% CI for each house come from? I think you're mixing terms here that may not have the standard definition that people are assuming.
My suggestion would be to simulate what would happen under certain assumptions and then use that to figure out what you need. That being said, you can do something mathematically, but my math formula derivation days are behind me (hopefully).
03-20-2013 08:48 AM
It sounds, then, like my back-calculation approach isn't going to work very well, as I would guess that you don't have the degrees of freedom for each CI.
Maybe you could use PROC QUANTREG, with a weight set equal to the inverse of the length of the CI. This would give you an estimate of the 5th, 50th and 95th percentile, I think:
proc quantreg ;
model estimate= /quantile=0.05 0.5 0.95;
weight = invcilength;
where invcilength=1/(ucb-lcb), from your existing dataset.
The output summary statistics will give Q1, median, Q3, mean, sd, and MAD deviation, and the parameter estimates would give the percentiles, which would be the overall confidence interval on the population, rather than the CI on the population of means. This does not address the clustering issue, however. If you assume that the clustering is a fixed effect, you could add a CLASS statement, and include that variable in the model.
03-19-2013 11:22 AM
I agree with Reeza. If you are interested in the distribution of means across you accounts, you should consider
the "mean of the means" as the estimate of the "average mean" and some statistic (standard error, IQR, CI) that indicates the spread of the distribution of the means.
proc means data=yourdataset mean P5 P95 Std;
03-19-2013 03:34 PM
Maybe we should ask you WHY you want to do this. For example, are you trying to find outliers? Cluster similar neighborhoods? Understand variation in prices?
Depending on what you are trying to find out, it might help to visualize the data. If you know the sample size for each neighborhood, you might find Funnel Plots ( Funnel plots: An alternative to ranking - The DO Loop) or Analysis of Means plots ( Comparing funnel plots to an Analysis of Means plot - The DO Loop) useful for comparing the variance of prices in neighborhoods.