03-16-2015 04:00 PM
Hello everyone. I am struggling with a conceptual question that I came up with. The best way I can explain it is through an example; please let me know if anything is hard to follow or if you have any follow-up questions.
Say a company has data whose distribution is HIGHLY skewed: something similar to an exponential or lognormal, only more extreme. Now suppose the distribution is so skewed that its mean is higher than its 99th percentile (i.e., one or two EXTREME high values pull the mean far above the rest of the distribution).
By definition, if this distribution were used to forecast a future value (i.e., a random draw from the distribution), would it be true that the mean would NOT be in the 95% prediction interval?
In my mind, a 95% prediction interval is a range that 95% of all future values will fall within. For any distribution, this should run from the 2.5th percentile (0.025 quantile) at the lower bound to the 97.5th percentile (0.975 quantile) at the upper bound. If the mean is higher than the 97.5th percentile, then the mean would not be within the '95% prediction interval'.
Am I thinking of this incorrectly? It seems strange to report a forecast as
Mean Forecasted Value: 6,000,000
95% Prediction Interval: [400,5000].
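As a rough illustration of the scenario above: a lognormal distribution has a closed-form mean and closed-form quantiles, so the comparison can be checked directly. The following is a minimal Python sketch using only the standard library. The function names and the parameter values (mu = 0 with sigma = 1 versus sigma = 5) are assumptions chosen for illustration, not values taken from any real data.

```python
import math
from statistics import NormalDist


def lognormal_mean(mu, sigma):
    """Mean of a lognormal whose log has mean mu and standard deviation sigma."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_quantile(p, mu, sigma):
    """p-quantile of the same lognormal, via the standard normal quantile."""
    z = NormalDist().inv_cdf(p)
    return math.exp(mu + sigma * z)


def mean_inside_interval(mu, sigma, coverage=0.95):
    """Check whether the mean lies inside the equal-tailed interval.

    Returns (is_inside, (lower, mean, upper)).
    """
    alpha = 1 - coverage
    lower = lognormal_quantile(alpha / 2, mu, sigma)
    upper = lognormal_quantile(1 - alpha / 2, mu, sigma)
    mean = lognormal_mean(mu, sigma)
    return lower <= mean <= upper, (lower, mean, upper)


# Illustrative (assumed) parameters: a moderate and a very heavy tail.
for sigma in (1.0, 5.0):
    inside, (lower, mean, upper) = mean_inside_interval(0.0, sigma)
    print(f"sigma={sigma}: interval=[{lower:.4g}, {upper:.4g}], "
          f"mean={mean:.4g}, mean inside: {inside}")
```

For a lognormal, the mean exp(mu + sigma^2/2) exceeds the 97.5th percentile exp(mu + 1.96*sigma) once sigma is larger than about 3.92, so the situation is possible but requires a very heavy tail.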
How this came up:
I work for a company that uses a 6-month moving average of a given metric (days on market) to 'forecast' the days on market of the next sold property. No tests of residual normality (or any statistical tests at all, really) were performed before this method was adopted. I have now been tasked with adding "prediction intervals" to the forecast (using the 6-month mean as the predicted value).
There are formulas that one can use for prediction intervals, but they are only valid for normally distributed residuals. The residuals in my data are FAR from normally distributed (they are closer to log-normal). I was not sure whether I should instead run a Monte Carlo simulation: randomly sample from the error distribution, add each draw to the "estimate", and then take the 0.025 and 0.975 quantiles of the resulting distribution as the forecast interval. I then realized that, by definition, this interval doesn't have to contain the mean, which led me to the more extreme (and simplified) example given above.
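The resampling idea described above could be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not a vetted forecasting method. It assumes residuals are defined as actual minus forecast and that past residuals are representative of future errors. The function names, the residual values, and the point forecast of 45 days are all hypothetical.

```python
import math
import random


def quantile(sorted_values, p):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = p * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def simulated_prediction_interval(point_forecast, residuals, coverage=0.95,
                                  n_draws=10_000, seed=0):
    """Resample past residuals, add them to the forecast, take tail quantiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    simulated = sorted(point_forecast + rng.choice(residuals)
                       for _ in range(n_draws))
    alpha = 1 - coverage
    return quantile(simulated, alpha / 2), quantile(simulated, 1 - alpha / 2)


# Hypothetical right-skewed residuals (actual minus forecast), in days.
past_residuals = [-20, -15, -12, -10, -8, -5, -3, 0, 2, 5, 9, 15, 30, 70, 200]
low, high = simulated_prediction_interval(45.0, past_residuals)
print(f"approximate 95% prediction interval: [{low:.1f}, {high:.1f}]")
```

With plain resampling like this, the simulated interval converges to the point forecast plus the empirical 0.025 and 0.975 quantiles of the residuals, so the simulation mainly earns its keep when the error model is more elaborate than a single pooled set of residuals.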
I am just curious if I am way off base in my thinking. Thank you for your time, and please let me know if I can clarify my question further!
03-20-2015 10:00 AM
A few observations:
1. The definition of a prediction interval says nothing about the location of the mean. Therefore, while it is not common, the interval does not need to contain the mean. You will have some explaining to do, though, as to why that is the case. See also 2.
2. Chebyshev's inequality bounds the probability that a random variable lies k or more standard deviations away from its mean: that probability is at most 1/k^2. That means that, in practice, a case such as the one you are describing is not very likely to happen. You would need a positive mass of probability on a range of values that is extremely far out in the right tail. Note that Chebyshev's inequality is a rather loose bound for most common distributions whose tails decline fast.
3. In practice, cases where the data have a skewed distribution are often handled by applying a suitable transformation and then back-transforming the results to the original scale. For example, if your data have a lognormal distribution, apply a log transformation. See also Data transformations - Handbook of Biological Statistics
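The transform and back-transform approach in point 3 might look roughly like the sketch below, written in Python with only the standard library. It assumes the values are strictly positive and approximately lognormal, and it uses a normal quantile rather than a t quantile, which is an approximation that is only reasonable for larger samples. The function name and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def lognormal_prediction_interval(values, coverage=0.95):
    """Prediction interval computed on the log scale, then back-transformed.

    Returns (lower, center, upper) on the original scale.
    Assumes every value is > 0; a zero or negative value would make
    math.log raise ValueError.
    """
    logs = [math.log(v) for v in values]
    n = len(logs)
    m = mean(logs)
    s = stdev(logs)  # sample standard deviation of the logs
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    # sqrt(1 + 1/n) accounts for uncertainty in the estimated log-scale mean.
    half_width = z * s * math.sqrt(1 + 1 / n)
    return math.exp(m - half_width), math.exp(m), math.exp(m + half_width)


# Hypothetical days-on-market values.
sample = [10, 20, 40, 80, 160]
lower, center, upper = lognormal_prediction_interval(sample)
print(f"interval=[{lower:.1f}, {upper:.1f}], back-transformed center={center:.1f}")
```

The back-transformed center exp(m) is the geometric mean of the data, which estimates the median of a lognormal rather than its arithmetic mean. The interval is therefore asymmetric on the original scale, and the arithmetic mean can sit well to the right of the center.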
I am not sure what you are using for your analysis. SAS Forecast Server can apply a transformation for you and back-transform the results for you. So can PROC ESM in SAS/ETS.