jitb
Obsidian | Level 7

I am trying to build a confidence interval of a predicted point in a time series that is lognormally distributed. I am predicting the value of just one future time point. My process has been to randomly simulate 1,000 distributions that have the same lognormal parameters (theta, sigma) as the original time series. The simulated distributions have one extra time point, i.e. the point value I am trying to predict. I choose the simulated distribution that has the least weighted average difference from the original series. Once I choose the best simulated distribution, I use the extra time point value as my prediction. Next, I would like to build a confidence interval around this predicted point. Would I be able to do this as

Point Value +- 1.96*sigma/sqrt(n)....at 95% CI?

Any help on this would be much appreciated. Thanks.

1 ACCEPTED SOLUTION

PaigeMiller
Diamond | Level 26

@jitb wrote:

Would I be able to do this as

Point Value +- 1.96*sigma/sqrt(n)....at 95% CI?


I don't think so, not if you want to follow a lognormal distribution, as ±1.96 is likely not a meaningful quantity of a lognormal distribution. Whatever distribution you select from the process you describe, you find the spot on the distribution where 2.5% of the distribution is to the left, and the spot on the distribution where 2.5% of the distribution is on the right. These define the interval you want.

--
Paige Miller
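
As a rough illustration of the percentile idea above, here is a minimal Python sketch. It assumes the chosen distribution is a plain two-parameter lognormal whose log-scale mean and standard deviation are called `mu` and `sigma`; those names, the numeric values, and the helper `lognormal_interval` are hypothetical and not taken from the original post. It looks up the points that leave 2.5% in each tail.

```python
import math
from statistics import NormalDist


def lognormal_interval(mu, sigma, coverage=0.95):
    """Equal-tailed interval of a lognormal with log-scale mean mu and sd sigma."""
    tail = (1.0 - coverage) / 2.0
    z_lo = NormalDist().inv_cdf(tail)        # about -1.96 for 95% coverage
    z_hi = NormalDist().inv_cdf(1.0 - tail)  # about +1.96 for 95% coverage
    # Quantiles of the lognormal are exponentiated quantiles of the normal.
    return math.exp(mu + sigma * z_lo), math.exp(mu + sigma * z_hi)


mu, sigma = 1.0, 0.8  # hypothetical log-scale parameters, for illustration only
lower, upper = lognormal_interval(mu, sigma)
median = math.exp(mu)
print(f"2.5th pct = {lower:.3f}, median = {median:.3f}, 97.5th pct = {upper:.3f}")
```

The resulting bounds are not symmetric around the median on the original scale, which is why a symmetric "point ± something" formula does not describe them.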


jitb
Obsidian | Level 7

Thank you, Paige, for your response. You mean, take 2.5 and 97.5 percentiles of the distribution as the bounds? That makes sense. A further query, would you think taking the mean of the predicted point from the top 100 simulated distributions (based on my weighted score) would give me a better estimate? I guess that's why I was thinking of 1.96 from the central limit theorem. This is a different question, I know. 

PaigeMiller
Diamond | Level 26

Yes, if you are going to average 100 points from the top 100 distributions, then I would think the Central Limit Theorem would apply. You still ought to check how similar these points are by plotting the points and the distributions. For example, if the points somehow turn out to be bimodal (unlikely if they are all lognormal, but you never know), or if there is an extreme outlier or two, then maybe the Central Limit Theorem doesn't get you there.

--
Paige Miller
jitb
Obsidian | Level 7

Hi Paige,

 

Yes...I need to plot the points. I will take your suggestion and use the 2.5 and 97.5 percentiles to construct the CI. Thanks so much for your advice on this!

SteveDenham
Jade | Level 19

You have what you need for a bootstrap estimate of the mean and confidence interval. I wouldn't choose any "best" simulation, as that is going to be strictly a function of the random values used to generate your time series. Instead, your best predictor is simply the mean of the new point across the 1000 simulations, and the confidence bounds would be as @PaigeMiller pointed out: the 2.5th percentile and the 97.5th percentile. You can get all of these with one call to PROC MEANS.

 

SteveDenham
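
To make the suggestion above concrete, here is a minimal Python sketch (not SAS, and not the poster's actual code). It assumes the extra time point from each of the 1,000 simulated series has already been collected into a list; here that list is faked by drawing from a lognormal with hypothetical parameters `mu` and `sigma`. The point estimate is the mean across simulations, and the bounds are the empirical 2.5th and 97.5th percentiles.

```python
import random
from statistics import mean, median, quantiles

random.seed(2024)  # fixed seed so the illustration is repeatable

mu, sigma = 1.0, 0.8  # hypothetical log-scale parameters, for illustration only
n_sims = 1000

# Stand-in for the extra time point taken from each simulated series.
new_points = [random.lognormvariate(mu, sigma) for _ in range(n_sims)]

# n=40 yields 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = quantiles(new_points, n=40)
lower, upper = cuts[0], cuts[-1]

point_estimate = mean(new_points)
print(f"mean = {point_estimate:.3f}, median = {median(new_points):.3f}")
print(f"95% percentile interval = ({lower:.3f}, {upper:.3f})")
```

No single "best" simulation is picked here; every simulated value contributes to both the point estimate and the interval.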

jitb
Obsidian | Level 7

Hi Steve,

 

Yes, I think I will look at the mean and median of the 1,000 observations. I couldn't find a way of getting the 2.5 and 97.5 percentiles from Proc Means, but was able to get them from Proc Univariate with the pctlpts option in the output statement. Thanks so much for your insights on this.

Ksharp
Super User
I don't think so. 1.96 is for the Normal distribution, NOT for the lognormal.
Calling @Rick_SAS
jitb
Obsidian | Level 7

Yes...I am discarding the 1.96 for this. Thanks. 

SteveDenham
Jade | Level 19

1.96 is fine for a large lognormal population, so long as you are doing calculations in the log space. Confidence bounds on the original scale can be obtained by computing the bounds in the log space with the 1.96 factor and then exponentiating them. This is because the variance of the lognormal distribution is not assumed to be a function of the mean, so the logs of the values are assumed to follow a Gaussian distribution. For analysis of variance purposes, this means that the residuals in the log space are normally distributed.

 

SteveDenham
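
A minimal Python sketch of the log-space calculation described above, under stated assumptions: the data are faked by drawing 200 values from a lognormal with hypothetical parameters, and the interval shown is for the mean of the logs. Once exponentiated, it describes the geometric mean (the lognormal median) on the original scale. It is not a prediction interval for a single new point.

```python
import math
import random
from statistics import mean, stdev

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical data standing in for the original series.
values = [random.lognormvariate(1.0, 0.8) for _ in range(200)]

logs = [math.log(v) for v in values]
n = len(logs)
log_mean = mean(logs)
log_se = stdev(logs) / math.sqrt(n)  # standard error of the mean of the logs

# Symmetric 1.96-based bounds in the log space ...
lo_log = log_mean - 1.96 * log_se
hi_log = log_mean + 1.96 * log_se

# ... then exponentiate to return to the original scale.
center = math.exp(log_mean)
lower, upper = math.exp(lo_log), math.exp(hi_log)
print(f"geometric mean = {center:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The back-transformed bounds are symmetric as ratios around the geometric mean, not as differences.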

jitb
Obsidian | Level 7

Yes....thanks for pointing that out, Steve. I get that. My concern is that about 80% of the variable values in the original series are between 1 and 5. The remaining 20% range from 6 to 33. If I take the mean of the 1000 simulated distributions, it will, in most cases, lie between 2 and 3. The confidence interval will be very wide, e.g. between 1 and 13. I'm thinking about how to handle predicting these outliers. Maybe a mixed distribution? I've never done a mixed distribution before. But, thanks for your insights. Much appreciated.

SteveDenham
Jade | Level 19

@jitb  - it might be a mixture, in which case the bootstrap confidence bound is more likely to provide proper coverage.  However, on the log scale, your values range from 0 to about 3.5 with a probable peak around 1.  So a mean on the original scale of 2.7 or thereabouts makes sense.

 

However, I sense something interesting here.  It looks like your raw values are bounded away from 0.  Have you considered a gamma distribution for the values?  It has a closed form mean and variance (small bias involved compared to ML estimators).  And there is a compound gamma distribution (also with closed form estimators) that is essentially a mixture of two gamma distributions having the same mean but differing variance. PROC FMM on the 1000 simulated values with 2 gamma components compared to a single component by AIC sounds like a good approach.
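
As a small illustration of the closed-form point, here is a Python sketch of method-of-moments estimates for a single gamma component. This is a swapped-in technique for illustration, not PROC FMM and not a mixture fit. The sample is simulated with hypothetical shape and scale values. The estimators follow from mean = shape × scale and variance = shape × scale².

```python
import random
from statistics import mean, variance

random.seed(11)  # fixed seed so the illustration is repeatable

true_shape, true_scale = 2.0, 1.5  # hypothetical parameters, for illustration only
sample = [random.gammavariate(true_shape, true_scale) for _ in range(5000)]

m = mean(sample)
v = variance(sample)  # sample variance

# Method-of-moments estimators for the gamma distribution.
shape_hat = m * m / v
scale_hat = v / m

print(f"shape estimate = {shape_hat:.3f}, scale estimate = {scale_hat:.3f}")
```

Comparing one component against two by AIC, as suggested, would still need a proper likelihood-based fit.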

 

In any case, that bootstrap mean is still likely to be your best estimator of central tendency. Using the least weighted average difference should approximate the median, so you could check the 50th percentile value of the bootstrap sample against it. I worry that the "best" may be way out toward one or the other tails.

jitb
Obsidian | Level 7

An interesting suggestion. I will try the compound gamma distribution. Proc Severity is indicating a Burr distribution for the tails. If I do a mixed distribution, would I have all the tails after a certain time period? That would not mimic my original time series well. Thanks, Steve!

