12-20-2012 11:47 AM
I am trying to both interpolate using proc expand and extrapolate using proc forecast or proc esm. Since I have a good many missing values, some of the estimates seem very extreme. Is there a way to set confidence limits, or boundaries, for estimates using these procedures? Or are there other procedures that I could use to set boundaries on the estimates? I would like to use cubic splines for interpolation and exponential smoothing (double or triple) for extrapolation.
12-20-2012 12:20 PM
You have different models for interpolating and extrapolating? I have never heard of that. I'd be curious to understand the logic behind this.
I am not aware of methods in PROC ESM to set boundaries on estimates, as this really implies that your model isn't a true exponential smoothing model; nor have I ever seen an exponential smoother that has constraints on the forecasts. You could always apply the boundaries to the forecasts in a subsequent data step, but then I guess the issue is how you come up with the boundaries. If you pick them to suit yourself, you cross over the boundary from empirical statistics to "getting the answer I want".
12-20-2012 12:36 PM
Well, I decided to use cubic splines to interpolate embedded missing values because that is considered one of the best methods to do so (after some Google research). Using cubic splines to extrapolate values at the beginning or end of a series does not produce accurate results, so I decided to use exponential smoothing to extrapolate. I interpolate first, then apply exponential smoothing using proc forecast to get the missing values that remain at the end of the series. The expo method in proc forecast (and proc esm, I believe) takes the entire series into account when producing estimates. Since I have many missing values to start with, applying cubic spline estimation first provides a better dataset to use for the expo smoothing. That is my logic. Of course, if you have other advice I am all ears, as this is my first time tackling this sort of problem. I have read many articles about different alternatives, though. The stepar method may be a better option for forecasting with the type of data that I have; however, I have too many missing values to support that method, even after interpolating. Thus, the expo method is the next best option.
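To make the second step of that logic concrete, here is a minimal sketch in Python (not SAS, and not a reproduction of what proc forecast or proc esm does internally) of Brown's double exponential smoothing used to fill values that are still missing at the end of a series once the embedded gaps have been interpolated. The function name `fill_trailing_gaps`, the smoothing weight `alpha=0.3`, and the choice to initialize both smoothers at the first observation are all assumptions made for illustration.

```python
def fill_trailing_gaps(series, alpha=0.3):
    """Fill trailing None values with Brown's double exponential smoothing.

    `series` is a list of numbers with None marking missing values.
    Embedded gaps must already be filled (e.g. by spline interpolation).
    `alpha` must lie strictly between 0 and 1 (assumed default: 0.3).
    """
    # Index of the last observed value; everything after it is a trailing gap.
    last = max(i for i, v in enumerate(series) if v is not None)
    observed = series[: last + 1]
    if any(v is None for v in observed):
        raise ValueError("interpolate embedded missing values first")

    # Assumed initialization: both smoothed series start at the first value.
    s1 = s2 = observed[0]
    for y in observed[1:]:
        s1 = alpha * y + (1 - alpha) * s1   # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2  # second pass smooths the first

    # Level and trend implied by the two smoothed series.
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)

    # Extrapolate h steps ahead for each trailing missing position.
    filled = list(observed)
    for h in range(1, len(series) - last):
        filled.append(level + h * trend)
    return filled


if __name__ == "__main__":
    print(fill_trailing_gaps([1.0, 2.0, 3.0, 4.0, None, None]))
```

Because the estimates continue the smoothed trend linearly, a long run of trailing gaps can still drift to implausible values, which is the behaviour described in this thread.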
12-20-2012 12:50 PM
Sorry about the extra message, but to address your other comment about boundaries: I am just using the data series that I currently have available. I know proc esm is about modeling, but I don't know much else about the procedure other than that it can also do double exponential smoothing. I just want my results to be as accurate as possible, not "made up". I do know that when I simply use splines to both interpolate and extrapolate, I get ridiculous values. Most of the interpolated estimates are reasonable, with a few exceptions. For the cases where the spline method estimates a value near zero, or an extremely large value that I know is not possible, I was thinking of using the mean of the series. The extremes are the result of many missing values back to back, where the spline method simply continues on the same trend until it reaches a nonmissing value. The main idea is to not have zeros, which in my case are equivalent to missing values. That is why I was trying to set a boundary, perhaps using a confidence interval or something. I may just have to flag estimates that are extreme and use the mean in their stead. I am not forecasting out, by the way, just trying to fill in the gaps.
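The flag-and-replace idea could look something like the following Python sketch (the same logic would fit in a short SAS data step). The function name `replace_extremes` and the `lower` and `upper` bounds are hypothetical; where the bounds should come from is exactly the open question in this thread, so they are left as inputs.

```python
def replace_extremes(observed, estimates, lower, upper):
    """Flag estimates outside (lower, upper) and replace them with the mean.

    `observed`  : the nonmissing values of the original series
    `estimates` : interpolated or extrapolated values to check
    `lower`, `upper` : assumed plausibility bounds supplied by the analyst
    Returns (cleaned, flags), where flags[i] is True if estimates[i]
    was replaced.
    """
    mean = sum(observed) / len(observed)
    cleaned, flags = [], []
    for est in estimates:
        # An estimate at or beyond either bound counts as extreme.
        extreme = est <= lower or est >= upper
        flags.append(extreme)
        cleaned.append(mean if extreme else est)
    return cleaned, flags


if __name__ == "__main__":
    cleaned, flags = replace_extremes(
        observed=[10.0, 12.0, 14.0],
        estimates=[0.0, 11.0, 500.0],
        lower=0.0,
        upper=100.0,
    )
    print(cleaned, flags)
```

Keeping the flags alongside the cleaned values makes it possible to report how many estimates were overridden rather than silently changing them.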
12-20-2012 03:09 PM
Ok, I see now.
What you are doing with the splines to interpolate missing values is what I would call "imputation". Look up the word; a lot has been written on "imputing missing values". Splines are reasonable things to use here, given certain assumptions. (By the way, since I don't understand the entire set of complexities of this problem, I am not ruling out better choices. I just wouldn't want to venture a guess as to what would be best, because I don't know your data or your problem.)
Regarding extrapolation, this is a fact of life for people who extrapolate: the model can deteriorate into ridiculous values. If you don't want zeros -- can I assume you don't want negative numbers either? -- then perhaps you simply need to transform your dependent values before modelling so you can't get a zero, for example by taking logarithms, and then model on that scale. When you predict, even if you get a negative prediction on the log scale, un-transforming gives you a positive value. This doesn't eliminate the possibility that the model will deteriorate into ridiculous values; they just won't be zero or negative. Also, modelling the log of something is not the same as modelling the actual data, but nevertheless it can be a useful thing to do. ("All models are wrong, but some are useful" -- George E. P. Box)
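As a rough illustration of the transform-then-model idea, here is a Python sketch (again, not SAS). It fits an ordinary least-squares straight line to the logged values as a stand-in for whatever model you actually use, then exponentiates the predictions. The function name `log_scale_forecast` and the choice of a straight-line model are assumptions for the example only. Note that the transform needs strictly positive inputs, so zeros that really mean "missing" have to be treated as missing first.

```python
import math


def log_scale_forecast(values, steps):
    """Fit a straight line to log(values) and back-transform the predictions.

    `values` : strictly positive observations, equally spaced in time
               (at least two of them)
    `steps`  : number of periods to extrapolate past the last observation
    """
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")

    logs = [math.log(v) for v in values]
    n = len(logs)

    # Ordinary least squares of log(value) on the time index 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = sum(logs) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(logs))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Predictions on the log scale may be negative; exp() makes them positive.
    return [math.exp(intercept + slope * (n - 1 + h)) for h in range(1, steps + 1)]


if __name__ == "__main__":
    # A halving series: a straight line on the raw scale would go negative,
    # while the log-scale fit keeps shrinking toward zero without reaching it.
    print(log_scale_forecast([8.0, 4.0, 2.0, 1.0], steps=3))
```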