Forecasting Concepts Part 1: PLEASE Don’t Use Ordinary Regression


If you have been to one of my courses where I touch on forecasting, you have heard my rant on this before. Please! Do not use ordinary least squares (OLS) regression to forecast into the future! This article explains why not.

Why Can’t I Use Ordinary Regression?

Sadly, in my checkered past, I have seen innocent, well-meaning folks use time as the independent variable in ordinary least squares regression to accomplish forecasting. Do not do this! One good reason is assumption violations.

Ordinary regression models have a number of assumptions.

Normality
Independence
Homoscedasticity (constant variance)
Linearity

Data with a time component often violate these assumptions, causing the following issues:

Lack of normality affects standard errors and may in some cases also affect the parameter estimates.
OLS regression assumes that the error terms are independent and identically distributed (IID). Time series data violate the independence assumption because of autocorrelation (also called serial correlation). Autocorrelation does not affect the parameter estimates in the limit, although the estimates might be affected for small sample sizes. The standard errors, however, are compromised, so t-statistics and their associated p-values will be wrong.
Heteroscedasticity (variance that is not constant) does not affect the parameter estimates but does compromise the standard errors.
Lack of linearity means you have the wrong kind of model, and your results can be meaningless.

(Information above largely from Forecasting Using SAS Software: A Programming Approach by Dickey & Woodfield)
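To make the autocorrelation point concrete, here is a minimal Python sketch (my own illustration, not taken from the Dickey & Woodfield course). It assumes a hypothetical series made of a linear trend plus AR(1) errors, fits a straight line with time as the independent variable, and computes the Durbin-Watson statistic on the residuals. The function names, the coefficient `phi`, and the simulated data are all assumptions made for the example.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 for uncorrelated residuals,
    well below 2 for positively autocorrelated residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def simulate(n, phi, seed):
    """Hypothetical data: linear trend plus AR(1) errors with coefficient phi."""
    rng = random.Random(seed)
    error = 0.0
    y = []
    for t in range(n):
        error = phi * error + rng.gauss(0, 1)  # each error depends on the last one
        y.append(10 + 0.5 * t + error)
    return list(range(n)), y


def dw_for(phi, n=200, seed=42):
    """Fit OLS on time and return the Durbin-Watson statistic of the residuals."""
    t, y = simulate(n, phi, seed)
    intercept, slope = ols_fit(t, y)
    residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]
    return durbin_watson(residuals)


dw_independent = dw_for(phi=0.0)     # errors are independent
dw_autocorrelated = dw_for(phi=0.8)  # errors carry over from period to period

print("Durbin-Watson, independent errors:   ", round(dw_independent, 2))
print("Durbin-Watson, autocorrelated errors:", round(dw_autocorrelated, 2))
```

In this simulated setting the fitted line can look perfectly reasonable in both cases. It is the residuals that give the problem away: a Durbin-Watson value far below 2 signals that the independence assumption behind the OLS standard errors does not hold.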

Recall from my earlier article on goal-seeking and scenario analysis that homoscedasticity (also called homogeneity of variance) means that the error variance is constant. Heteroscedasticity is the opposite: the error variance changes.

Data

Traditional regression methods are performed on data at a single point in time, or where time is not even considered. Cross-sectional data would be appropriate for this.

Cross-sectional data – a collection of observations for multiple individuals at a single point in time. The following table shows an example.

Aside: For perspective, what is 7,500 calories? Well, it could be eight 12-ounce steaks.

Or…four Valentine’s boxes of chocolates.

Let’s just say that it would be much easier to eat four boxes of chocolates in a day. Don’t ask me how I know this.

Recall that time series analysis requires a historical set of data that includes repeated measures.

Longitudinal and panel data – observations on individuals (or other entities) measured repeatedly over time. If more than one individual is measured over time, these are called panel data, also known as cross-sectional time series data. These data may be transactional, that is, measured at various times for each individual rather than at fixed intervals. The following table provides an example.

Longitudinal (panel) data are useful for distinguishing cohort effects from aging effects. Let’s say an effective reading program was begun three years ago in a new public kindergarten. In a cross-sectional study comparing reading level by age, we may find 9-year-olds to be poorer readers than 8-year-olds. This can be a cohort effect (the 9-year-old cohort did not get the early reading training that the 8-year-old cohort got) as opposed to an aging effect. If the reading level of students is measured over time for the SAME individuals, as in a longitudinal study, the cohort effect is removed and the aging effect can be accurately measured. (Adapted from Diggle et al. 1994.)

Time series – an indexed set of data over equally spaced time periods. Note that this example is for illustration purposes only, and you would absolutely never conduct a forecast from only three periods!

Many time series analyses require that you create a time series from the transactional data. This is commonly done by taking the average, minimum, or maximum for given time periods. In the fictional example above, I have taken the average for each month.
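As a rough sketch of that accumulation step, the Python snippet below averages irregularly timed observations into one value per month. The weigh-in dates and weights are hypothetical values invented for this example (they are not the figures from the table above), and `monthly_average` is a name I made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional data: (date of weigh-in, weight in pounds).
transactions = [
    (date(2021, 1, 3), 150.0),
    (date(2021, 1, 20), 152.0),
    (date(2021, 2, 7), 151.0),
    (date(2021, 2, 8), 153.0),
    (date(2021, 2, 25), 152.0),
    (date(2021, 3, 14), 154.0),
]


def monthly_average(records):
    """Group records by (year, month) and return the mean value per period,
    sorted chronologically."""
    buckets = defaultdict(list)
    for day, value in records:
        buckets[(day.year, day.month)].append(value)
    return [(period, sum(values) / len(values))
            for period, values in sorted(buckets.items())]


series = monthly_average(transactions)
for (year, month), average in series:
    print(f"{year}-{month:02d}: {average:.1f}")
```

Swapping the mean for `min` or `max` gives the other accumulations mentioned above. Note that this sketch simply skips any month with no observations, so a real workflow would still need to check that the resulting periods are equally spaced with no gaps.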

Using our weight illustrates the point that subsequent measures are not independent. How much I weigh this month is not independent from how much I weighed last month.
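One simple way to see this dependence is the lag-1 autocorrelation, which measures how closely each value tracks the one before it. The sketch below uses a hypothetical series of twelve monthly weights (made-up numbers, for illustration only) and the usual sample formula.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation: covariance between each value and its
    predecessor, scaled by the overall variation around the mean."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((value - mean) ** 2 for value in series)
    return num / den


# Hypothetical monthly weights in pounds.
weights = [150.0, 151.0, 151.5, 152.0, 153.0, 152.5,
           153.5, 154.0, 155.0, 154.5, 155.5, 156.0]

r1 = lag1_autocorrelation(weights)
print("Lag-1 autocorrelation:", round(r1, 2))
```

For independent observations this value would sit near zero. Here it is strongly positive, partly because the made-up series drifts upward, which is exactly the kind of month-to-month carryover that OLS regression assumes away.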

Doing it the Right Way

If OLS regression is the wrong way, then what is the right way? You must:

Ensure that your data are a proper time series, or turn them into one (remember…equally spaced periods).
Evaluate the time series, for example for trends and cycles.
Use methods such as exponential smoothing and ARIMA.
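To give a flavor of the last step, here is a minimal Python sketch of simple exponential smoothing, the most basic member of that family. It is a plain illustration under stated assumptions, not how any SAS procedure implements it: the smoothing weight `alpha`, the choice to start the level at the first observation, and the monthly values are all hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Each new level is a weighted average of the latest observation and the
    previous level. The final level serves as the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    level = series[0]  # assumption: initialize the level at the first value
    levels = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


# Hypothetical, equally spaced monthly averages.
monthly = [151.0, 152.0, 154.0, 153.0, 155.0, 156.0]

levels = simple_exponential_smoothing(monthly, alpha=0.5)
forecast = levels[-1]
print("Next-month forecast:", forecast)
```

A larger `alpha` reacts faster to recent observations, while a smaller one smooths more heavily. Simple exponential smoothing does not model trend or seasonality, so series with those features call for the extended smoothing models or ARIMA.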

I will discuss this further in future posts.

Proper forecasting methods are available in SAS forecasting tools, making it easy for you to use them. For more specifics on forecasting with SAS tools, visit the forecasting courses listed at the end of this article and my summary of this in Forecasting Concepts 4.

Sources and Additional Information

Diggle, Peter J., Kung-Yee Liang, and Scott L. Zeger. 1994. Analysis of Longitudinal Data.
SAS Education Course Forecasting Using SAS Software: A Programming Approach by Dickey & Woodfield.

Unsolicited Advice for Valentine’s Day:

DON’T: Eat half of the chocolates in a heart-shaped box and then give your true love a half-empty box.

DON’T: Re-gift a box of chocolates that your ex gave you last year to your new love.

DON’T: Decide at 6 pm that you will go out to your favorite restaurant on Valentine’s Day.

DO: Make a reservation or plan to go out on a different night.

DON’T: Take your true love to McDonald’s on Valentine’s Day if you are over the age of 11.

DO: Write a love poem.

DON’T: Print it out in

DO: Print it out in

And finally, what you’ve always wondered…the correct answer to “Does this outfit make me look fat?” is “Of course not; you look amazing, honey!” followed by, “Try this chocolate truffle.”

You’re welcome.

Hessner · ‎07-23-2021

Thanks! Great article.
