About chwell

chwell · 08-21-2025

Welcome to the third installment of our State Space Model (SSM) series. The first post in this series (Adventures with State Space Models: Introduction) introduced SSMs as a collection of independent, additive components and detailed the differences between the two component types: dynamic and static. The second post (Adventures with State Space Models 2: More Dynamic Components and Details) focused on accommodating more than one dynamic component in the model and presented some necessary details. This post builds on the previous two and focuses on fitting an SSM with a dynamic input variable component.

The dynamic component we’ll introduce here evolves as a function of time, just like the dynamic components introduced in previous posts. However, for many analysts, familiarity with interpreting ordinary regression models makes the idea of dynamic inputs non-intuitive at first. A dynamic input variable’s estimated effects can vary across time. For example, in an SSM that specifies units of sales as a function of a dynamic price input, a trajectory or timeseries of estimated price effects is produced. This is more general, and potentially contains much more useful information, than the single point estimate produced by a static, linear model.

This post starts with an example of estimating a static model of units of sales as a function of price and other variables. We’ll then generalize the model to include price as a dynamic input.

Data: the Soda data consists of weekly observations on CASES, case sales of soda. Other variables include the case price charged, OWNPRICE, competitor prices, COMPPRICE, and a binary variable that codes promotional activity, PROMOTION. The sales and price variables have been log transformed, and there are about 4 years of data.

The Static Price Input Model: the SSM Procedure syntax specifies LNCASES as a function of ordinary, static regressors LNOWNPRICE, LNCOMPPRICE and PROMOTION in the MODEL statement.
The model does contain one possibly dynamic component, specified with the TREND statement. LOCALLT is a local linear (LL) trend type. More details are given below.

Details on the Local Linear trend component: the LL trend type is a more general version of the random walk (RW) trend introduced in this series’ previous post. The LL trend can be specified as follows:

$$\mu_{t+1} = \mu_t + \beta_t + \eta_t, \qquad \eta_t \sim N(0, \sigma^2_{\mu})$$

$$\beta_{t+1} = \beta_t + \xi_t, \qquad \xi_t \sim N(0, \sigma^2_{\beta})$$

Dynamic trend characteristics are a function of the equations’ variances. The top equation is the Level, and the bottom equation is the Slope. If both variances, sigma-squared MU and sigma-squared BETA, are zero, the LL trend reverts to a deterministic linear trend. If the slope equation variance is zero and the level equation variance is non-zero, Beta becomes a constant, and the LL trend reverts to a random walk with drift. Setting sigma-squared MU to zero, as shown in the syntax, results in an integrated random walk (IRW) trend representation if the slope equation variance is estimated to be non-zero. Here, the IRW was identified as the best trend representation for the data through a process of trial and error.

The model’s estimation results indicate a negative relationship between OWNPRICE and CASES, as expected. Because both variables have been log transformed, the parameter estimate can be interpreted as an elasticity: a one percent increase in OWNPRICE leads to a 1.217 percent decrease in case sales, on average over the range of the data.

The estimated (Slope equation) variance of the IRW trend is significantly different from zero, indicating that the trend representation is dynamic. Reported Information Criteria provide baseline fit measures.

The Dynamic Price Input Model: now, we want to estimate a relationship between case sales and own price that can change as a function of time.
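As a rough illustration of what a relationship that changes over time means, the sketch below lets a price effect follow a random walk and multiplies it by a log price input at each interval. It is a toy simulation, not the SSM Procedure's estimation: the function name, the price values, the starting effects (loosely echoing the elasticities discussed in this post) and the standard deviation are all assumptions chosen for illustration.

```python
import random


def simulate_dynamic_effect(log_price, start_effect, effect_sd, seed=0):
    """Simulate a random-walk price effect and its contribution to log sales.

    effect[t+1] = effect[t] + disturbance, with disturbance ~ N(0, effect_sd**2).
    The contribution at time t is effect[t] * log_price[t].
    """
    rng = random.Random(seed)
    effects = []
    contributions = []
    effect = start_effect
    for x in log_price:
        effects.append(effect)
        contributions.append(effect * x)
        # With effect_sd = 0 the effect never moves: the static regression case.
        effect += rng.gauss(0.0, effect_sd)
    return effects, contributions


# Hypothetical log prices for eight weeks.
log_price = [1.0 + 0.01 * t for t in range(8)]

# Zero variance: one constant effect, as in an ordinary regression.
static_effects, _ = simulate_dynamic_effect(log_price, -1.2, 0.0)

# Positive variance: a trajectory of effects, one per week.
dynamic_effects, dynamic_contrib = simulate_dynamic_effect(log_price, -0.35, 0.05)
```

With a zero standard deviation every element of `static_effects` is the same number, which is the static model; with a positive one, `dynamic_effects` is a different value each week, which is the kind of trajectory the dynamic model estimates.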
Recall from the previous post that dynamic components start as STATE equation elements, that they require a variance that regulates the way they evolve, and that they need to be mapped into the domain of the dependent variable via a COMPONENT statement. For commonly used dynamic model components, the TREND statement does all this for us. However, there is no ready-made specification for a dynamic input variable component. The following syntax implements the listed steps.

New syntax:

The PARMS statement creates a parameter to be estimated in the model, PVAR. This is the variance associated with the dynamic price effect, and it’s restricted to be non-negative.

The STATE statement creates a one-dimensional state element named PRICEEFFECT. Since it’s one-dimensional, specifying the T matrix as an identity (I) implies that this state element evolves as a random walk. The variance, PVAR, enters via the COV option. As described in the previous post, state elements are recursions, so we need an approach for providing starting values. The A1 option declares the start values to be unknown or diffuse. We’ll cover starting values and other details related to the Kalman Filter in a future post.

The first COMPONENT statement creates the (possibly) dynamic component OWNPRICE that enters the model. Here, the state element PRICEEFFECT is mapped into the observation equation by multiplying it by the LNOWNPRICE input variable.

OP_ELASTICITY, created in the second COMPONENT statement, is not used in the model. However, it’s still a valid component with an associated standard error estimate and confidence limits. We’ll use this component below to interpret how the relationship between own price and case sales evolves over time.

The OWNPRICE component replaces LNOWNPRICE in the MODEL statement.

The model’s estimated parameters are shown below. The significance of PVAR indicates that the OWNPRICE component is dynamic. However, the trend, LOCALLT, has become static with OWNPRICE in the model.
The estimated effects of the static input variables are roughly the same as in the previous model. Measures of the penalized, overall fit have improved substantially relative to the baseline. It appears that we have a better fitting model, and that the relationship between case sales and price is dynamic. Now, we’ll see what further information the model can provide about this estimated relationship.

The OP_ELASTICITY component is plotted using the following syntax. The relationship between price and sales is estimated to be inelastic, at about -0.35, in the early data. Consumers became more price sensitive over time. Own price elasticity reaches its largest magnitude, -1.15, in the week beginning 12JAN2002. During this week, consumer responses can be described as marginally elastic: a 1% increase in price is estimated to lead to a 1.15% decrease in sales. To summarize, the producer starts with a fair amount of pricing power, but it diminishes over time.

It will also be interesting to discover whether consumers become more or less price sensitive at certain times of the year. To explore this, the following syntax accumulates the OP_ELASTICITY estimates to a month interval using an average accumulation method and then produces a seasonal (SC) decomposition. The seasonal decomposition is additive in this case, so the Seasonal or SC values are denominated in units of the original series (OP_ELASTICITY) and re-scaled around zero. There are twelve unique seasonal component measures, one for each month of the year. While the SC values are proportionally small compared to the OP_ELASTICITY estimates, the following inferences seem reasonable: the producer has the most pricing power in December (_SEASON_=12), on average, and consumers tend to be most price sensitive in April.

In the first three posts in this series, we’ve focused on advantages of SSMs, like flexibility and interpretability, in a one-Y-at-a-time, or univariate, context.
The next post in this series introduces another advantage of SSMs: their facility in accommodating multivariate relationships. For our multivariate demonstration, we’ll be traveling back in time to the Yukon to explore population dynamics, so stay tuned for more SSM action! Find more articles from SAS Global Enablement and Learning here.

chwell · 05-28-2025

The first post in this series (Adventures with State Space Models: Introduction) introduced State Space Models (SSMs) and described how the various signal components in the data, like trend, seasonality and input variable effects, can be modeled individually. Modeling components individually provides information about how each one evolves over time, and, because the component models are additive, an overall forecast for the dependent variable is produced as a combination of the individual component forecasts.

In this post, we’ll discuss SSMs in more detail. The discussion remains pretty basic, component-wise, and the demonstration focuses on adding a dynamic, seasonal component to a model that’s similar to the one described previously. Additional dynamic components provide a straightforward way to introduce further details and to describe how different types of components are accommodated in the SSM.

Let’s revisit the scaled-down SSM specification from last time. Recall that equation 1 is the Observation equation, and that equation 2 is the State equation. Equation 2 specifies how the model gets from one time interval to the next and regulates the model’s dynamic components. In the previous demonstration, the State represented a one-dimensional Level component for a timeseries, Sales. A random walk was a reasonable way to represent how the Level for Sales evolved over time, so we just renamed Alpha and built the model without thinking about the State details too much.

Adding more dynamic components to the model means that the dimension of the State will need to increase. We’ll discuss what this implies for the State elements, the variance parameters and so on. We’ll also describe ways to put general restrictions on how dynamic components in the model evolve.
Finally, we’ll discuss how the State enters, or maps into, the Observation equation. This information isn’t critical for the model that we’ll build here, but it will be useful for future demonstrations.

First, let’s take a look at the data used in this post’s demonstrations. The plot shows quarterly observations on Widget sales that start in Q1 1990 and run through Q2 2021. The input, Promo, is a binary variable that flags three quarters in the data. There’s no discernible, linear looking trend, but a fairly strong seasonal component is evident.

Taking a closer look at the seasonal pattern in the Widgets data, the plot shows a deterministic or static representation of the seasonal pattern. It’s derived as an additive seasonal decomposition. Since the decomposition is deterministic, there are only four unique seasonal values, as shown below. Q1 is the seasonal peak quarter, and it’s about 5.7 units (Widgets) above the annual average. The seasonal trough in Q3 is about 6.7 units below the annual average. Note that zero represents the annual average (see the plot above); the four unique seasonal factors (SC) sum to zero.

Adding a Dynamic Seasonal Component to the Model

In SSMs, model identification is relatively straightforward. We’ve determined that, at least to start, the model needs to accommodate three components: a static input, Promotions, and two possibly dynamic components that will capture the level and the seasonal patterns. We’ll fit the model in the SSM Procedure (SAS/ETS) as follows.

If you’ve read the first post in this series, the following statements will be familiar:

The procedure statement lists the input table that contains the Widgets and Promo variables.

The ID statement identifies the time ID variable, Date, and specifies the desired interval of the data.

The TREND statement creates the dynamic, level component in the model; a random walk (rw) component named Loclin is specified. There are no added restrictions on this component’s variance.
The MODEL statement specifies the Observation equation.

OUTPUT generates the table containing forecasts, standard errors, CI and so on for Widgets and the model’s components. The model components are the terms listed to the right of the equal sign in the MODEL statement.

STATE and COMPONENT are two new statements. These statements work together to create the model’s Seasonal component, Dyn_Seasonal.

The STATE statement defines a piece of the model’s State (the other piece is allocated to Loclin via the TREND statement) and names it State_Seas. The Type option defines this part of the State to be seasonal with the default season cycle length for quarter interval data, 4. The Cov(g) option specifies that this block of the State has a general variance term associated with it that needs to be estimated.

The COMPONENT statement converts the State elements, defined by State_Seas, into a component named Dyn_Seasonal that can enter the Observation equation.

There are two parameters, or variances, estimated in the model. The State Dimension of this model is 4. More details on the State Dimension are presented below. The variance estimate associated with State_Seas is significant, which indicates that the seasonal component, Dyn_Seasonal, is dynamic and not deterministic. The variance estimate for Loclin is borderline in terms of significance. However, the plot of this component, below, indicates that a best-fit flat line is not a good match for this component’s evolution over time. The Regression Parameter Estimate for Promo represents the average impact on Widget sales for the three instances where Promo = 1.

Note that, in SSMs, the terms ‘Model Parameters’ or simply ‘Parameters’ refer to variances or hyper-parameters that regulate the model’s dynamic properties. This is the reason that the Number of Parameters is listed as 2 in the Model Summary.

The plot below shows the evolution of the Dyn_Seasonal component.
The amplitude of the pattern is relatively small from 1993 to 1995 and increases in the most recent data. The plot below shows the evolution of the trend component, Loclin, over time. The forecast for Widgets (Loclin, Dyn_Seasonal and Promo) overlaid on the Loclin and Dyn_Seasonal components is below.

A Look at the Details

Now, we’re going to specify and fit essentially the same model, but in a more detailed and involved way. What follows is not a recommendation for the best way to fit a model with dynamic Seasonal and Level components in the SSM Procedure. As we’ve seen in the syntax above, developers have provided convenient statements and options that automatically handle the creation of commonly used model components. However, taking a more detailed or ‘roll-your-own’ approach for the model’s Seasonal component will hopefully provide intuition on parts of the model we’ve abstracted from so far and build a foundation for creating more general and interesting SSMs moving forward.

First, we’ll revisit the scaled-down SSM specification and generalize it a bit. We’ve added a term, or matrix, Z to the Observation equation. Z is the State Effect, and its main job is to map State elements into the domain of the Observation equation. This is what the COMPONENT statement was doing in the syntax above. The COMPONENT syntax peeled off the first element of the three-dimensional State (more below), State_Seas, to create the model component Dyn_Seasonal. State elements must be converted into model components before entering the Observation equation.

A T matrix has been added to Equation 2, and it’s called the State Transition. A new instance of the State is obtained by multiplying its previous instance by the square matrix T. In the previous post, the State was a one-dimensional random walk, so T was just the scalar, 1.

There are a couple of details about SSMs that are useful to keep in mind. First, the T matrix must be square. We’ll see why below.
Second, for the model to accommodate dynamic components, the raw materials or State elements that make up these components need to be specified in the State. The State, regardless of dimension, is a recursion, so State elements need to be written in this form.

Let’s keep these details in mind and build some intuition with the specification of dynamic components in the SSM. We’ll start by revisiting the deterministic seasonal decomposition described above. Because the four seasonal factors were derived using an additive decomposition, they sum to zero and are denominated in the units of the dependent variable, Widgets. For the listed seasonal factors to enter the SSM as a dynamic seasonal component, they need to be specified as part of the State. Two things need to happen: first, we need to add a variance term to convert the seasonal representation from deterministic to dynamic; second, the equation needs to be re-written as a recursion.

First, we’ll add the variance. Any four sequential seasonal factors represent a full cycle. A dynamic representation can be written as follows. Now, the seasonal factors sum to zero in the mean. Gamma is a normally distributed random variable with mean zero and standard deviation Sigma. If Gamma’s variance is zero, then the pattern reverts to being deterministic.

Next, we’ll re-write the equation as a recursion and fit it into the State equation listed above. Any seasonal factor can be specified as a linear combination of the preceding three factors plus the random variable, Gamma. The weights in the linear combination are all -1. Accommodating the State Transition, or T matrix, the linear combination looks like the following.

This is OK, but we’re not quite there yet. Remember, the T matrix must be square! In this example, the cost of making the T matrix square is adding two identities (e.g., multiply the second row of T by the current value of the State, element-wise, and sum to get the second value of the State at time t+1).
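The recursion just described can be sketched in a few lines of plain Python. This is an illustration only, not what the SSM Procedure does internally. It assumes the seasonal State holds the current factor and the two preceding ones, so the first row of T carries the three -1 weights and the other two rows are the identities. The Q1 and Q3 factors come from the decomposition above; the Q2 and Q4 values are hypothetical, picked so the four factors sum to zero.

```python
import random

# Square 3x3 State Transition for a quarterly seasonal block.
# Row 1: next factor = -(sum of the three most recent factors).
# Rows 2 and 3: identities that shift the existing factors down one slot.
T = [[-1, -1, -1],
     [1, 0, 0],
     [0, 1, 0]]


def step(state, gamma=0.0):
    """Advance the seasonal State one interval: T * state + (gamma, 0, 0)."""
    new_state = [sum(T[i][j] * state[j] for j in range(3)) for i in range(3)]
    new_state[0] += gamma  # the disturbance only enters the first element
    return new_state


def seasonal_path(q1, q2, q3, n_steps, gamma_sd=0.0, seed=0):
    """Return the seasonal factors for three starting quarters plus n_steps more."""
    rng = random.Random(seed)
    state = [q3, q2, q1]  # most recent factor first
    path = [q1, q2, q3]
    for _ in range(n_steps):
        gamma = rng.gauss(0.0, gamma_sd) if gamma_sd > 0 else 0.0
        state = step(state, gamma)
        path.append(state[0])
    return path


# Zero variance: the pattern is deterministic and repeats every four quarters.
deterministic = seasonal_path(5.7, 2.0, -6.7, n_steps=9)

# Positive variance: the pattern is allowed to drift from cycle to cycle.
dynamic = seasonal_path(5.7, 2.0, -6.7, n_steps=9, gamma_sd=0.5)
```

In the deterministic run the fourth value is the Q4 factor implied by the sum-to-zero constraint, and the same four values repeat. In the dynamic run the factors sum to zero only in the mean, so the seasonal shape changes gradually.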
The variance of the random variable Gamma has also been converted into a 3x3 covariance matrix. That takes care of the parts of the State equation that regulate how the seasonal pattern evolves.

Recall, there’s also a dynamic Level component. It’s specified as a random walk using the TREND statement in the model we fit above. Now, expand the State equation to include the level elements. Notice that, as State elements are added, we get a block diagonal structure in the T and COV matrices. It’s worth pausing for a minute to consider this. The introduction to this series of posts began with the statement that the various signal components in the model, like seasonality and trend, are mutually independent, and that the component models are additive. The structure of the State equation with two dynamic elements in the T and COV matrices provides a straightforward illustration of how this works. Also note that the State has four elements.

Now, we’ll use the SSM Procedure to specify and fit the model with dynamic Level and Seasonal components, implementing some of the details we just described.

The PARMS statement creates a model parameter to be estimated named GAM_VAR. This represents the variance associated with the State’s seasonal elements, and it’s restricted to be non-negative.

The first ARRAY statement creates the 3x3 block of the T matrix corresponding to the seasonal elements and names it TMAT.

The second ARRAY statement creates the 3x3 block of the COV matrix corresponding to the seasonal elements and names it COVMAT.

The STATE statement defines the part of the State equation that regulates the seasonal pattern. This part of the State is 3-dimensional (see above), with T and COV block elements defined in the ARRAY statements.

The COMPONENT statement creates the Seasonal component that enters the Observation equation and names it DYN_SEASONAL.
The notation indicates that this component is equal to the first element (the one in brackets) of the three-dimensional STATE_SEAS pre-multiplied by 1. In this case, Z is just the scalar, 1.

The other statements are the same as in the previous model. Because the State blocks are independent, we don’t need to worry about details like the order in which they are specified. Our job is to specify the individual State blocks that define the dynamic components. The software figures out how to put them together to produce the overall State equation.

The Widgets forecast and the forecasts of the model’s individual components are essentially the same as before. The purpose of presenting the ‘roll-your-own’ Seasonal component in an SSM was to describe details that will be useful in future posts on this topic. Now, we’ve got the foundation to move forward to more interesting models. Dynamic input variables are up next, so stay tuned for more SSM action!

chwell · 03-24-2025

State Space Models (SSMs) are unique among sequence data modeling algorithms; they combine advantages of machine learning algorithms, like flexibility and scalability, with the interpretability and control associated with more traditional time series models. In SSMs, the various signal components in the data, like seasonality, trend, input variable effects and so on, can be modeled individually. This approach provides details and insight about how each component evolves over time. Component models are additive, so component predictions are easily combined into an overall forecast for the target. This post will focus on a univariate (in Y) example, but generalizing the SSM to model multivariate timeseries is straightforward, as we’ll see in subsequent posts in this series.

The purpose of this series of blogs is to introduce SSMs as a toolbox for applied analysts. The goal of this post is to introduce the modeling framework and build intuition. SSMs look a little involved at first glance, so we’ll describe the general specification later and start with a scaled-down version and a simple example. Once we’ve built a foundation, subsequent posts will discuss more general and interesting examples.

Fundamentally, SSMs can contain both static and dynamic components. Here, dynamic means that a component varies as a function of time. As an example of a static component, consider price as an ordinary input variable in a timeseries regression model that forecasts units of sales. The values of the price and sales variables both vary over time, but their relationship, usually denoted with the parameter beta (see below), does not. This means that a $1.00 increase in price has the same impact on the model’s sales predictions, regardless of the value of the time index or time of year.
However, it’s not hard to think of an example where this static relationship is too restrictive; consumers tend to be much less price sensitive for some items in the weeks leading up to Christmas than in other, non-holiday, times of the year. A trajectory of price effects is needed to adequately model this relationship. We’ll see an example of estimating an SSM with dynamic regressors in a subsequent post.

To start, a simple SSM can be written as follows.

$$y_t = x_t \beta + \alpha_t + \epsilon_t \qquad (1)$$

$$\alpha_{t+1} = \alpha_t + \eta_t \qquad (2)$$

Equation 1 is called the Observation equation. It includes variables that we observe; y represents values of the dependent variable, and x represents values of the independent variables, both recorded over time. The relationship between y and the listed input variables (x may be a vector) in the Observation equation is static; note that Beta does not have a time subscript. Other terms include Epsilon, a white noise error term, and the State, Alpha.

Equation 2 is the State equation. The State equation provides a rule for getting the model from one time interval to the next and regulates how the components in the model evolve over time. Eta is a random disturbance term. The State equation specifies the dynamic parts of the model.

To focus ideas, let’s replace y with sales and model it as a function of a static input, promotions (promo). We’ll rename Alpha and use the State equation to represent how the level of sales varies over time, independently of promotions.

According to equation 2, the level (of sales, independent of promotional effects) next interval is equal to the level this interval plus some random variation generated by the disturbance term, Eta. The relationship shown is also known as a random walk. Eta is assumed to be normally distributed with mean zero. Notice that, if the variance of Eta is zero, then the level of the series is static; it’s just a flat line. What happens to the Level component of the model as the variance of Eta increases from zero?
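One way to build intuition for this question is to simulate the State equation directly. The sketch below is a toy simulation, not a fitted model: it applies the rule that next interval's level equals this interval's level plus a normally distributed disturbance, once with a zero standard deviation and once with a positive one. The starting level, the standard deviation and the function name are assumptions made for illustration.

```python
import random


def simulate_level(start, eta_sd, n, seed=1):
    """Simulate n intervals of a random-walk Level.

    level[t+1] = level[t] + eta, with eta ~ N(0, eta_sd**2).
    """
    rng = random.Random(seed)
    level = [start]
    for _ in range(n - 1):
        level.append(level[-1] + rng.gauss(0.0, eta_sd))
    return level


# Variance of Eta is zero: the Level is a flat line at its starting value.
flat = simulate_level(100.0, eta_sd=0.0, n=10)

# Variance of Eta is positive: the Level wanders from interval to interval.
wandering = simulate_level(100.0, eta_sd=2.0, n=10)
```

The larger the disturbance variance, the more freely the simulated Level moves between intervals; at zero it cannot move at all.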
Before answering this question, let’s look at the data we’ll use in the next demonstration. The plot shows weekly observations on Sales that start the first week of February 2017 and run through the last week of May 2018. The input, promo, is a binary variable that flags three weeks in the data.

The model described above can be fit to the data using the SSM Procedure (SAS/ETS) as follows.

The procedure statement lists the input table that contains the Sales and Promo variables.

The ID statement identifies the time ID variable, Date, and specifies the desired interval of the data.

The TREND statement creates the Level component in the model; a random walk (rw) component named Level is specified. The LEVELVAR=0 option restricts the variance of the disturbance term on the State equation, named Eta above, to be zero. This restricts the Level component to be a static, best-fit-to-the-data flat line in the model.

The IRREGULAR statement creates the white noise error term on the Observation equation, named Epsilon above, with no restrictions on its variance.

The MODEL statement specifies the Observation equation.

OUTPUT generates the table containing forecasts, standard errors, CI and so on for Sales and the model’s components.

Note that the TREND statement provides a convenient way to specify commonly used State elements. We’ll spend time discussing the specification of other State elements in subsequent posts in this series.

Both estimates listed below are significantly different from zero, and both come from the Observation equation; the Model Parameter estimate represents the variance of the white noise error term, and the regression parameter estimate is the effect of the promotions input on sales. Fit statistics provide a reference for subsequent model refinements.
The plot shows the best-fit static representation of the Level component (red line) along with the combined components (Level, Promo, WN) forecast for Sales (black line). Sales actuals are the dots. The forecast suggests that the model is too simple; the forecasts miss systematically high through late 2017 and then systematically low to about April 2018.

Now, let’s generalize the model so that Level can be included as a dynamic component by commenting out the LEVELVAR=0 option.

There are now two estimated Model Parameters: the estimated variance of the Observation equation error, WN, and the estimated variance of Eta, the State equation disturbance. Both are significant using a t Value > 2 rule of thumb. Note that an insignificant variance estimate for Level would not indicate that the Level is indistinguishable from zero. It would indicate that the Level component in the model is likely static, and that the previous model is appropriate. Fit statistics improved relative to the model with the static Level component.

The plot of the forecast for the Level component shows its trajectory over time. The plot below shows the forecast of the Level component (red line) along with the combined components (Level, Promo, WN) forecast for Sales (black dashed line) overlaid on the Sales actuals.

The goal of this post was to introduce SSMs to applied analysts and to highlight some of their main features. We described how the components of systematic variation in the data can be modeled individually and then combined to create a forecast for the dependent variable. A simple example using a dynamic level and a static regressor highlighted the SSM’s flexibility in accommodating both static and dynamic components and then illustrated how the estimated components in SSMs can be assessed to provide insight into how these components evolve over time.
In this series’ next post, we’ll describe how to add a seasonal component that evolves as a function of time to the model, so stay tuned for more SSM action!

chwell · 12-18-2024

This post describes the essentials of how ARIMAX models work and illustrates how to interpret their main components. The intention is to help analysts better understand their project’s generated models so they can effectively communicate results and make informed choices in setting forecast model related options.

This post is the second in a series. The first post, link below, defined what a transfer function is, described how numerator orders are specified and interpreted, and also introduced the error series component of the model via auto-regressive orders. In this post, we’ll describe what orders of integration mean, how denominator orders work in a transfer function, and consider the role of moving average terms in the error series model. Subsequent posts in this series focus on additional diagnostics that augment and extend the interpretability of models generated by a SAS Visual Forecasting project.

Interpreting ARIMAX Models, Part 1

Denominator orders in a transfer function

As we described in the previous post, ARIMAX models quantify the relationship between an input and the dependent variable through a mechanism called a transfer function. Transfer functions consist of numerator and denominator orders. The numerator order approach for accommodating a relationship between an input and the target is to add some combination of current and lagged values of the input variable to the specification. Denominator orders in a transfer function do effectively the same thing, but they do it in a more parsimonious and restrictive way.

Let’s start by considering a transfer function with a numerator order 0 and denominator order 1.

$$y_t = \frac{8}{1 - 0.5B}\, x_t$$

The y variable is the target at time t, and x represents an input at time t. At first glance, it looks like we’ve specified a simple, but weird looking, contemporaneous relationship between y and x.
Denominator orders capture long lag, ‘dynamic’ relationships between the target and input, and the backshift operator is the key to understanding how they work. In the equation, B denotes the backshift operator, and here’s what it does: backshift operating on a variable at time t shifts it back one time interval. For example,

$$B x_t = x_{t-1}$$

To make things more straightforward, multiply both sides of the equation above by the denominator and rearrange, which gives us the following.

$$y_t = 0.5\, y_{t-1} + 8\, x_t$$

To illustrate how denominator orders work, we’ll assume the relationship between x and y is a pulse/response. As shown below, the steady state value for x is 0. It switches, or pulses, to one for one time interval, t=3. The equation above determines the response of y.

Note that we could write the denominator order relationship equivalently as a large order numerator component, as follows.

$$y_t = 8\, x_t + 4\, x_{t-1} + 2\, x_{t-2} + 1\, x_{t-3} + 0.5\, x_{t-4} + \cdots$$

If numerator and denominator orders are equivalent, why do we bother with having both? Note that in the denominator order specification, there are only two values that would need to be derived or estimated: the numerator order 0 parameter (here, 8), and the denominator order 1 parameter (here, 0.5). The numerator order representation of the same relationship has a lot more parameters. So, denominator orders can represent long lag relationships between an x and y relatively parsimoniously.

However, denominator orders represent the relationship between x and y in a restrictive way. The relationship is plotted below. Denominator orders are primarily useful when the response of the target variable, y, looks like a jump with decay back to a steady state level.

Modeling hurricane effects on oil production in the Gulf of Mexico is one example of where denominator orders are useful. In the month the hurricane hits, oil production jumps down. In the months following, repairs are made, rigs come back on-line, and oil production gradually converges back to its steady state, pre-event level.
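The jump-with-decay response is easy to trace numerically. The sketch below uses the example values from this section (numerator order 0 parameter 8, denominator order 1 parameter 0.5, a pulse in x at t=3) and assumes the rearranged relationship is y at time t equals 0.5 times y at time t-1 plus 8 times x at time t, with y starting from a steady state of zero. The function names are hypothetical, and the code illustrates the arithmetic only, not how any forecasting software estimates these parameters.

```python
def transfer_response(x, num0, den1):
    """Response of y under y[t] = den1 * y[t-1] + num0 * x[t], starting from y = 0."""
    y = []
    previous = 0.0
    for x_t in x:
        previous = den1 * previous + num0 * x_t
        y.append(previous)
    return y


def long_lag_response(x, weights):
    """Same response written as a long numerator: y[t] = sum_k weights[k] * x[t-k]."""
    return [
        sum(w * x[t - k] for k, w in enumerate(weights) if t - k >= 0)
        for t in range(len(x))
    ]


# x is at its steady state of 0 and pulses to 1 for one interval, t=3.
x = [0, 0, 1, 0, 0, 0, 0]

# Two parameters are enough in the denominator order form.
y = transfer_response(x, num0=8.0, den1=0.5)

# The equivalent numerator form needs one weight per lag: 8, 4, 2, 1, 0.5, ...
weights = [8.0 * 0.5 ** k for k in range(len(x))]
y_long = long_lag_response(x, weights)
```

Both lists hold the same response: zero until the pulse, a jump to 8 at t=3, then 4, 2, 1 and 0.5 as y decays back toward its steady state. The denominator form gets there with two numbers, while the numerator form needs a weight for every lag.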
Larger hurricanes tend to have longer lasting effects on oil production than smaller ones, and the length of effect is regulated by both the magnitude of the initial impact, quantified by the NUM0 parameter, and the value of the DEN1 parameter. Note that the value of the denominator order parameter must be less than one in absolute value.

When would you not want to use a denominator order in a transfer function to model a relationship between y and x? Consider the following response pattern in y; there’s a build in the correlation pattern as well as a gap. The pattern of jump with decay would not be a good approximation for this relationship. Since numerator orders in a transfer function have a separate parameter for each lag of correlation persistence, they are more flexible and would be preferred to model the relationship pictured in the plot.

The highlighted row below shows how a model with the Price input entering as NUM=0 and DEN=1 is represented in the software. The parameter estimates table shows both estimated parameters for the Price effect. Note, the NUM0 term with parameter estimate 12.1 is listed as SCALE.

Orders of integration, or the I in ARIMA

Non-stationarity means that the parameters that describe the time series are a function of time. Non-stationary variation causes problems with the correct specification and estimation of the ARMA and transfer function parts of the model if it is not handled appropriately. We’ll begin with an informal definition of what non-stationary variation is, and then describe a standard approach for handling it.

To introduce this concept, we’ll break the data up into chunks. For the Toothpaste series, averages calculated in the two bracketed chunks are not substantially different from each other or from the overall series average. It looks like the series mean is not changing much as time increases, and that the data is probably stationary.
On the other hand, the series average for Passengers in the most recent chunk of data is substantially different from the average in the first chunk. It’s likely that the series mean is a function of time, and that the data is non-stationary.

Differencing is the most widely used approach, in the context of ARIMAX models, for handling non-stationary variation in time series data. Differencing can transform non-stationary data into stationary data. A first difference, denoted d subscript t below, usually suffices to remove non-stationary trend from the data. The plot below shows the first differenced, de-trended passengers series.

If a first difference is used to transform the data from non-stationary to stationary, then the order of integration (the I in ARIMAX) of the model is 1. Integration is just a refined way to say, ‘add it back up’, and it describes what happens after the data is differenced and the ARIMAX model is fit. If the data has trend, we want that trend to be represented in our forecast. The AR, MA and X components of the model are fit on the differenced, stationary data, so, initially, the predictions are on the differenced scale. The difference is then undone on the predictions; they are ‘added up’ or integrated to get the trend component into the final forecast.

Readers with a time series background may be thinking that the first difference removed the trend from the data, but there’s still a pretty obvious seasonal pattern: the first differenced data is still non-stationary! Seasonal patterns can be handled with a difference too, but in the seasonal case we’ll use a seasonal span difference. For monthly data, this usually implies a 12-span difference, shown below. For data with trend and seasonality, we can apply a first and a seasonal span difference to transform it to stationarity. In fact, there’s a classic Box-Jenkins Airline model for the Passengers data that includes a first and seasonal span difference.
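The first and seasonal span differences described above can be sketched in a DATA step with the DIF family of functions. This example assumes the SASHELP.AIR sample table (monthly, with variables DATE and AIR) is available and stands in for the Passengers series; the output table and new variable names are hypothetical.

```sas
/* First and 12-span differences of the airline passengers series.  */
/* Assumes SASHELP.AIR (variables DATE and AIR) is available.       */
/* WORK.AIR_DIFF, D1, D12 and D1_12 are names made up for this      */
/* illustration.                                                    */
data work.air_diff;
   set sashelp.air;
   d1    = dif(air);          /* first difference: air(t) - air(t-1)     */
   d12   = dif12(air);        /* seasonal difference: air(t) - air(t-12) */
   d1_12 = dif12(dif(air));   /* first and 12-span differences together  */
run;
```

The leading observations of each differenced column are missing because there is no earlier value to subtract. Integration reverses these steps on the predictions, adding the differences back up so that the trend and seasonal pattern reappear in the final forecast.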
The highlighted row shows how the classic Airline model is represented in the software. The specification contains a first and seasonal (12) span difference, denoted with a D. The specification also contains moving average terms, denoted Q, at lags 1 and 12. The subscript s indicates a seasonal lag. We’ll conclude our discussion of basic ARIMAX model interpretation with moving average terms, next. The final forecast from the Airline model, pictured below, illustrates how integrating, or undoing the first and 12 span differences, adds the trend and seasonal components to the predictions.

Moving average terms

If the dependent variable y is a stationary series, we can model it with a mixture of autoregressive (AR) and moving average (MA) terms, as shown below. Epsilon at time t represents the error, and epsilon at t-1 is a realization of the (white) noise error process. y could represent the residuals from a transfer function model, and in this context the equation represents the error series model described in the previous post. Having both AR and MA terms allows us to capture the signal in the stationary data in a parsimonious and flexible way. MA and AR terms perform similar roles in the model. From an interpretability point of view, that’s the necessary information to describe them. Further, optional, details are provided next for interested readers.

Using the backshift operator, we can rewrite the equation above with all variables at the current time, t. Doing some rearranging shows that the ARMA model is a ratio of polynomials in the backshift operator. This looks a lot like a transfer function with NUM and DEN orders. In fact, it is! The ARMA model is a transfer function with a white noise input. An AR1 term behaves identically to the DEN1 term described above. Numerator orders are the same as MA orders. MA terms are useful for capturing choppy and irregularly shaped signal or memory in the stationary data.
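The rearrangement described above can be sketched for a model with one AR and one MA term. The original equations were shown as images, so the symbols φ (the AR1 parameter) and θ (the MA1 parameter), and the negative sign placed on the MA term, are assumptions made here for illustration.

```latex
\begin{aligned}
y_t &= \phi\, y_{t-1} + \varepsilon_t - \theta\, \varepsilon_{t-1} \\
(1 - \phi B)\, y_t &= (1 - \theta B)\, \varepsilon_t \\
y_t &= \frac{1 - \theta B}{1 - \phi B}\, \varepsilon_t
\end{aligned}
```

Read as a transfer function, the white noise ε takes the place of the input x, the MA polynomial sits in the numerator, and the AR polynomial sits in the denominator.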
AR terms are more parsimonious and are useful for capturing signal with the jump-with-decay pattern. I hope you got some value from this post. Stay tuned for the next one where we’ll start discussing and adding some custom interpretability diagnostics to the software. Find more articles from SAS Global Enablement and Learning here.

chwell · ‎10-07-2024

This post describes the essentials of how ARIMAX models work and illustrates how to interpret their interpretable parts. The intention is to help analysts better understand their project’s generated models so they can effectively communicate results and make informed choices in setting forecast model related options. This post is the first in a series. In this post and the next one, we’ll focus on interpreting ARIMAX models with an emphasis on automatically generated model results you’ll see in a SAS Visual Forecasting project. Subsequent posts in this series focus on additional diagnostics that augment and extend the interpretability of models generated in a SAS Visual Forecasting project.

The ARIMAX model captures systematic variation in a dependent variable, and it’s useful to think about the systematic variation, or signal, in a time series in terms of components. For example, if a time series exhibits a seasonal pattern, one component of a reasonable forecast model controls for and extrapolates the seasonal component of variation. The focus of this series is on the more interpretable parts of ARIMAX models, and we’ll spend most of our time on components related to capturing systematic variation attributed to input variables. The input variable component is the X (for eXogenous) in ARIMAX.

ARIMAX models quantify the relationship between an input and the dependent variable through a mechanism called a transfer function, which sounds involved and tricky, but it’s not. If you’ve ever specified, fit and interpreted a linear regression model, you’ve modeled and explained a transfer function. The transfer function transfers (or transforms or filters) variation in the input into variation in the dependent variable. For simple models, one number, the estimated parameter, suffices to describe the relationship between the input and target.
In ARIMAX models, variation in an input this period can impact the dependent variable this period and in subsequent periods, so a more general representation is needed to adequately model the relationship. We’ll begin with components that are called numerator orders in a transfer function.

Numerator order 0

Let’s start by considering a simple regression model in the context of time series. Here, sales at time t are a function of the input price at time t plus an intercept term. An error term, delta, representing unexplained variation or noise is also added. A useful way to think of the error at time t is that it’s what’s left over after subtracting the intercept and the effect of price from sales at time t. Fitting this model to data yields the following.

One tip for interpreting time series regression models is to keep an eye on the time subscripts. Note that the t subscripts on the sales and price variables are the same. This model tells us that on average, when price increases by $1.00 in interval t, units of sales decrease by 1.2 in the same interval, and that’s the end of the story; changes in price this month, for example, don’t impact units of sales in following months. In a time series context, what we’ve described is contemporaneous, or lag 0, correlation. This estimated relationship is represented in the following plot.

Price and sales may be varying continuously, but to interpret this plot it helps to think about the relationship between price and sales as a pulse and response. Imagine price having a steady state level, but, every once in a while, it pulses or increases by $1.00. The plot represents the response or deviation (-1.2 units) of sales from its own steady state level to the unit pulse in price.
While this plot is not currently produced by default in SAS Visual Forecasting, we introduce it here to build intuition and will use similar plots to extend model interpretability diagnostics in subsequent posts in this series. This model, REGARIMA1_12, is listed in a Visual Forecasting project on the highlighted row below. The CONST term indicates that the model has an intercept. Note, the numbers in the model name are for internal bookkeeping and don’t tell you anything about what’s in the model specification.

Numerator (0), Shift (1)

Next, let’s consider a situation where changes in price this interval don’t impact sales until the following interval. This could result from contracts that are in place, menu costs and so on. The specification representing this relationship would look like the following (keep an eye on the time subscripts). Fitting this model yields the following. This model tells us that, on average, when price increases by $1.00 in interval t, units of sales don’t change in the current interval, but they decrease by 2 units in the next interval. In a time series context, what we’ve described is a pure delay or shift 1 correlation. This estimated relationship is represented in the following plot. This model, REGARIMA1_134, is listed in a Visual Forecasting project on the highlighted row below.

Numerator (0, 1, 2)

Now, let’s consider a situation where an increase in price this month impacts sales this month and in the following two months. The specification representing this relationship is below. Fitting this model yields a specification that looks like the following (see the note). This model tells us that on average, when price increases by $1.00 in a given month, units of sales drop by 1.2 in that month, then drop 0.9 in the month following, and finally decline 0.5 two months later. The total impact on sales of the $1.00 increase in price is -2.6.
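The fitted specifications for the three numerator-order models appeared as images in the original post. Based only on the effects quoted in the text, they can be sketched as follows. The intercept is written generically as μ because its estimated values are not given, and δ is the error term; both symbols are labels chosen here for illustration.

```latex
\begin{aligned}
\text{Numerator (0):} \quad & \text{Sales}_t = \mu - 1.2\,\text{Price}_t + \delta_t \\
\text{Numerator (0), Shift (1):} \quad & \text{Sales}_t = \mu - 2\,\text{Price}_{t-1} + \delta_t \\
\text{Numerator (0, 1, 2):} \quad & \text{Sales}_t = \mu - 1.2\,\text{Price}_t - 0.9\,\text{Price}_{t-1} - 0.5\,\text{Price}_{t-2} + \delta_t
\end{aligned}
```

In the last model the three price effects sum to -1.2 - 0.9 - 0.5 = -2.6, the total impact quoted above. If the software writes the numerator polynomial as (ω0 - ω1 B - ω2 B²), which is an assumption consistent with the sign note in this post, the listed estimates would be ω0 = -1.2, ω1 = 0.9 and ω2 = 0.5.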
Note, the signs on the lag 1 (t-1) and lag 2 (t-2) price parameters in the specifications are correct for the described relationship. Numerator parameters after the first parameter are specified with a negative sign in front of them. You must interpret the listed estimates keeping the implied negative sign in mind. This estimated relationship is represented in the following plot. This model, REGARIMA1_135, is listed in a Visual Forecasting project on the highlighted row below.

Numerator (0, 1) AR (1)

Up to this point, the error term has represented random variation or noise, but this doesn’t need to be the case. There can be systematic patterns, or memory, in the time series regression model’s residuals. The error term, in this context the error series model, shown below, captures systematic variation in the dependent variable that is not attributed to variation in price with an auto-regressive order 1 or AR(1) term. The lag 1 delta term captures the memory, and the epsilon term represents the noise. More detail on AR processes is provided below. Fitting this model yields a specification that looks like the following. The relationship between price and sales is given by their estimated transfer function. This model, REGARIMA1_16, is listed in a Visual Forecasting project on the highlighted row below. Note, auto-regressive orders are denoted with a P.

More details on the error series model

One useful feature of ARIMAX models is that they can capture or accommodate all of the signal contained in the dependent variable’s variation. The input terms capture the impact of sources of systematic variation that are well measured and known via the transfer function. Other systematic patterns in the dependent variable can be attributed to sources like competitor activities, external events and policy changes that are not well measured and understood.
The impact of some of these sources of variation can be captured through auto-regressive (AR) and moving average (MA) terms in the model and extrapolated into the future. You can think of ARMA terms as an abstract representation of underlying sources of systematic variation that we can’t measure or are too expensive to measure and create transfer function features for. The variation that ARMA terms approximate needs to be stationary. This means that these signal or memory components are mean reverting and implies that they are fairly short lived.

ARMA terms are not particularly interpretable; however, they can improve the model’s forecasts, and they influence model interpretability. Diagnostics and statistical tests that are based on distributional properties assume that the model’s residuals are (white) noise. If we had fit the REGARIMA1_16 model without the AR(1) component, this assumption would have been violated; there would have been a pattern in the unexplained variation in the model’s residuals. In addition to this signal component being missing from the forecast, omitting the AR(1) term means that estimated standard errors and related statistics like p-values on transfer function terms would be suspect.

More details on how ARIMAX models are generated in SAS Visual Forecasting

When candidate input variables are available, the software will build the ARIMAX model two ways for each series. For the REGARIMA models, the transfer function is fit first and then the ARMA model is fit on this model’s residuals. Generated ARIMAX models do the process in reverse. The ARIMA model is identified and fit on the dependent variable series, and then the transfer function is identified and fit on this model’s residuals. In applications, the order in which these components are specified and fit can matter in terms of overall fit or validation performance. The screen shot below shows both of the generated ARIMAX models for a single time series in a project.
Both models contain the price input variable, but the two specifications are different in terms of the ARIMA components. In this case, the regression first model had a better overall fit to the data and was selected as champion.

Conclusion

There are more useful and interesting details about ARIMAX models to discuss and questions to answer. You may be wondering: what’s the deal with the I in ARIMAX? I’ve heard that transfer functions also contain denominator orders, so what are they good for? If AR and MA terms capture the same type of systematic variation, why do we need both? These questions will be answered in part 2 of this series, so stay tuned!

chwell · ‎06-28-2024

SAS Visual Forecasting (VF) is an automated, large-scale forecasting solution. It can automatically generate time series models, select a champion model for each series and then generate forecasts at scale. Users can generate good forecasts for hundreds of thousands of time series using established best practices by simply providing the software some information about the data and then running their project. However, there is a lot of functionality in SAS Visual Forecasting that is not turned on by default. This additional functionality is useful for modifying and refining the automated forecasting system, and it will be the focus of this series of blogs.

The purpose of this series of blogs is to introduce non-default SAS VF functionality in the context of Model Studio projects. After the initial project is set up and run, analysts begin looking for ways to improve forecast precision. Our focus will be on introducing and describing VF functionality that enables analysts to leverage their knowledge of the data into the algorithms that do the model generation, selection and forecasting to improve overall forecast precision. Event variables are the first addition to the default functionality we’ll discuss.

This series of blogs will assume readers have some basic knowledge about how Model Studio projects and pipelines are created and run. Readers new to VF projects can find some foundational background here: https://video.sas.com/category/videos/sas-forecasting

SAS Education also offers a class that covers all the fundamentals, and you can sign up for the course here: https://learn.sas.com/course/view.php?id=562

What do event variables do and why are they useful?

Event variables are inputs or features in time series models, and they are used to capture variation associated with events. Events manifest themselves in shocks or ‘bangs’ in the data. Consider the following simple example where a one interval shock impacts the time series at time T*.
There’s a large miss or residual on the date that the event occurs. This large miss can also bias the prediction or best fit line going forward.

Typically, event variables play the role of intercept shifters in the model, and they work a lot like a light switch. Event variables are commonly coded as a column of zeros and ones in the data. The D variable shown flags the date of the event with a one. Since the event only persists for one interval, all other dates have an associated value of zero. Adding the event variable D to the model allows the intercept to shift up and down as a function of the date or index. At T*, the intercept is equal to the sum of mu and delta (to be consistent with the picture, the delta parameter would be a negative number), and at all other dates the intercept is mu. Accommodating the event related variation with an event variable results in a much smaller residual and a less biased forecast going forward. Accommodating the effects of longer-lived events can be accomplished by changing the definition of the event variable, that is, by modifying the column of zeros and ones.

Event Variables in a Large-Scale Forecasting Project

Events like regulatory, policy and other structural changes in an industry can impact the majority of series in a forecasting project, and the same event can have different effects on different subsets of series. In a manufacturing context, subsets of series react differently to an event due to the mix of inputs required, supply chain dependencies, and so on. Event variables were introduced above in the context of a single series; the data was visually assessed, and an appropriate event variable was created. In a large-scale forecasting project, there’s not enough time and resources to visually assess each series and manually create appropriate event variables for each one.
The goal now is to add event variables to a project to improve the performance of the generated models, but we still want the automated algorithms to do the majority of the work.

The COVID pandemic provides a recent example to motivate the usefulness of event variables in a large-scale forecasting project. The pandemic started in March 2020. In 2023, a large manufacturer considered the impact of the pandemic on the sales of goods it produces. For goods that are simple to produce and that are not distributed internationally, the pandemic effect lasted about a year. For more complex goods that require multiple stages of production and international distribution, the pandemic effect tended to last longer. To successfully capture the COVID effect on affected series, several COVID event variables with different lengths of persistence were created. The event variables were introduced into the forecasting project as candidate input variables. Algorithms picked the best fit representation of the COVID event for each individual series in the model generation process. For series that weren’t substantially impacted by COVID, the event variables were ignored. The rest of this blog outlines the syntax and steps that were followed to implement this strategy.

Creating a library of event variables

There are a few ways to create a library of event variables that can be used in a VF forecasting project. For this example, we’ll start with a SAS 9, Forecast Server procedure: HPFEVENTS. In the syntax shown below, the EVENTDEF statement defines new event variables. The first EVENTDEF statement names the event variable COVID_9 and sets it equal to the date of initial occurrence. In SAS, the / means options to follow. The LS type defines the event variable as a level shift or step. So far, we have a permanent step event variable that switches from zero to one in March 2020. The AFTER option truncates the step 9 intervals after the initial interval.
Applied to monthly data, COVID_9 is an event variable that switches from 0 to 1 in March 2020. Its value stays at 1 until December 2020, and it switches back to 0 in January 2021. Event variables live in SAS data sets in SAS 9. The EVENTDATA statement writes the defined event variables to the EVENTDAT table in the LOCAL library.

    proc hpfevents;
       /* level shifts that start in March 2020 and end after 9 to 24 intervals */
       eventdef covid_9  = '01MAR2020'd / type=ls after=(duration=9);
       eventdef covid_12 = '01MAR2020'd / type=ls after=(duration=12);
       eventdef covid_16 = '01MAR2020'd / type=ls after=(duration=16);
       eventdef covid_20 = '01MAR2020'd / type=ls after=(duration=20);
       eventdef covid_24 = '01MAR2020'd / type=ls after=(duration=24);
       /* store the event definitions in a SAS data set */
       eventdata out=local.eventdat;
    run;

This syntax and description provide a brief introduction to the HPFEVENTS functionality. Further details can be found in the SAS Forecast Server Procedures User's Guide. Documentation can be accessed at https://support.sas.com/en/documentation.html

Because we want to use the EVENTDAT table as a library of event variables in a SAS VF project, it needs to be loaded into memory and then promoted. In the syntax shown below:

• The CAS statement creates a connection to a CAS session.
• The CASLIB statement lists and enables us to access available CAS libraries.
• The DATA step makes an in-memory copy of the SAS 9 EVENTDAT table and loads it into the PUBLIC CASLIB.
• The CASUTIL procedure is used to promote the EVENTDAT table. Promotion of an in-memory or CAS table makes it globally accessible to other in-memory tools on the platform.

    cas;
    caslib _all_ assign;

    /* copy the SAS 9 table into the PUBLIC caslib */
    data public.eventdat;
       set local.eventdat;
    run;

    /* promote the in-memory table so other sessions can use it */
    proc casutil incaslib='public';
       promote casdata='eventdat';
    run;

A portion of the EVENTDAT table is shown. It’s important to note that the COVID related event variables are not columns of zeros and ones. The event variables we created in HPFEVENTS are rules for creating columns of zeros and ones that can be applied to time series of any length or interval.
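To picture what the COVID_9 rule produces when it is applied to monthly data, the following DATA step sketch builds the corresponding column of zeros and ones by hand. It only illustrates the pattern described above (1 from March 2020 through December 2020, otherwise 0). It is not how HPFEVENTS or Visual Forecasting store or apply the event, and the table name and date range are hypothetical.

```sas
/* Sketch: expand the COVID_9 rule into a 0/1 column on a monthly  */
/* index. WORK.COVID9_DUMMY and the date range are hypothetical.   */
data work.covid9_dummy;
   do i = 0 to 17;
      /* first day of each month, January 2020 through June 2021   */
      date = intnx('month', '01JAN2020'd, i);
      /* 1 from March 2020 through December 2020, otherwise 0      */
      covid_9 = ('01MAR2020'd <= date <= '01DEC2020'd);
      output;
   end;
   format date monyy7.;
   drop i;
run;
```

The same rule applied to weekly or quarterly data would give a column of a different length and shape, which is why storing the definition rather than the column keeps the event portable.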
This library of event variables is portable across VF projects.

Using a library of event variables in a SAS VF project

Now, we’ll load the in-memory EVENTDAT table into an existing VF project in Model Studio. The VF project xxx_JUN24 was created and run under default settings. The fit measures below provide a baseline. They are aggregated measures associated with automatically generated and selected forecast models. Results correspond to series that represent sales of manufactured goods in a given hierarchy level of a manufacturing dataset.

To bring our library of candidate event variables into the project, we’ll navigate to the Data tab of the project and select the New Data Source menu, and then Events. Because the EVENTDAT table was loaded into memory and promoted in previous steps, it’s listed as an in-memory table under the Available tab.

Once the event variables are loaded into the project, we’ll change their usage status to Try to use. This tells the model generation algorithms to handle the event variables as candidate input variables; if one or more event variables improve the fit of a model, they will be selected as an input. If not, they will be ignored.

After re-running the project, the aggregated fit measures associated with the champion models have improved. About 10% of the generated forecast models at this level of the data hierarchy contain at least one of the candidate event variables.

While the overall fit improved, there’s potential for further improvement by refining our definitions of the COVID effects. The event variables created in HPFEVENTS characterized the COVID effect with an abrupt shift down in the intercept followed by an abrupt shift back up to the pre-COVID status quo at the end of the defined duration. A more reasonable characterization is an abrupt step down followed by a gradual transition back to the pre-COVID status quo for most of the series.
Multiple intercept shifts can be accommodated by defining additional event variables with different start dates. A RAMP event type is also available to capture more gradual transition patterns. Hopefully, this blog has provided a straightforward example of how you can use event variables to leverage your knowledge of the business and its data into VF projects and improve the precision of your forecasts.

chwell · ‎03-15-2024

The purpose of this post is to explain and demonstrate the usefulness of the SAS Function Compiler procedure (FCMP, BASE/SAS) and related functionality in the context of processing timeseries data. The FCMP procedure lets you create and store SAS functions and subroutines. The previous posts in this series explained the fundamental ideas of processing timeseries as arrays and introduced some tools, syntax and data that will reappear here. If you haven’t had a chance to read them, links are below.

https://communities.sas.com/t5/SAS-Communities-Library/Data-Step-for-Timeseries-part-1-Overview/ta-p/878419
https://communities.sas.com/t5/SAS-Communities-Library/Data-Step-for-Timeseries-part-2-BY-Group-Processing/ta-p/908380

Functions and subroutines are useful for implementing blocks of code that can be reused. They are similar in purpose to SAS Macros. The difference between functions and subroutines can be confusing. Functions return a single value based on input arguments, so they aren’t suited for array operations. However, functions are useful in pretty much any programming scenario, and they provide a straightforward introduction to the FCMP procedure. Subroutines can return multiple values, in this context arrays, based on an input array. We’ll illustrate the usefulness of subroutines by converting some array processing code that was presented in the previous post into a compact, reusable block that can be stored and called during processing. The FCMP procedure functionality is available in both SAS 9 and SAS Viya. The final demonstration in this post shows how a subroutine can be compiled and then called in CAS using the TSMODEL procedure.

Demonstration 1, creating and using a custom function

SAS provides many pre-defined functions, and you’re probably already familiar with several focused on timeseries processing; HOLIDAY, INTNX and INTCK are some popular examples.
Sooner or later, programmers will run into a situation where a function would be useful for accomplishing a programming task, but a predefined one doesn’t exist. We’ll start with a user defined function that converts prices denominated in British pounds (GBP) into dollars (USD) using a specified exchange rate. First, we’ll need a pointer to a storage location for the custom function. The compile library or CMPLIB option specifies that new functions can be compiled to the WORK library in a table named TIMEFNC.

    options cmplib = work.timefnc;

The new function is created in this call to the FCMP procedure.

    proc fcmp outlib=work.timefnc.fxfunc;
       function exchg_convert(pounds, rate);
          dollars = (rate)*pounds;
          return(dollars);
       endsub;
    run;

The FCMP procedure statement specifies the output location using a three-level name. FXFUNC is the package name that will be used to store the function in the compile library. The FUNCTION statement names the function, here EXCHG_CONVERT, and lists the required arguments. The listed arguments are placeholder names for a value in GBP and an exchange RATE. DATA step-like syntax creates the output value, DOLLARS, as a function of the input arguments. The RETURN statement lists the value to be returned by the function. ENDSUB marks the end of function creation.

It’s useful to understand how functions are compiled and stored, and the portion of the TIMEFNC table shown confirms the execution of this process. A compile library can contain multiple packages that contain a variety of functions and subroutines.

Now that the EXCHG_CONVERT function is ready for use, we’ll call it in a DATA step. Because the RATE argument will change over time, we’ve coded it using a macro variable to make the process of updating easier and less error prone.
    %let gbprate=1.2;

    data work.dollars;
       set work.items_in_GBP;
       format USD_price dollar10.2;
       USD_price=exchg_convert(GBP_price, &gbprate);
    run;

The DOLLARS table contains the results of running the EXCHG_CONVERT function. Recall that a function returns a value. There are multiple values of USD_PRICE in the table, but these were generated line by line, or from a sequence of function calls, as the DATA step ran.

Demonstration 2, creating and calling a subroutine in SAS 9

In the previous post in this series, a new array (feature) that flags the week that contains Easter Sunday was created for a project. Since it’s reasonable to expect that other projects with week interval data may also have an Easter effect, we’ll convert that syntax into a reusable block of code using a subroutine. Subroutines can be created and called in both SAS 9 and SAS Viya. Since the syntax is not identical, we’ll focus on SAS 9 for this demonstration, and then cover the SAS Viya process in the next one.

*Bonus extra credit challenge; generalize the definition of the subroutine below so that it can create Easter based on common date interval (DAY, WEEK, MONTH or QUARTER) input arrays.

    proc fcmp outlib=work.timefnc.evntfncs;
       subroutine Easter_evnt(DateID[*], easter[*], yr[*]);
          outargs easter, yr;
          actlen = DIM(DateID);
          do i = 1 to actlen;
             yr[i]=year(DateID[i]);
             easter[i] = (week(DateID[i])=week(holiday('EASTER', yr[i])));
          end;
       endsub;
    run;

The FCMP procedure statement references the compile library created in the first demonstration. The new package name for this subroutine is EVNTFNCS. The SUBROUTINE statement specifies the subroutine name, EASTER_EVNT. Placeholder names for required input and output arrays are listed. The OUTARGS statement lists placeholder names for the output arrays that will be produced. This statement replaces the RETURN statement used in the function creation example above. The DO block contains the array processing steps.
This syntax will produce two new arrays, EASTER and YR, using a date ID array as input. Predefined SAS functions, WEEK, YEAR and HOLIDAY, are used in the processing.

The previous post showed identical syntax in a SUBMIT block in the TSMODEL procedure. The syntax in the SUBMIT block was local to that call of TSMODEL. This subroutine will be available to any SAS functionality that accommodates it. In this example, we’ll call the subroutine in the TIMEDATA procedure. Recall that this SAS 9 procedure is tuned for processing timeseries as arrays.

In addition to the DATA step and TIMEDATA, other SAS 9 functionality accommodates functions and subroutines created in the FCMP procedure. See the Base SAS Procedures Guide for more details on the FCMP procedure: https://support.sas.com/documentation/cdl/en/proc/65145/PDF/default/proc.pdf

    proc timedata data=work.wineco_sorted outarray=winecoarrays print=(arrays);
       id date interval=week;
       by region type;
       var sales / accumulate=total;
       var baseprice promotion / accumulate=average;
       outarrays easter yr;
       call Easter_evnt(date, easter, yr);
       title "Create a feature for each BY group that flags Easter";
    run;

The CALL statement references the EASTER_EVNT subroutine and produces two new arrays, EASTER and YR, for each of the 16 BY groups in this level of the data. Other TIMEDATA syntax is described in the first and second posts in this series.

The OUTARRAY table, WINECOARRAYS, contains an EASTER event variable. The portion shown below is for REGION1, Table Red (TBLRE) type wines.

Demonstration 3, creating and calling a subroutine in SAS Viya

Conceptually, this demonstration is the same as the previous one: subroutine syntax is created, compiled and then called to create the EASTER and YR arrays for the 16 BY groups in the WINECO data. Here, the data has been loaded into memory, and the subroutine syntax will be compiled and called in a SUBMIT block in the TSMODEL procedure. This demonstration follows a TSMODEL procedure documentation example.
See: https://go.documentation.sas.com/doc/en/pgmsascdc/default/casforecast/casforecast_tsmodel_examples02.htm

First, the subroutine code is put into a SAS macro named FCMPCODE.

    %macro fcmpcode;
       subroutine Easter_evnt(DateID[*], easter[*], yr[*]);
          outargs easter, yr;
          actlen = DIM(DateID);
          do i = 1 to actlen;
             yr[i]=year(DateID[i]);
             easter[i] = (week(DateID[i])=week(holiday('EASTER', yr[i])));
          end;
       endsub;
    %mend;

Next, the EASTER_EVNT subroutine syntax is inserted into a TSMODEL procedure SUBMIT block with a call to the FCMPCODE macro. Once the code is substituted in, it is compiled and then called within the SUBMIT block.

    proc tsmodel data = mylib.wineco
                 outarray=mylib.regtypeseries
                 outsum=mylib.regtypesum;
       by region type;
       id date interval=week;
       var sales /acc = sum;
       var baseprice promotion/acc = avg;
       outarrays easter yr;
       submit;
          %fcmpcode;
          call Easter_evnt(date, Easter, yr);
       endsubmit;
    run;

A portion of the OUTARRAY table REGTYPESERIES is shown below.

The SAS Viya method for subroutine creation and compilation looks different from the FCMP procedure approach, but it preserves the primary advantages of custom subroutines for array processing.

This post was intended to be a basic introduction to the FCMP procedure and associated functionality in the context of processing timeseries arrays. Hopefully, you already have some ideas for custom subroutines and functions that will make your work easier and more efficient. Interested readers will find many more SAS FCMP examples online.
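The comparison inside EASTER_EVNT can be checked in an ordinary DATA step before it is used on real data. The following is a minimal sketch, not part of the original demonstrations: the table name WORK.EASTER_CHECK, the year 2024 and the INTERVAL variable are arbitrary choices for illustration. It also shows one possible starting point for the bonus challenge, using INTNX to align both dates to the start of a chosen interval.

```sas
/* Hypothetical check table; the year and the date window are arbitrary */
data work.easter_check;
   length interval $8;
   interval = 'WEEK';                        /* try DAY, WEEK, MONTH or QTR */
   yr = 2024;
   easter_date = holiday('EASTER', yr);      /* date of Easter Sunday */
   do date = easter_date - 7 to easter_date + 13;
      /* Same comparison that EASTER_EVNT makes for each array element */
      easter_week = (week(date) = week(easter_date));
      /* Interval-based variant: both dates aligned to the interval start */
      easter_intv = (intnx(interval, date, 0) = intnx(interval, easter_date, 0));
      output;
   end;
   format date easter_date date9.;
run;

proc print data=work.easter_check noobs;
run;
```

With INTERVAL set to WEEK, the two flags should agree and be 1 for the seven days from Easter Sunday through the following Saturday. Changing INTERVAL shows how the interval-based flag widens or narrows with the date interval.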

chwell · ‎12-15-2023

BY group processing was introduced in the context of data-step-for-timeseries in Part 1 of this series. A table that will be used for BY group processing has sequences stacked on top of each other, and each sequence is processed as a separate array. Even though each BY group or array is operated on independently, there can be a hierarchical arrangement in the data that’s defined by the BY groups.

The purpose of this post is to present examples of BY group processing for timeseries, and the focus will be on how BY groups can be arranged to create nested tables of timeseries with a useful hierarchical structure. Large-scale timeseries applications generally consume tables that are arranged hierarchically, so the demonstrations in this blog use the large-scale tools in SAS Visual Forecasting.

To start, consider a table that contains observations on sales, prices and promotions of wine over time. The BY variables are REGION (REG1 – REG4) and TYPE (VINTAGE, VALUE, TBLWT (table white) and TBLRE (table red)). A portion of the table is shown below.

Sequences on sales, prices and promotions that flow into the data-step-for-timeseries tools are assumed to be transactional. The analyst’s job is to create the timeseries that will be used for analyses. Choices related to the time index (interval) and accumulation methods can add relevance and increase business value if the choices are consistent with the underlying patterns in the data and business practices.

BY variables provide an additional way to add relevance and usefulness by adding structure to the analysis data. For example, a two-level data hierarchy that represents brand or product type SALES flows can be created from the table shown above. It’s arranged as follows.
There are 4 wine TYPE SALES timeseries at the top level, and there are 16 SALES timeseries, one for each TYPE, REGION pair, on the bottom level of the data hierarchy. This arrangement may be optimal if, for example, production, distribution, pricing and marketing decisions are made on the basis of wine brands or types. The following code creates this two-level data hierarchy.

Demonstration 1

This code creates timeseries at the TYPE level of the data hierarchy. Plots of the four wine TYPE SALES arrays are generated. The variable listed on the BY statement defines the level of the data hierarchy that is being created.

    proc tsmodel data = mylib.wineco
                 outarray=mylib.typeseries
                 outsum=mylib.typesum;
       by type;
       id date interval=week;
       var sales /acc = sum;
       var baseprice promotion/acc = avg;
    run;

    proc sgplot data=mylib.typeseries;
       series x=date y=sales / group=type;
    run;

The timeseries arrays at the TYPE, REGION level are created next by adding the REGION variable to the BY statement. Output data set names have been changed. SALES timeseries arrays are then plotted. Note that the DATA step syntax creates a combined group variable to make plotting the 16 series easier.

    proc tsmodel data = mylib.wineco
                 outarray=mylib.typeregseries
                 outsum=mylib.typeregsum;
       by type region;
       id date interval=week;
       var sales /acc = sum;
       var baseprice promotion/acc = avg;
    run;

    data plotin;
       set mylib.typeregseries;
       sep='_';
       grp = cats(type, sep, region);
    run;

    proc sgplot data=plotin;
       series x=date y=sales / group=grp;
    run;

An important detail is that while each array in the data is operated on independently, the ID, VAR and BY statements combine to uniquely define the timeseries arrays that are created.

Alternatively, assume that decisions related to production, distribution, pricing and marketing activities are made based on geographic or regional sales flows. The following hierarchical arrangement may be optimal in this scenario.

Demonstration 2

This code creates timeseries at the REGION level of the data hierarchy.
A plot of the four wine SALES series at this level of the data hierarchy is generated. Note that the BY statement defines timeseries at the REGION level of the data hierarchy.

    proc tsmodel data = mylib.wineco
                 outarray=mylib.regseries
                 outsum=mylib.regsum;
       by region;
       id date interval=week;
       var sales /acc = sum;
       var baseprice promotion/acc = avg;
    run;

    proc sgplot data=mylib.regseries;
       series x=date y=sales / group=region lineattrs=(thickness=0.5);
    run;

The timeseries at the REGION, TYPE level of the data hierarchy are created next by adding the TYPE variable to the BY statement. Output data set names have also been changed.

New feature creation (bonus!). In addition to the BY group processing, new arrays are created for each of the sixteen BY groups in this call to the TSMODEL procedure. EASTER and XMAS are binary arrays that may be useful as input variables to capture variation associated with recurring holiday events. EASTER is 1 for week intervals that contain Easter Sunday and zero otherwise. XMAS is 1 for week intervals that contain 25DEC and zero otherwise. A helper array, YR, holds the year of each date. YEAR, WEEK and HOLIDAY are Base SAS functions. See Part 1 of this blog series for a discussion of creating new timeseries arrays with SUBMIT blocks in TSMODEL.

    proc tsmodel data = mylib.wineco
                 outarray=mylib.regtypeseries
                 outsum=mylib.regtypesum;
       by region type;
       id date interval=week;
       var sales /acc = sum;
       var baseprice promotion/acc = avg;
       outarrays easter xmas yr;
       submit;
          do t=1 to dim(sales);
             yr[t]=year(date[t]);
             EASTER[t] = (week(date[t])=week(holiday('EASTER',yr[t])));
             XMAS[t] = (week(date[t])=week(holiday('CHRISTMAS',yr[t])));
          end;
       endsubmit;
    run;

    data plotin;
       set mylib.regtypeseries;
       separator='_';
       grp = cats(region, separator, type);
    run;

    proc sgplot data=plotin;
       series x=date y=sales / group=grp lineattrs=(thickness=0.5);
    run;

A portion of the OUTARRAY table, REGTYPESERIES, that contains the new features is shown.
We’ve shown how BY groups work in the context of processing timeseries as arrays, with a focus on how different arrangements of BY variables can be used to create and define arrays in different ways. While each array in each level of the data is operated on independently, BY variables can be used to create useful, nested arrangements of data. A core idea is that the hierarchical data produced in BY group processing should be consistent with business practices and the underlying patterns in the data. This leads to increased relevance, value, and efficiency in subsequent modeling and post-processing steps.
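One way to see that the two levels of the hierarchy fit together is to roll the bottom-level SALES series up to the top level and compare the result with the series created directly at that level. The sketch below is an illustration rather than part of the original demonstrations. It assumes that MYLIB is a libref that points at the CAS library holding the Demonstration 1 output tables; WORK.TYPE_ROLLUP, WORK.HIER_CHECK, SALES_ROLLUP and DIFF are made-up names.

```sas
/* Roll the 16 TYPE x REGION series up to the TYPE level */
proc summary data=mylib.typeregseries nway;
   class type date;
   var sales;
   output out=work.type_rollup(drop=_type_ _freq_) sum=sales_rollup;
run;

/* Keep only the weeks where the roll-up and the TYPE-level series differ */
proc sql;
   create table work.hier_check as
   select a.type, a.date, a.sales, b.sales_rollup,
          a.sales - b.sales_rollup as diff
   from mylib.typeseries as a
        inner join work.type_rollup as b
        on a.type = b.type and a.date = b.date
   where abs(calculated diff) > 1e-8;
quit;
```

Because SALES is accumulated with a sum at both levels, an empty WORK.HIER_CHECK table would suggest that the 16 bottom-level series add up to the 4 top-level series.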

chwell · ‎05-31-2023

The purpose of this article is to provide an overview of concepts and SAS tools related to creating and processing timeseries as arrays. The usefulness of SAS timeseries array processing functionality, also known as the SAS data-step-for-timeseries toolbox, is illustrated in three demonstrations.

A timeseries is an indexed set of equally spaced values. Information or signal can exist in the order of and distance between values, so sequences need to remain intact in the processes of timeseries data creation, exploration and processing. A natural way to think about a timeseries in the context of data handling is as an array. An array provides a way to process a sequence of values based on an index and other user-provided attributes.

Timeseries data handling based on the idea of array processing is featured in both SAS Viya and SAS 9, and we’ll generally refer to this functionality as the SAS data-step-for-timeseries toolbox. The purpose of this series of articles is to introduce and explain the tools and to illustrate their usefulness through a series of examples.

This article provides an overview of concepts on creating and processing timeseries as arrays. Subsequent articles are previewed here with three demonstrations. Article 2 will focus on timeseries BY group processing, where multiple timeseries arrays are defined and processed using BY group or sub-setting variables. Article 3 will focus on creating user-defined subroutines and functions and then using them in an array processing block of syntax. Topics covered in future articles will depend on reader feedback, so let us know what you think and provide suggestions for data-step-for-timeseries topics.

Demo 1, Initial and New Arrays

In the first example, we’ll use the AIR data set in the SASHELP library. This table contains two variables: a count of US airline passengers, AIR, and a time index, DATE. The natural interval of the data is month, and there are 144 observations. A portion of this table is shown.
It may be useful to think of the TIMEDATA procedure (SAS/ETS) processing shown here in two steps. First, selected variables in the input data set are named and initial arrays are created. Second, new arrays are created by operating on elements of the initial arrays defined in the first step.

The PROC TIMEDATA statement lists the input data set and two data sets that will contain the results of the processing. The OUT=WORK.AIR table will contain only the time ID and the initial arrays listed on the VARS statement. The OUTARRAY table contains the time ID, the initial arrays and any new arrays created in processing. The ID statement names the time index variable from the input data set and lists the desired interval and accumulation method for array creation. The VARS statement lists the initial arrays. The OUTARRAYS statement names new arrays that will be created in subsequent processing. The DO block contains the processing for new array creation.

Note that the ID and VARS statements combine to uniquely define the array, AIR. In this case, one observation on passenger count per quarter is derived by averaging the monthly observations in the input table. Then, the DO block syntax creates four new arrays by operating on elements of the array, AIR.

    proc timedata data=sashelp.air out=work.air outarray=work.airarray print=(arrays);
       id date interval=quarter accumulate=average format=yymmdd.;
       vars air;
       outarrays rw_trend lin_trend s4 c4;
       twopi=2*constant("pi");
       do t= 1 to dim(air);
          rw_trend[t] = air[t-1];
          lin_trend[t] = t;
          s4[t] = sin(twopi*t/4);
          c4[t] = cos(twopi*t/4);
       end;
       Title "Create arrays for different trends and sinusoids";
    run;

Let’s investigate the results of the TIMEDATA call and discuss some details. The AIR variable in the WORK.AIRARRAY table is a quarter interval time series.
Its first value is the average of the first three values of (month interval) AIR in the input table. The DATE variable has a quarter interval as specified in the ID statement.

Four new arrays are created. These are common timeseries model features.

RW_TREND: a random walk prediction for the next interval is the observed value for the current interval. The lagged (t-1) value of AIR is assigned to the current (t) value of RW_TREND in the DO block.

LIN_TREND: this array simply increments by one for each successive time interval. It could be useful as an input variable in a model to capture trend variation in a deterministic way.

S4 and C4: a sine, cosine pair that repeats every four intervals. These can be useful for capturing a seasonal pattern in quarter interval data.

Note that SAS functions commonly found in DATA step syntax are valid to use in the TIMEDATA procedure. TIMEDATA and other tools implementing the data-step-for-timeseries approach accommodate most of the SAS programming statements and SAS functions that you can use in a DATA step.

Demo 2, BY Group Processing

This example will feature SAS Viya. While this software framework is different from SAS 9, shown in the first demonstration, the approach is consistent and the data-step-for-timeseries tools are implemented in a similar way.

In general, a table that will be used for BY group processing has sequences stacked on top of each other, and the data is sorted according to the BY variables and the time ID. In the simple table we’ll use here there’s one sub-setting variable, and the data has been sorted by BY_GRP and DATE. BY_GRP values identify two P_STATUS sequences. A subsequent article will illustrate how a table with more BY variables is organized.

Notes on the TSMODEL procedure (SAS Viya/Visual Forecasting) syntax: BYGRP_IN is an in-memory table that lives in a CAS library, MYLIB. Three in-memory tables are produced.
The OUTARRAY table contains the time ID, the initial arrays, here P_STATUS, and any new arrays created in processing. The OUTSCALAR table will contain system scalars defined in the processing. We’ll describe system scalars shortly. The OUTSUM table provides summary measures on initial and new arrays created in the TSMODEL procedure.

The ID and VAR statements combine to uniquely define initial arrays, here P_STATUS. Since the data flowing in already has a monthly interval, nothing was really changed in the process of creating the initial arrays in this example. The INTERVAL and ACCUMULATE options can be used to create initial arrays in other ways.

The OUTARRAYS statement lists new arrays to be created in subsequent processing, here LN_PSTATUS. The OUTSCALARS statement declares the system scalars to be created in subsequent processing, here SUM_SQ. A scalar is a single, system-generated value associated with an array. The BY statement identifies the sub-setting variable in the input table.

DATA step-like processing occurs in a SUBMIT block in the TSMODEL procedure. The DO loop runs from the first observation of each sequence to the last. A main point here is that each BY group’s values are operated on as a separate array. It may help to think of the DO loop as sequentially processing each BY group; however, it should be noted that in SAS Visual Forecasting BY groups are distributed, and the processing occurs concurrently.

The values of each new array, LN_PSTATUS, are derived by log transforming corresponding values of P_STATUS. System scalars, SUM_SQ, are derived as an accumulating sum of squared values of LN_PSTATUS.
    proc tsmodel data = mylib.bygrp_in
                 outarray=mylib.bygrp_out
                 outscalar=mylib.scalars
                 outsum=mylib.summary_stats;
       id date interval=month;
       var p_status;
       by by_grp;
       outarrays ln_pstatus;
       outscalars sum_sq;
       submit;
          do t = 1 to dim(p_status);
             ln_pstatus[t] = log(p_status[t]);
             sum_sq += ln_pstatus[t]**2;
          end;
       endsubmit;
    run;

Let’s investigate the results of the TSMODEL procedure call and discuss some details. The MYLIB.BYGRP_OUT table contains the two P_STATUS (group 1 and 2) timeseries defined by the ID, BY and VAR statements and the two new timeseries, LN_PSTATUS, created in the SUBMIT block.

The MYLIB.SCALARS table contains the generated system scalars. One scalar is created for each LN_PSTATUS array.

A portion of the columns of the MYLIB.SUMMARY_STATS table is shown. Summary statistics on each of the four timeseries are listed.

Demo 3, Creating and Calling a User Defined Subroutine

This example switches back to SAS 9. In the first part, a user-defined subroutine is created. The subroutine is then called in the TIMEDATA procedure to create a new array.

The SAS Function Compiler procedure (FCMP, Base SAS) lets you create, test and store SAS functions, CALL routines and subroutines. Here, a subroutine named MYLEAD is created and then stored in a compile library that can be referenced in subsequent steps.

    options cmplib = work.timefnc;

    proc fcmp outlib=work.timefnc.funcs;
       subroutine mylead(actual[*], transform[*]);
          outargs transform;
          actlen = DIM(actual);
          do t = 1 to actlen;
             transform[t] = (actual[t+1]);
          end;
       endsub;
    run;
    quit;

In the TIMEDATA call that follows, the ID and VARS statements combine to create the quarter interval array AIR. The new array, LEADAIR, is declared in the OUTARRAYS statement and then created by calling the MYLEAD subroutine. The arguments, AIR and LEADAIR, correspond to ACTUAL and TRANSFORM listed in the definition of the subroutine.
    proc timedata data=sashelp.air out=work.air2 outarray=airarray2 print=(arrays);
       id date interval=qtr accumulate=total format=yymmdd.;
       vars air;
       outarrays leadair;
       call mylead(air, leadair);
    run;

A portion of the OUTARRAY table, AIRARRAY2, is shown.

As you can see, the array approach provides a flexible and efficient way to handle timeseries data. In the next article, we’ll discuss BY group processing in more detail and provide more in-depth examples. Stay tuned for more data-step-for-timeseries action!
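The accumulation step in Demo 1 can be cross-checked outside of TIMEDATA. The sketch below is not part of the original demonstrations; it averages the monthly AIR values by calendar quarter with PROC SQL, and the names WORK.AIR_QTR_CHECK, QTR_START and AIR_AVG are arbitrary choices for illustration.

```sas
/* Average the monthly AIR values within each calendar quarter */
proc sql;
   create table work.air_qtr_check as
   select intnx('qtr', date, 0) as qtr_start format=yymmdd10.,
          mean(air) as air_avg
   from sashelp.air
   group by calculated qtr_start
   order by qtr_start;
quit;
```

The AIR_AVG column should line up with the quarter interval AIR array that ACCUMULATE=AVERAGE produces in WORK.AIRARRAY. Replacing MEAN with SUM gives the analogous check for the ACCUMULATE=TOTAL setting used in Demo 3.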

chwell · ‎10-09-2017

Hi Maria. Sorry if I misunderstood your question. The way to produce pre-whitened CCF plots in PROC ARIMA is to:

1) Identify the input or X variable.
2) Estimate a model for X that results in white noise residuals. This model is the pre-whitening filter.
3) Identify the Y variable and list the X variable in a crosscorr=(X) option.

Modifying the code sent earlier:

    proc arima data=in;
       identify var=x;
       estimate p=(1)(6) q=(6) ml;
       identify var=y crosscorr=(x);
    run;

Hope this helps. Chip

chwell · ‎09-28-2017

Hi Maria. It looks like you have taken the appropriate steps to pre-whiten the x for transfer function identification. If you send details of how the x variable enters the model, as indicated by the pre-whitened CCF, I'll be happy to help with that. As far as your ARIMA specification, the syntax below should specify the model. Note, I'm assuming that the parentheses indicate a factored specification, and that the second set of parentheses indicates seasonal factors.

    proc arima data=in;
       identify var=x;
       estimate p=1 q=1;
       identify var=y crosscorr=(x);
       estimate input=(<transfer fnt for x goes here>) p=(1)(6) q=(6) ml;
    run;

Also, you should check out the University Edition of SAS Studio. We have created some forecasting tasks that allow you to create ARIMA specifications in a point-and-click environment, and then see the corresponding model syntax. See https://www.sas.com/en_us/software/university-edition/download-software.html

Hope this helps, and feel free to follow up. Best, Chip

chwell · ‎07-25-2014

Hi. You have created the effect correctly using your DATA step. This is seen by adding the following statement after the ESTIMATE statement and before the RUN in PROC ARIMA:

    FORECAST lead=0 printall;

As the forecast plot shows, you estimated a transfer function that generates a gradual build to a new status quo. The IML portion of the code is outputting a data set that represents only the initial impact of the step and the associated decay. That is, it is only capturing the effect of the first 1 in the step dummy.

My IML is rusty, but the idea (and a brute force way) would be to increment time, 1, 2, 3, 4, ..., and then add the effects corresponding to each interval and previous intervals. For example, looking at the Psi plot generated by your code, the effect, under a step intervention, at time 2 would be the effect at lag 2 + the effect at lag 1. The effect at time 3 = the effects at lags 3 + 2 + 1. And so on.

Feel free to follow up and discuss details. My email is chip.wells@sas.com
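A minimal sketch of the accumulation described above, written as a DATA step rather than IML. The table name WORK.PSI and the three weights are made up for illustration; in practice the weights would be the Psi values produced by your own code.

```sas
/* Hypothetical impulse-response (psi) weights by lag */
data work.psi;
   input lag psi;
   datalines;
1 0.50
2 0.25
3 0.125
;
run;

/* Effect of a step intervention = running total of the psi weights */
data work.step_effect;
   set work.psi;
   step_effect + psi;   /* sum statement: the total is retained across rows */
run;
```

With these made-up weights, STEP_EFFECT is 0.5, 0.75 and 0.875 at times 1, 2 and 3, which is the gradual build toward a new level that the forecast plot shows.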

chwell · ‎09-09-2013

Hi Andreas. I tested this behavior in FS version 12, and it is consistent with what is listed in the help doc (see Data --> Updating in Project). Following Udo's note, if you choose to rediagnose on opening an existing project, the newly added series (e.g. SKUs that did not exist in the data when the project was last closed) are handled under the defaults just like any other series; e.g. since I had candidate independent variables, two ARIMAX models and one ESM were diagnosed for each of the new SKUs. If you choose some other update option, e.g., Select, then the BEST (ESM only) list is used to provide a model and generate forecasts for the new series.

I don't have access to version 4.1 of FStudio, but if you want to test, I have two data sets that are identical except for one type (SKU) designation. You can create the project on the restricted one and close it. Then, delete the restricted data set in the SMC, and rename and register the unrestricted one. Re-opening the project will generate the screen that Udo shows above. Hope this helps. Chip

chwell · ‎01-16-2013

Hi Andreas. The parameter estimation algorithms in Forecast Studio and ETS (TSFS) are very similar; a model that is successfully fit in FStudio should not have problems being fit in ETS and vice versa. The first thing I'd do is double check to make sure the specifications are identical. For example, is the dependent variable first differenced in your ETS specification? Something else that can be tricky is that pure delays in independent variables in a transfer function are called 'shifts' in ETS and are denoted, e.g., in PROC ARIMA as (1 $ input1, ..). Pure delays are called 'delay' in the procedures that run under the hood in Forecast Studio, e.g. PROC ARIMASPEC.

The TSFS system is a handy tool, but its functionality is very limited relative to Forecast Studio. The TSFS system allows users to fit ARIMA and ESM models from prespecified lists and specify their own custom ARIMAX models. The procedures that run 'under the hood' in Forecast Studio can do this plus: automatically identify or build a transfer function and error component specification from scratch based on patterns in the data, identify UCM specifications, do automatic model selection, and create, fit and select the best models for hundreds of thousands of series. The list goes on. If you would like to learn more about Forecast Studio, let me know. Best, Chip
