Performance Improvements in PROC CSSM for Scalable State Space Modeling

2 Likes

By Rajesh Selukar

Introduction

PROC CSSM has several new performance enhancements. Of all these, the introduction of a new more efficient likelihood gradient algorithm has the most significant impact on its scalability. In the latest version of PROC CSSM (2022.1.3 release), you might notice significant speedups for some of your programs. Generally, noticeable speedups can be seen for programs with large data sets as well as large and complex state space models. For instance, if you run the example A Dynamic Factor Model for the Yield Curve, you will notice that your program runs about four times faster compared to previous releases. When you run a slightly modified version of this program where you use a non-diagonal autoregressive matrix in the VARMA specification of the zeta factor (state zeta(3) type=VARMA(p=1) cov(g);), it’s as eight times as quick. Moreover, this slightly more complex state space model fits the data better. Large data sets and complex state space models are common in many fields. For example, in the analysis of data that are generated because of multi-level, multi-subject, longitudinal studies, complex state space models are a norm. For more information on the state space modeling of such hierarchical, longitudinal data, see Functional Modeling of Longitudinal Data with the SSM Procedure.

In some situations, the scalability improvements in the latest version of PROC CSSM can reduce the program's running times from hours to minutes and from days to hours. The new algorithm is based on the backpropagation (or adjoint) method in the algorithmic-differentiation (AD) literature. Prior to this, PROC CSSM used a Kalman filter-smoother (KFS) based gradient algorithm. The new backpropagation-based algorithm turns out to be much more efficient than the KFS-based algorithm, particularly when at least some of the model parameters affect the transition matrix of the state space model. Next, let’s summarize the results of two small performance studies that investigated the performance gains due to the new gradient algorithm.

Time Comparison of Old and New Gradient Algorithms

When the parameter vector does not have any parameter that affects the transition matrix, the computation times of the new and old gradient algorithms are expected to be comparable. Nevertheless, our initial testing suggests that the new algorithm is usually faster than the old algorithm even for such cases. When the parameter vector has some parameters that affect the transition matrix, the new algorithm is often significantly faster than the old one. To provide some idea about the performance gains due to the use of the new algorithm in practical scenarios, let me present the results of two small performance studies. In these studies, the timings of the gradient computation and the overall parameter estimation are compared for the old and the new algorithm under identical settings. Note that, due to finite-precision arithmetic the gradient vectors produced by these two mathematically equivalent algorithms can differ slightly. This in turn can lead to the following situations during the parameter estimation:

Nearly identical parameter estimates are obtained in the same number of iterations.
Nearly identical parameter estimates are obtained in a different number of iterations.
The parameter estimates differ significantly. This could be because in one or both cases the optimization process does not converge or converge to different local optima.

The first two situations are more common when the model and the data are reasonably in accord and the parameter estimation problem is well-posed. The issues included in the performance studies are chosen with this consideration.

The first performance study covers all the illustrative examples in the documentation for PROC CSSM, including the “Getting Started” example. In all, this study has 18 different parameter estimation problems. While relatively small in problem size, the illustrative examples cover various state space modeling scenarios. A summary of the findings for this study is as follows:

For all examples, nearly identical parameter estimates were obtained. In a few cases the number of iterations to achieve the optimum differed.
In 17 out of the 18 cases, the new algorithm computed the gradient faster than the old algorithm and for the one case that defied this pattern, the new algorithm was only marginally slower.
With the new algorithm, the combined time of parameter estimation for these 18 examples was less than fifty percent of the combined time of the old algorithm. That is, overall, more than twice the speedup was achieved.

The second study considered moderate-sized problems of diverse types from real-life settings. For SSMs, the computational cost of parameter estimation depends on factors such as, nObs, the number of observations in the data set, nParm, the size of the parameter vector, stateSize, the size of the latent state vector, and so on. In the comparison between the old and the new algorithm, nTParm, the number of parameters that affect the transition matrix, is also a key factor. In this study, the problems are chosen with different values for these factors. Its findings are summarized in Table1. The Speed-Up column in Table 1 shows the ratio of parameter estimation times between the old and the new algorithm. It shows that under a variety of situations the new algorithm provides a considerable speedup over the old algorithm.

Problem #	nObs	nParm	nTParm	stateSize	Speed-Up
1	1380	25	6	140	9.0x
2	6324	39	9	6	8.1x
3	1581	8	0	349	1.8x
4	4360	4	0	547	2.1x
5	2004	7	2	95	3.1x

In conclusion, I would love to hear about your experience, whether good or bad of working with this new change in PROC CSSM. Drop me a note at Rajesh Selukar@sas.com