Hello everyone!
I am desperate for advice after a lot of browsing through the SAS user guide and the Community. I use SAS Enterprise Guide 7.1 (64-bit).
My time series has the following characteristics:
1) daily data;
2) irregular (no weekends or holidays included) - so I have a 5 day week, but there are weeks which have 6 days, or holiday weeks that are shorter;
3) data display intraweek, monthly, quarterly and annual seasonality;
4) data reflect payments made by individuals.
I came across a Lex Jansen paper with a very clear overview of UCM. One of the examples is forecasting the Dow Jones index with PROC UCM. What I noticed is that the code implies a 5-day week. Please see the code:
proc ucm data=dow plot=all;
   id date interval=weekday;
   model close;
   level;
   slope;
   season type=dummy length=5;
Now the reason I got interested is that when I make my data regular by applying proc expand and make it a 7 day week I seem to lose crucial info on consumers' behavioral patterns.
In addition, my forecast yields a very high standard error: 0.5% of the actual value, which is a lot in this case. What is more, my forecast tends to have much longer harmonics, suggesting that the interpolation PROC EXPAND applies makes the data inconsistent with its original pattern.
Here is my code:
%let days_to_predict = 5;
%let dir = C:/home/Far/;   /* plain path; %sysfunc is not needed here */

ods graphics on;
goptions device=ACTXIMG;
ods pdf file="&dir.Far.pdf";

PROC IMPORT DATAFILE="&dir.UCM.xlsx" DBMS=XLSX OUT=ttt REPLACE;
   GETNAMES=YES;
RUN;

/* Interpolate the irregular series onto a 7-day daily grid */
PROC EXPAND DATA=ttt OUT=ttt FROM=DAY ALIGN=BEGINNING
            METHOD=SPLINE(NOTAKNOT, NOTAKNOT)
            PLOT=(ALL SERIES)
            OBSERVED=(BEGINNING, BEGINNING);
   id date;
   CONVERT dough /;
RUN;

DATA ttt;
   set ttt;
   LENGTH date 8 dough 8;
   KEEP date dough;
   FORMAT date DATE9. dough F12.4;
   INFORMAT date DATE9. dough BEST12.;
RUN;

PROC SORT DATA=ttt(KEEP=date dough) OUT=ttt;
   BY date;
RUN;

/* Calendar regressors: weekday, day of month, bumps around New Year */
data ttt;
   set ttt;
   wd = weekday(date);
   dy = day(date);
   b_ny = exp(-(MDY(12,31,Year(date))-date)**2/40);
   a_ny = exp(-(date-MDY(12,31,Year(date)-1))**2/40);
   **qt = qtr(date);
   **may = exp(-(MDY(5,5,Year(date))-date)**2/20);
run;

proc ucm data=ttt;
   id date interval=day;
   model dough=wd b_ny a_ny; **may qt;
   outlier maxnum=30;
   level plot=smooth;
   slope plot=smooth;
   season length=365 type=trig keeph=2 to 12 by 1 print=harmonics plot=(FILTER SMOOTH);
   cycle period=7 noest=(period);
   irregular p=3 q=3;
   estimate back=0 plot=panel;
   forecast skipfirst=3000 back=0 lead=&days_to_predict plot=decomp;
run;

ods graphics off;
ods pdf close;
I have therefore 3 crucial questions:
1) Is there a way to work with irregular data using proc UCM?
2) How can I improve my forecast if I get longer harmonics with higher amplitude?
3) Is there code that, instead of the usual interpolation of PROC UCM, would allow me to expand my dates to weekends and then copy the observation available on the previous working day? Say I have irregular daily data (a normal 5-day week with some exceptions for the holiday season). I want to expand my dates to get 7-day-week seasonality. Then for Saturdays and Sundays I would like to have the same value as was observed on Friday.
Again the priority is to learn how to deal with irregular time series. But if that is impossible I would take any advice that would help me build a more precise forecast.
First a few comments about your UCM code:
1. The length= in the season statement must be an integer.
2. Usually it is a good idea to include a simple noise component (IRREGULAR) in the model.
A good book for UCMs: Pelagatti, M. M. (2015). Time Series Modelling with Unobserved Components. Boca Raton, FL: CRC Press.
It is not easy for me to check your data pattern. Try to see if your data can be put in some "weekday" interval pattern (see the section https://go.documentation.sas.com/?docsetId=etsug&docsetTarget=etsug_intervals_toc.htm&docsetVersion=... ) supported by SAS. If your holidays appear within these intervals, you will need to insert them in your data (with missing value for your response, close). After this your series will be reasonably regular. At least initially, don't specify periods in your cycles (let the procedure estimate the period). Similarly, include the SEASON statement only if you have at least four complete seasons (why are you skipping the first harmonic?).
Dealing with data that may have complex seasonal or cyclical patterns can be difficult. If you are able to create a time series of equally spaced observations from your original time series, you can use the UCM procedure for such a task. You might need to insert new observations with missing response values or delete some observations (such as holidays) to ensure that successive observations are "equally" spaced. While doing this if the associated time ID variable cannot be assigned a proper date interval then you can just use the observation number as the time id. This process should not require any interpolation (e.g., the use of PROC EXPAND). Now you can try to explore the natural periodicities in your data, which might be different than 7 or 365. Let me know how this works.
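The regularization described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the input data set ttt and its variables date and dough come from the original post, while the output names ttt_reg and ttt_idx and the index variable t are hypothetical. PROC TIMESERIES is used here as one possible way to insert the missing weekdays; the reply above does not name a specific procedure.

```sas
/* Put the series on a Monday-to-Friday grid. Weekdays that are absent */
/* from the input (public holidays) become observations with a missing */
/* response; nothing is interpolated.                                  */
proc timeseries data=ttt out=ttt_reg;
   id date interval=weekday accumulate=none setmissing=missing;
   var dough;
run;

/* Alternative when no SAS interval fits the calendar: use the         */
/* observation number as the time index (hypothetical variable t).     */
data ttt_idx;
   set ttt;       /* assumes ttt is already sorted by date */
   t = _n_;
run;
```

One caveat: in the WEEKDAY interval, Saturday and Sunday are counted as part of the same period as the preceding Friday, so any working Saturdays in the data would probably need to be handled (removed or merged) before the first step. As far as I know, PROC UCM can also be run without an ID statement, in which case the observation number is used as the time index.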
Hello, dear rselukar!
I apologize for the late reply; I was busy trying ARDL modelling in SAS, which distracted my attention from PROC UCM. Thank you for the advice on using the observation number as the time ID. I will try this option now.
As to your suggestion to build an equally spaced series by inserting new observations or deleting holidays, I am afraid it is quite difficult to achieve. Every year the working-day calendar is modified depending on the weekday a holiday falls on. For example, if Independence Day falls on a Thursday in a specific year, there would be 4 days off (Thursday, Friday, Saturday and Sunday), but if in another year it falls on a Tuesday, the break would not extend to the weekend and people would get only 1 day off (Tuesday). Therefore different years have slightly different lengths.
My original data excludes public holidays and weekends.
The problem I see now is that if I use the observation ID as the time ID, I lose part of the information relevant for ARIMA modelling, which is my preliminary model estimation step before proceeding to UCM.
My average week lasts 4.79 days, a month 20.58 days, and a year 51.2 weeks or 245.6 days. This is also something that I will introduce into the current UCM parameters.
What is your assessment of the extent to which UCM estimates might be inconsistent due to a standardized approach to time identification (i.e. assigning each observation a serial number rather than real dates)?
Is there a guide to building a UCM by hand? I am not sure that I understand the procedure for harmonics estimation. Understanding the inner workings of the model would help me customise it for specific tasks.
I am trying to see how best to answer your questions. Here are my comments:
1. The ARIMA, UCM, AUTOREG (and VARMAX for multivariate setting) procedures in SAS assume that the observations are collected at (logically) equally spaced time points. Therefore, the actual index variable used internally is always the observation number. The SAS time-ID variable, if supplied, is used only to label the observations (and to provide an additional check to see if the observations are properly ordered). In particular, this means that my suggestion to create a time series of equally spaced observations applies to ARIMA as well as UCM modeling (and for your ARDL modeling also).
2. I know it will be tedious to create a true equally spaced time series for your situation but it will be useful to come as close to it as easily possible (it is perfectly OK to have embedded missing response values if you are using PROC UCM or PROC ARIMA).
3. Once you have such a time series, you are ready to use PROC UCM. If you think that the series does not have seasonal pattern with integer period but has approximate periodic patterns then you can include one or more cycle components (start with one or two). Start with a smooth trend (such as local linear trend with disturbance variance of level set to zero). This helps in the identification of cycles. You can also use regression variables to take account of the holidays or other special events. Initially do not add ARMA orders in the IRREGULAR statement (ARMA component can act like a cycle component and complicate the cycle identification). After reasonable cycle components are identified, you can add lower order ARMA part (say p=1 or q=1) to the IRREGULAR.
Let's see if this works.
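A first-pass model along the lines of point 3 might look like the sketch below. It is not a tuned specification. It assumes a data set (hypothetically named ttt_reg) that is already on a weekday grid with missing values of dough on holidays; the data set name and the forecast lead of 5 are illustrative assumptions.

```sas
proc ucm data=ttt_reg;
   id date interval=weekday;
   model dough;              /* holiday or event regressors could be added after "=" */
   irregular;                /* plain noise, no ARMA orders at this stage */
   level variance=0 noest;   /* level disturbance variance fixed at zero */
   slope;                    /* together with the line above: a smooth trend */
   cycle;                    /* period, damping, and variance are all estimated */
   cycle;                    /* second cycle; drop it if it is not identified */
   estimate;
   forecast lead=5;
run;
```

Once the estimated cycles look reasonable, the IRREGULAR statement could be changed to something like irregular p=1; to add a low-order ARMA part, as suggested above.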
Hello, rselukar!
Thank you for the reply. I still struggle to understand what you mean by "introducing new observations with missing values" to create equally spaced time series. Here is what I am dealing with. The raw data does not include observations for holidays and weekends. Stage 1. Here is an example:
29.04.19 | 6619 | 28.04.18 | 6437 | 28.04.17 | 5631 |
30.04.19 | 6637 | 03.05.18 | 6381 | 02.05.17 | 5586 |
06.05.19 | 6583 | 04.05.18 | 6389 | 03.05.17 | 5585 |
07.05.19 | 6622 | 07.05.18 | 6388 | 04.05.17 | 5634 |
08.05.19 | 6650 | 08.05.18 | 6407 | 05.05.17 | 5702 |
13.05.19 | 6611 | 10.05.18 | 6434 | 10.05.17 | 5669 |
14.05.19 | 6614 | 11.05.18 | 6470 | 11.05.17 | 5700 |
15.05.19 | 6637 | 14.05.18 | 6476 | 12.05.17 | 5750 |
16.05.19 | 6665 | 15.05.18 | 6493 | 15.05.17 | 5741 |
This is an extract from the series. The sample shows data for the period with the May holidays. As you can see, the holidays make it difficult to create an equally spaced time series.
Stage 2. To tackle the issue I tried to follow your advice by introducing observations with missing values. Here is what I got:
28.04.2017 | 5631,384 | 28.04.2018 | 6437,385 | 28.04.2019 | #NA |
29.04.2017 | #NA | 29.04.2018 | #NA | 29.04.2019 | 6619,458 |
30.04.2017 | #NA | 30.04.2018 | #NA | 30.04.2019 | 6637,063 |
01.05.2017 | #NA | 01.05.2018 | #NA | 01.05.2019 | #NA |
02.05.2017 | 5586,474 | 02.05.2018 | #NA | 02.05.2019 | #NA |
03.05.2017 | 5584,853 | 03.05.2018 | 6381,419 | 03.05.2019 | #NA |
04.05.2017 | 5633,576 | 04.05.2018 | 6389,251 | 04.05.2019 | #NA |
05.05.2017 | 5702,383 | 05.05.2018 | #NA | 05.05.2019 | #NA |
06.05.2017 | #NA | 06.05.2018 | #NA | 06.05.2019 | 6582,803 |
07.05.2017 | #NA | 07.05.2018 | 6388,149 | 07.05.2019 | 6621,953 |
08.05.2017 | #NA | 08.05.2018 | 6407,099 | 08.05.2019 | 6650,353 |
09.05.2017 | #NA | 09.05.2018 | #NA | 09.05.2019 | #NA |
10.05.2017 | 5669,212 | 10.05.2018 | 6434,303 | 10.05.2019 | #NA |
11.05.2017 | 5700,203 | 11.05.2018 | 6470,12 | 11.05.2019 | #NA |
12.05.2017 | 5750,052 | 12.05.2018 | #NA | 12.05.2019 | #NA |
13.05.2017 | #NA | 13.05.2018 | #NA | 13.05.2019 | 6611,123 |
14.05.2017 | #NA | 14.05.2018 | 6476,379 | 14.05.2019 | 6614,203 |
15.05.2017 | 5741,485 | 15.05.2018 | 6492,975 | 15.05.2019 | 6636,823 |
The original code works for the data but UCM forecast is poor from RMSE perspective. As an alternative I assumed there was no change of the dependent variable for the weekends.
Stage 3. So I copied the values of the working day preceding the days off. I could not find code that does this in SAS, so I resorted to the SUMIF function in Excel:
27.04.17 | 5611 | 27.04.18 | 6422 | 27.04.19 | 6585 |
28.04.17 | 5631 | 28.04.18 | 6422 | 28.04.19 | 6585 |
29.04.17 | 5631 | 29.04.18 | 6422 | 29.04.19 | 6619 |
30.04.17 | 5631 | 30.04.18 | 6418 | 30.04.19 | 6637 |
01.05.17 | 5624 | 01.05.18 | 6418 | 01.05.19 | 6637 |
02.05.17 | 5586 | 02.05.18 | 6385 | 02.05.19 | 6637 |
03.05.17 | 5585 | 03.05.18 | 6381 | 03.05.19 | 6637 |
04.05.17 | 5634 | 04.05.18 | 6389 | 04.05.19 | 6637 |
05.05.17 | 5702 | 05.05.18 | 6389 | 05.05.19 | 6637 |
06.05.17 | 5702 | 06.05.18 | 6389 | 06.05.19 | 6583 |
07.05.17 | 5702 | 07.05.18 | 6388 | 07.05.19 | 6622 |
08.05.17 | 5698 | 08.05.18 | 6407 | 08.05.19 | 6650 |
09.05.17 | 5698 | 09.05.18 | 6407 | 09.05.19 | 6650 |
10.05.17 | 5669 | 10.05.18 | 6434 | 10.05.19 | 6650 |
11.05.17 | 5700 | 11.05.18 | 6470 | 11.05.19 | 6650 |
12.05.17 | 5750 | 12.05.18 | 6470 | 12.05.19 | 6650 |
13.05.17 | 5750 | 13.05.18 | 6470 | 13.05.19 | 6611 |
14.05.17 | 5750 | 14.05.18 | 6476 | 14.05.19 | 6614 |
15.05.17 | 5741 | 15.05.18 | 6493 | 15.05.19 | 6637 |
I ran proc UCM again but yet again I failed to improve my forecast.
Here are my questions with respect to my current situation:
1) Does "introducing observations with missing values" stand for what I did in stage 2?
2) What is the code for transforming original irregular date series into regular series with missing values?
3) Is there a code that would allow me to copy existing values of the working days previous to the days off? Anything similar to Excel sumif function in SAS?
4) Once you added holidays and weekends how do you assign index variable or SAS time-ID variable instead of the imported date variable? Please supply the code. I tried this:
PROC IMPORT DATAFILE="&dir.decomp_UCM.xlsx" DBMS=XLSX OUT=ttt REPLACE;
   GETNAMES=YES;
RUN;

DATA ttt;
   set ttt;
   LENGTH date 8 cash 8;
   KEEP date cash;
   FORMAT date DATE9. cash F12.4;
   INFORMAT date DATE9. cash BEST12.;
RUN;

proc datasets library=work;
   modify ttt;
   index cash;
run;

PROC SORT DATA=ttt(KEEP=date cash_abs) OUT=ttt;
   BY date;
RUN;
5) Suppose we have come to the point where the dataset is equally spaced. How do I specify the season and cycle?
The season parameter does not allow me to introduce numbers with decimals, requiring integers. You also mentioned that the cycle can be specified for both weekly and monthly patterns. Is the following code correct:
proc ucm data=ttt;
   id date interval=day;
   model cash;
   outlier maxnum=30;
   level plot=smooth;
   slope plot=smooth;
   season length=245.6 type=trig keeph=2 to 12 by 1 print=harmonics plot=(FILTER SMOOTH);
   cycle period=4.79 noest=(period);
   cycle period=20.58 noest=(period);
   estimate back=0 plot=panel;
   forecast skipfirst=3000 back=0 lead=&days_to_predict plot=decomp;
run;
Still the automatic UCM program yields unsatisfactory results. I am desperate to get a good approximation. Would you suggest trying Lex Jansen UCM procedure by hand? Meaning building and estimating state and signal equations without resorting to automatic solution? Would you recommend any literature on that?
If you can, please provide code to illustrate your suggestions.
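For questions 2 and 3 above (a regular date grid with missing values, and copying the last working day's value forward), one possible DATA step approach is sketched here. It is an illustration under assumptions rather than a recommended model input: it assumes the data set ttt is sorted by date with one row per date and a numeric variable dough as in the first post, and the names calendar, ttt_locf, last_val, dough_filled, d0, and d1 are hypothetical.

```sas
/* Find the first and last date in the data */
proc sql noprint;
   select min(date), max(date) into :d0 trimmed, :d1 trimmed
   from ttt;
quit;

/* Build a full 7-day calendar over that range */
data calendar;
   do date = &d0 to &d1;
      output;
   end;
   format date date9.;
run;

/* Merge the observed values onto the calendar. Dates without an     */
/* observation get a missing dough (the Stage 2 layout), and         */
/* dough_filled carries the last non-missing value forward (Stage 3). */
data ttt_locf;
   merge calendar(in=incal) ttt(keep=date dough);
   by date;
   if incal;
   retain last_val;
   if not missing(dough) then last_val = dough;
   dough_filled = last_val;
   drop last_val;
run;
```

In the output, dough is the series with embedded missing values and dough_filled is the carried-forward version, so both variants can be compared on the same date grid.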