BookmarkSubscribeRSS Feed
Waynerun
Calcite | Level 5

Hi all,

 

I am relatively new to SAS and could need some help regarding seasonality tests.

 

More specifically, I need to test for seasonality in quarterly cash flows. My data set contains hundreds of different firms and looks approximately like this:

 

id     year    Q1      Q2       Q3      Q4       Full Year 

a     1999      5         8         12       3               28

a     2000      3         3         1         7               14

a     2001      6         8         4         2                20

a     2002      8         6         3         5                22

...

b     2003      5         8         7         2               22

b     2004      3         1         2         7               13

b     2005      6         5         4         2                17

b     2006      9         6         3         5                23

...

c     2000      2         8         9         3               22

c     2001      3         3         1         5               12

c     2002      3         5         4         9                21

c     2003      1         6         8         5                20

 ...

 

and so on and so forth. 

 

I tried to get Seasonal ACF with PROC X13 but it didn't work out.

 

Does anyone have a simple but effective idea on how to test for seasonality?

 

Thank you very much in advance!

 

Best,

Waynerun

6 REPLIES 6
ballardw
Super User

"Didn't work out" is awful vague.

Are there errors in the log?: Post the code and log in a code box opened with the "<>" to maintain formatting of error messages.

No output? Post any log in a code box.

Unexpected output? Provide input data in the form of data step code pasted into a code box, the actual results and the expected results. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the "<>" icon or attached as text to show exactly what you have and that we can test code against.

 

First, I think you need to reshape your data so you 1) have actual date values and 2) one observation per company /date

So that instead of

id     year    Q1      Q2       Q3      Q4       Full Year 
a     1999      5         8         12       3      28

You have something that looks more like:

id date       value
a  01Jan1999   5
a  01Apr1999   8
a  01Jul1999   12
a  01Oct199    3

One hint that you may need such is the Proc option DATE= which indicates the name of the variable in the data set that contains dates.

That is easy to do:

data want;
   set have;
   array v(*) q1-q4;
   do m=1 to 4;
      date=mdy(m,1,year);
      value=v[m];
      output;
   end;
   format date yymmdd10.;
   keep id date value;
run;

Suggestion: Use a subset of the  (where=(id='a')) or similar data set option on the data set to get one model working.

Then with the data sorted by ID and Date, use a BY ID; to get all the companies.

 

I am not sure exactly what you expect for output and don't use the ETS procs very much, so you may need to provide a bit more detail

Waynerun
Calcite | Level 5

Thank you very much @ballardw and apologies for my vague description.

My previous approach was so flawed that it didn't make sense to post it in the chat.

 

I rearranged my data set according to your example. Now I have the problem that some firms do not have complete time series.

For example

 

id          date                     Value

a           01Jan1998             5

a           01Apr1998             8

a           01Jul1998             10

a           01Oct1998             3

a           01Jan2001             2

a           01Apr2001             6

a           01Jul2001             7

a           01Oct2001             9

 

Is there a procedure I can use to delete all observations after a break occurs in my time series?

 

 

Regarding the final output, I was thinking to test seasonality by looking at the autocorrelation function. So, I would do this for all firms in the sample individually and then calculate a mean across all firms.  Similar to the following picture.

 

Waynerun_0-1663524041269.png

 

But if there is another way to test seasonality in my cross-sectional data, this would be great as well!

 

Best,

Waynerun

 

 

mkeintz
PROC Star

@Waynerun wrote:

Thank you very much @ballardw and apologies for my vague description.

My previous approach was so flawed that it didn't make sense to post it in the chat.

 

I rearranged my data set according to your example. Now I have the problem that some firms do not have complete time series.

For example

 

id          date                     Value

a           01Jan1998             5

a           01Apr1998             8

a           01Jul1998             10

a           01Oct1998             3

a           01Jan2001             2

a           01Apr2001             6

a           01Jul2001             7

a           01Oct2001             9

 

Is there a procedure I can use to delete all observations after a break occurs in my time series?

 

So you want to censor a time series when a firm encounters its first missing quarter, correct?

Given your data are sorted by ID/DATE, you could do something like this:

 

data need/ view=need;
  set have;
  by id date;

  retain missing_quarter $1;
  if intck('qtr',lag(date),date)^=1 then missing_quarter='Y');
  if first.id then missing_quarter='N';
 
  if missing_quarter='N';
run;

Questions:

 

Presumably you are not concerned about some firms having shorter series than others.  But if you had a firm covering your entire study period, but missing the 3rd quarter of the first year, you would then only have a "series" of two quarters.  Is that acceptable?   What about the longer series that would be lost to the censoring operation in such a case?   Do you want to require a minimum series size?   Do you want to create two "subseries" for a firm with a single missing quarter?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Waynerun
Calcite | Level 5

Hi @mkeintz, thank you very much for your contribution!

 

The code you suggested worked perfectly 🙂

 

Just for better understanding. What does "^=1" mean?

 

The questions you have raised are also good food for thought. First, you are right. It doesn't matter if the companies have different time series. Some firms were established in 1989, others years later. So that's not a problem.

 

Secondly, the data are prepared in such a way that each company always starts with the first quarter of a fiscal year and ends with the fourth quarter of a fiscal year. So it should not be possible to have missing values only for Q3. If the value for Q3 is missing (e.g. in 1993), the entire year is missing. However, this could be a very strict approach and could be reconsidered.

 

Furthermore, I have reviewed my new data set, and indeed there is often a break in the early years of my time series. Therefore, censoring the data after the break would eliminate most of the company's observations, as you suspected. Is there any way to determine which of the two (or perhaps even three) subseries is the longest? If so, it might be a good idea to use the longest subseries for further analysis. 

 

For a minimum series size, 12 consecuitive observations (3 years) could be suitable as this would correspond to the 12 lags of the autocorrelation function. I think that since the data set is large enough to get meaningful results, it is not necessary to create two subseries.

 

I look forward to hearing your and others' thoughts on this topic and how to test best for seasonality!

 

Best,

Waynerun

mkeintz
PROC Star

@Waynerun wrote:

 

Just for better understanding. What does "^=1" mean?

 

^=1 means "not equal to 1".  So "intck('qtr',date,nxt_date)^=1" identifies instances in which the next date is not exactly 1 quarter from the current date.

 

... Is there any way to determine which of the two (or perhaps even three) subseries is the longest? If so, it might be a good idea to use the longest subseries for further analysis. 

To achieve that goal, you need to completely read each ID twice: the first time to determine the sizes of all subseries, and which one is the largest, and the second time to reread the id, and output those observations corresponding to the largest subseries.  The code below is designed to do that, but it is untested in the absence of sample data in the form of a working SAS data step:

%let min_subseries_size=12;

data need (drop=i _: nxt_date) /view=need;

  call missing(_maxsize,_maxid);  /*maximum subseries size for this id, and its corresponding subseries id*/

  do _series_id=1 by 1 until (until (last.id);
    do _size=1 by 1 until (last.id=1 or intck('qtr',date,nxt_date)^=1);
      set have (keep=id);
      by id;
      merge have  have (firstobs=2 keep=date rename=(date=nxt_date));
      if intck 'qtr',date,nxt_date)^=1 then leave;
    end;
    if _size>_maxsize then do;
      _maxsize=_size;
      _maxid=_series_id;
    end;
  end;

  do _series_id=1 by 1 until (until (last.id);
    do until (last.id=1 or intck('qtr',date,nxt_date)^=1);
      set have;
      by id;
      if _series_id=_max_id and _maxsize>=&min_subseries_size then output;
    end;
  end;

run;

 

Note: if two subseries have the same size, this code selects the earlier one.  If you prefer the later one, then change

    if _size>_maxsize then do;

to

    if _size>=_maxsize then do;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Waynerun
Calcite | Level 5

Hi @mkeintz,

 

your help is greatly appreciated! 

 

The code seems to work in general but there are a few issues I cannot resolve.

That's what my output looks like:

 

Waynerun_0-1663788630744.png

gvkey is my id and length corresponds to size.

 

1. The cells marked in yellow contain the same id, which shouldn't be the case, right?

 

2. I do know what the maximum length of a subseries is, but I do not know when it starts and when it ends.

 

3. How do I transfer the information I got in this step to my original data set, which contains all my observations with quarterly cash flows. Somehow I now need to delete all the sub-series in my original data set that are not the longest of a firm. 

 

Once I managed this, I can proceed with the output.

 

Maybe you have another idea on how to tackle this. Thanks again for all the help!

 

Best,

Waynerun

 

 

 

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 541 views
  • 0 likes
  • 3 in conversation