The relevant column in my dataset are "gvkey" (a group id), datadate (ex. YYYYMMDD = 20100531), and revt.
The observation frequency is annual.
I want to to construct a rolling max of "revt" for the PREVIOUS 4 years for some "gvkey".
I don't want to hard code a window size of 5 because this would not handle missing values appropriately.
I'm very new the SAS so I'm not sure how this should be handled but I think SQL is the way to go.
For sake of clarity, I need something that's a little like
```data['5yr_revt_max'] = data.groupby("gvkey")['revt'].shift(1).rolling(5).max()```
Except this doesn't account for the years and is in Python
Then, assuming there are never more than one revt value per year for a given gvkey, you would want something like this:
proc sql;
create table want as
select
a.gvkey,
a.datadate,
max(b.revt) as maxRevt
from
have as a inner join
have as b on a.gvkey=b.gvkey and
b.datadate between intnx("YEAR", a.datadate, -4) and a.datadate
group by a.gvkey, a.datadate;
quit;
(untested)
So how do you want to handle missing years for a given gvkey? Return a missing max or the max of the non-missing years?
max from the set of remaining valid years. does that make sense?
Then, assuming there are never more than one revt value per year for a given gvkey, you would want something like this:
proc sql;
create table want as
select
a.gvkey,
a.datadate,
max(b.revt) as maxRevt
from
have as a inner join
have as b on a.gvkey=b.gvkey and
b.datadate between intnx("YEAR", a.datadate, -4) and a.datadate
group by a.gvkey, a.datadate;
quit;
(untested)
So I'd like to only look at the previous 5 years and exclude the current observation.
I thought that something like this might work
```
b.datadate between intnx("YEAR", a.datadate, -4) and intnx("YEAR", a.datadate, -1)
````
^^^ All I did was modify a.datadate to intnx("YEAR", a.datadate, -1) which it doesn't like...
What do you mean by "doesn't like". Does SAS object, or do you get unexpected results. If the latter (which I suspect), provide an example of unexpected results vs the data from which they were derived.
Be aware that the expression
b.datadate between intnx("YEAR", a.datadate, -4) and intnx("YEAR", a.datadate, -1)
generates time limits of 01jan 4 calendar years before and 01jan of last year. INTNX has to be told to align things, otherwise it defaults to the start of the resulting calendar year. This pushes back the upper limit of your time range. So to use this properly, you should use the "S" (for "same") parameter, as in:
b.datadate between intnx("YEAR", a.datadate, -4,'s') and intnx("YEAR", a.datadate, -1,'s')
Of course this changes the lower limit of your time range too, but maybe it didn't have any impact.
FINALLY, just don't bother with using INTNX for the upper range limit. Just use
b.datadate between intnx("YEAR", a.datadate, -4) and datadate-1;
which will produce the same results (as long as you don't have two records in less than 365 (or 366) days), and will avoid unnecessary use of the INTNX function.
The last option is the only one that worked. The ones with the `ntnx` command produced 0 observation tables. Why would that be?
Also for the last one, how does it know to subtract from the years? As far as I understand it views 20100531 as a big integer
Proc sql;
create table final as
select
a.gvkey,
a.datadate,
max(b.revt) as maxRevt
from
data as a inner join
data as b on a.gvkey=b.gvkey and
b.datadate between intnx("YEAR", a.datadate, -4) and a.datadate-1
group by a.gvkey, a.datadate;
quit;
It should return something... The problem may be elsewhere, please post more of your code.
Proc sql;
create table final as
select
a.gvkey,
a.datadate,
max(b.revt) as maxRevt
from
data as a inner join
data as b on a.gvkey=b.gvkey and
b.datadate between intnx("YEAR", a.datadate, -4) and intnx("YEAR", a.datadate, -1)
group by a.gvkey, a.datadate;
quit;
Figured out how to insert code haha.
If you want to avoid the intricacies of intnx and intck functions, you could simply do:
Proc sql;
create table final as
select
a.gvkey,
a.datadate,
max(b.revt) as maxRevt
from
data as a inner join
data as b on a.gvkey=b.gvkey and
year(b.datadate) between year( a.datadate) -4 and year( a.datadate) -1
group by a.gvkey, a.datadate;
quit;
Here's a data step solution that assumes your data are sorted by gvkey/datadate (no data, untested):
data want (drop=_:);
set have;
by gvkey;
array rev_array {0:4} _temporary_; /* for 0mod5 through 4mod5*/
cur_yyyy=year(datadate);
if first.gvkey then call missing(of rev_array{*});
/* For any missing year (and current year), set rev_array element to missing */
do _yyyy=lag(_cur_yyyy)+1 to _cur_yyyy;
if first.gvkey=0 then rev_array{mod(_yyyy-1,5}}=.;
end;
max_revt_prior4=max(of rev_array{*});
rev_array{mod(_cur_yyyy-1,5}=revt;
run;
The idea here is to:
1. Find any holes following the year of the previous record. Set the correspond array elements in rev_array (and also the array element for the CURRENT record) to missing.
2. Get the max of the array, which will be the max of the non-missing values among the preceding 4 years.
3. After the max has been established, put the current year REVT into the array, making it available for the next 4 years.
Note the array rev_array is indexed from 0 through 4, to conform to the result of the mod(_yyyy,5) function.
Also heed the warning of @PGStats . If a company changes its fiscal year, it is possible to have two fiscal year reports (one of these "annual" reports won't cover 12 actual months) in the same calendar year. You will have two identical maximums for that calendar year. And all subsequent records will have a maximum ignoring the earlier datadate of the two.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.