Hi, I have a question regarding how to calculate three year moving or rolling sum for each ID in a panel dataset.
The variables of interest are as follows:
Year | ID | NUniquePatents | Sum of N unique Patents by ID and for a three year moving window |
1901 | A | 1 | 2 |
1901 | A | 1 | 2 |
1901 |
| 0 | 2 |
1915 | B | 1 | 1 |
1915 | A | 0 | 2 |
1920 | B | 1 | 2 |
1920 | A | 0 | 2 |
1924 | A | 1 | 2 |
1924 | A | 1 | 2 |
1924 | B | 0 | 2 |
1924 | A | 0 | 2 |
1936 | A | 0 | 2 |
1936 | B | 1 | 2 |
I am looking for a code to calculate the last column highlighted in red.
In column three-1 represents unique value and 0 represents non-unique value. I want to sum up these 0s and 1s for each ID during a three year moving window. Example of values in the last column: first time period: 1901, second time period: 1901+1915, third time period: 1901+1915+1920, fourth time period: 1915+1920+1924, fifth time period: 1920+1924+1936.
I have already calculated the third column-i.e. number of unique patents. Now, I want to sum the number of unique values by ID and by three year moving window.
So for example, from the above table, assume that 1901 is the first year in the dataset , then the sum of unique value should be 2 for A. In 1915, the unique value should once again be equal to 2 as there are no new unique values for A in this time period (1901 and 1915). In 1920, A has no unique values , so for this time period (1901, 1915 and 1920) the sum should still be 2. Now for 1924, A has 2 unique values, but the sum should be calculated for the previous two years window and the current year (i.e. sum of A for years 1924, 1920 and 1915). In 1924, A has 2, 1920 A has 0 and in 1915 A has 0. Hence the sum of A for 1924 will be 2. To calculate value of A for 1936, sum of 1936, 1924 and 1920 should be considered, hence A has a value of 0 in 1936, in 1924 it has a unique value of 2 and it has a value of 0 in 1920. Hence for 1936 A is 0+2+0 which is 2. I am looking at the third column when I count the number of unique values. The same is to be done for B and other IDs.
I did try using proc SQL in SAS with a group by statement but it didn't produce the desired nuanced result. Please let me know if any of you can think of a solution. Thank you for your help.
Here is a solution using proc expand:
data have;
input Year ID$ NUniquePatents;
datalines;
1901 A 1 2
1901 A 1 2
1901 A 0 2
1915 B 1 1
1915 A 0 2
1920 B 1 2
1920 A 0 2
1924 A 1 2
1924 A 1 2
1924 B 0 2
1924 A 0 2
1936 A 0 2
1936 B 1 2
;
proc sql;
create table temp as
select id, year, sum(nuniquepatents) as n
from have
group by id, year;
quit;
proc expand data=temp out=sums(drop=TIME);
by id;
convert n=nsum / transformout=(movsum 3);
run;
proc sql;
create table want as
select a.*, b.nsum
from
have as a inner join
sums as b on a.id=b.id and a.year=b.year;
drop table temp, sums;
quit;
How is it that for 1915/A you have a three-year sum of 2, when you have no patents within 2 years of 1915.
Also, are you thinking of 3-year trailing windows, centered-windows, or leading-windows?
Oops forget it.
I think you need a series of steps:
Here's an untested version:
proc summary data=have noprint nway completetypes;
class id year;
var nuniquepatents;
output out=need (drop=_type_ _freq_) sum=yearly_count;
run;
data need2 ;
set need;
by id ;
if yearly_count=. then yearly_count=0;
three_year_count= yearly_count + ifn(lag(id)=id,lag(yearly_count),0) + ifn(lag2(id)=id,lag2(yearly_count),0);
run;
data want;
set have need2 (obs=0);
if _n_=1 then do;
declare hash h (dataset:'need2');
h.definekey('id','year');
h.definedata('yearly_count','three_year_count');
h.definedone();
end;
h.find();
run;
Hi @mkeintz
Thanks for you help. I ran the body of codes you suggested and the yearly count is calculated correctly by ID but there a couple of issues with three year count variable
Maybe I didn't explain it clearly. Sorry if that was the case. Considering the table below as an example:
Year | ID | Unique Class | NUniquePatents | Yearly Count | Three Year Count |
1901 | A | 56 | 1 | 3 | 3 |
1901 | A | 56 | 1 | 3 | 3 |
1901 |
| 56 | 1 | 3 | 3 |
1915 | A | 56 | 1 | 2 | 5 |
1915 | A | 56 | 1 | 2 | 5 |
1924 | A | 56 | 1 | 4 | 9 |
1924 | A | 56 | 1 | 4 | 9 |
1924 | A | 56 | 1 | 4 | 9 |
1924 | A | 56 | 1 | 4 | 9 |
1936 | A | 56 | 0 | 0 | 6 |
1936 | A | 56 | 0 | 0 | 6 |
1. The first issue with the code is that after three years the code resets the sum, but it should be rolling. If we look at the table above for A, the count of nuniuniquepatents is 0 for 1936 but while the yearly count can be zero, the three year count should take into account sum of 1936+1924+1915, so the 3 year count should be 0+4+2, i.e. 6 as shown in the highlighted last row
2. The second issue is that the three year count window is always supposed to take into consideration the current year + the previous two years. Only for the very first and second observations in the dataset, as no previous years are available -the three year count can take into account only the current year for first observation and the current and previous year for second observation. But from n3 onwards, for no other observations, the count should follow this logic.
So the count should resemble something similar to this: n1, n1+n2, n1+n2+n3, n2+n3+n4, n3+n4+n5, n4+n5+n6...........n10000000+n10000001+n10000002
But thank you for your help and sorry if there was any miscommunication before.
Here is a solution using proc expand:
data have;
input Year ID$ NUniquePatents;
datalines;
1901 A 1 2
1901 A 1 2
1901 A 0 2
1915 B 1 1
1915 A 0 2
1920 B 1 2
1920 A 0 2
1924 A 1 2
1924 A 1 2
1924 B 0 2
1924 A 0 2
1936 A 0 2
1936 B 1 2
;
proc sql;
create table temp as
select id, year, sum(nuniquepatents) as n
from have
group by id, year;
quit;
proc expand data=temp out=sums(drop=TIME);
by id;
convert n=nsum / transformout=(movsum 3);
run;
proc sql;
create table want as
select a.*, b.nsum
from
have as a inner join
sums as b on a.id=b.id and a.year=b.year;
drop table temp, sums;
quit;
Hi @StevenR Here is my stab at it using SQL
data have;
input Year ID$ NUniquePatents;
datalines;
1901 A 1 2
1901 A 1 2
1901 A 0 2
1915 B 1 1
1915 A 0 2
1920 B 1 2
1920 A 0 2
1924 A 1 2
1924 A 1 2
1924 B 0 2
1924 A 0 2
1936 A 0 2
1936 B 1 2
;
/*sequence number for years and get row number*/
data temp;
set have ;
by year;
if first.year then n+1;
rn+1;
run;
proc sql;
/*summarize by year,id*/
create table temp1 as
select year,id,n ,sum(NUniquePatents) as t
from temp
group by year,id,n;
/*Merge the summary on the rolling logic year between n-2 and n*/
create table want(drop=rn n) as
select a.*,sum
from temp a left join
(select a.year,a.id,sum( b.t) as sum
from temp1 a left join temp1 b
on a.id=b.id and a.n-2<=b.n<=a.n
group by a.year,a.id) b
on a.id=b.id and a.year=b.year
order by rn;
quit;
Hi @StevenR Since Year values are simple integers rolling computations may find home in Hash too. I had some fun. The only requirement upfront is to sort by ID Year. Once that's done, it's a mere BY Group processing that uses a concept I deem as "Park and Pick".
data have;
input Year ID$ NUniquePatents;
datalines;
1901 A 1 2
1901 A 1 2
1901 A 0 2
1915 B 1 1
1915 A 0 2
1920 B 1 2
1920 A 0 2
1924 A 1 2
1924 A 1 2
1924 B 0 2
1924 A 0 2
1936 A 0 2
1936 B 1 2
;
proc sort data=have out=temp;
by id year;
run;
data want;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("year") ;
h.definedata ("_s") ;
h.definedone () ;
end;
do until(last.id);
set temp;
by id year;
if first.year then _s=0;
_s=sum(_s, NUniquePatents);
if last.year then h.add();
end;
array t(0:2) _temporary_;
do until(last.id);
set temp;
by id year;
if first.year then do;
_n+1;
_n1=mod(_n,3);
h.find();
t(_n1)=_s;
sum=sum(of t(*));
end;
output;
end;
h.clear();
call missing(of t(*),of _:);
drop _:;
run;
proc print noobs;
run;
Sorry seeing this message now, thank you for the detailed reply 🙂
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.