Solved: Re: Aggregating or summing over three year rolling window for each ID

StevenR · Posted 04-11-2020 03:33 PM

Hi, I have a question regarding how to calculate three year moving or rolling sum for each ID in a panel dataset.

The variables of interest are as follows:

Year	ID	NUniquePatents	Sum of N unique Patents by ID and for a three year moving window
1901	A	1	2
1901	A	1	2
1901	A	0	2
1915	B	1	1
1915	A	0	2
1920	B	1	2
1920	A	0	2
1924	A	1	2
1924	A	1	2
1924	B	0	2
1924	A	0	2
1936	A	0	2
1936	B	1	2

I am looking for a code to calculate the last column highlighted in red.

In column three-1 represents unique value and 0 represents non-unique value. I want to sum up these 0s and 1s for each ID during a three year moving window. Example of values in the last column: first time period: 1901, second time period: 1901+1915, third time period: 1901+1915+1920, fourth time period: 1915+1920+1924, fifth time period: 1920+1924+1936.

I have already calculated the third column-i.e. number of unique patents. Now, I want to sum the number of unique values by ID and by three year moving window.

So for example, from the above table, assume that 1901 is the first year in the dataset , then the sum of unique value should be 2 for A. In 1915, the unique value should once again be equal to 2 as there are no new unique values for A in this time period (1901 and 1915). In 1920, A has no unique values , so for this time period (1901, 1915 and 1920) the sum should still be 2. Now for 1924, A has 2 unique values, but the sum should be calculated for the previous two years window and the current year (i.e. sum of A for years 1924, 1920 and 1915). In 1924, A has 2, 1920 A has 0 and in 1915 A has 0. Hence the sum of A for 1924 will be 2. To calculate value of A for 1936, sum of 1936, 1924 and 1920 should be considered, hence A has a value of 0 in 1936, in 1924 it has a unique value of 2 and it has a value of 0 in 1920. Hence for 1936 A is 0+2+0 which is 2. I am looking at the third column when I count the number of unique values. The same is to be done for B and other IDs.

I did try using proc SQL in SAS with a group by statement but it didn't produce the desired nuanced result. Please let me know if any of you can think of a solution. Thank you for your help.

PGStats · Posted 04-11-2020 05:55 PM

Here is a solution using proc expand:

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;

proc sql;
create table temp as
select id, year, sum(nuniquepatents) as n
from have
group by id, year;
quit;

proc expand data=temp out=sums(drop=TIME);
by id;
convert n=nsum / transformout=(movsum 3);
run;

proc sql;
create table want as
select a.*, b.nsum
from
	have as a inner join
	sums as b on a.id=b.id and a.year=b.year;
drop table temp, sums;
quit;

PG

View solution in original post

mkeintz · Posted 04-11-2020 03:54 PM

~~How is it that for 1915/A you have a three-year sum of 2, when you have no patents within 2 years of 1915.~~

~~Also, are you thinking of 3-year trailing windows, centered-windows, or leading-windows?~~

Oops forget it.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 04-11-2020 04:29 PM

I think you need a series of steps:

Run a proc (proc summary here) that gives the totals of nuniquepatents for each year/id, call it variable yearly_count.
sort it by id/year, and then generate a trailing three "year" window summing yearly_count into three_year_count.
Use the output of #2 as a lookup table as you re-read and output the original data set.

Here's an untested version:

proc summary data=have noprint nway completetypes;
  class id year;
  var nuniquepatents;
  output out=need (drop=_type_ _freq_) sum=yearly_count;
run;

data need2 ;
  set need;
  by id ;
  if yearly_count=. then yearly_count=0;
  three_year_count= yearly_count + ifn(lag(id)=id,lag(yearly_count),0) + ifn(lag2(id)=id,lag2(yearly_count),0);
run;

data want;
  set have need2 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'need2');
      h.definekey('id','year');
      h.definedata('yearly_count','three_year_count');
      h.definedone();
  end;
  h.find();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

StevenR · Posted 04-11-2020 05:53 PM

Hi @mkeintz

Thanks for you help. I ran the body of codes you suggested and the yearly count is calculated correctly by ID but there a couple of issues with three year count variable

Maybe I didn't explain it clearly. Sorry if that was the case. Considering the table below as an example:

Year	ID	Unique Class	NUniquePatents	Yearly Count	Three Year Count
1901	A	56	1	3	3
1901	A	56	1	3	3
1901	A	56	1	3	3
1915	A	56	1	2	5
1915	A	56	1	2	5
1924	A	56	1	4	9
1924	A	56	1	4	9
1924	A	56	1	4	9
1924	A	56	1	4	9
1936	A	56	0	0	6
1936	A	56	0	0	6

1. The first issue with the code is that after three years the code resets the sum, but it should be rolling. If we look at the table above for A, the count of nuniuniquepatents is 0 for 1936 but while the yearly count can be zero, the three year count should take into account sum of 1936+1924+1915, so the 3 year count should be 0+4+2, i.e. 6 as shown in the highlighted last row

2. The second issue is that the three year count window is always supposed to take into consideration the current year + the previous two years. Only for the very first and second observations in the dataset, as no previous years are available -the three year count can take into account only the current year for first observation and the current and previous year for second observation. But from n3 onwards, for no other observations, the count should follow this logic.

So the count should resemble something similar to this: n1, n1+n2, n1+n2+n3, n2+n3+n4, n3+n4+n5, n4+n5+n6...........n10000000+n10000001+n10000002

But thank you for your help and sorry if there was any miscommunication before.

PGStats · Posted 04-11-2020 05:55 PM

Here is a solution using proc expand:

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;

proc sql;
create table temp as
select id, year, sum(nuniquepatents) as n
from have
group by id, year;
quit;

proc expand data=temp out=sums(drop=TIME);
by id;
convert n=nsum / transformout=(movsum 3);
run;

proc sql;
create table want as
select a.*, b.nsum
from
	have as a inner join
	sums as b on a.id=b.id and a.year=b.year;
drop table temp, sums;
quit;

PG

StevenR · Posted 04-11-2020 06:50 PM

Hi, thank you so much for your help, the entire code executed perfectly!

novinosrin · Posted 04-11-2020 06:32 PM

Hi @StevenR Here is my stab at it using SQL

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;
/*sequence number for years and get row number*/
data temp;
 set have ;
 by year;
 if first.year then n+1;
 rn+1;
run;

proc sql;
/*summarize by year,id*/
create table temp1 as
select year,id,n ,sum(NUniquePatents) as t
from temp
group by year,id,n; 
/*Merge the summary on the rolling logic year between n-2 and n*/
create table want(drop=rn n) as
select a.*,sum
from temp a left join 
(select a.year,a.id,sum( b.t) as sum
from temp1 a left join temp1 b
on a.id=b.id and a.n-2<=b.n<=a.n
group by a.year,a.id) b
on a.id=b.id and a.year=b.year
order by rn;
quit;

novinosrin · Posted 04-11-2020 08:05 PM

Hi @StevenR Since Year values are simple integers rolling computations may find home in Hash too. I had some fun. The only requirement upfront is to sort by ID Year. Once that's done, it's a mere BY Group processing that uses a concept I deem as "Park and Pick".

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;

proc sort data=have out=temp;
by id year;
run;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("year") ;
   h.definedata ("_s") ;
   h.definedone () ;
 end;
 do until(last.id);
  set temp;
  by  id year;
  if first.year then _s=0;
   _s=sum(_s,	NUniquePatents);
  if last.year then h.add();
 end;
 array t(0:2) _temporary_;
 do until(last.id);
  set temp;
  by id year;
  if first.year then do;
   _n+1;
   _n1=mod(_n,3);
   h.find();
   t(_n1)=_s;
   sum=sum(of t(*));
  end;
  output;
 end;
 h.clear();
 call missing(of t(*),of _:);
 drop _:;
run;

proc print noobs;
run;

StevenR · Posted 06-18-2020 05:49 AM

Sorry seeing this message now, thank you for the detailed reply 🙂

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away