BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
StevenR
Obsidian | Level 7

Hi, I have a question regarding how to calculate three year moving or rolling sum for each ID in a panel dataset. 

The variables of interest are as follows:

Year

ID

NUniquePatents

Sum of N unique Patents by ID and for a three year moving window

1901

A

1

2

1901

A

1

2

1901


A

0

2

1915

B

1

1

1915

A

0

2

1920

B

1

2

1920

A

0

2

1924

A

1

2

1924

A

1

2

1924

B

0

2

1924

A

0

2

1936

A

0

2

1936

B

1

2

 

I am looking for a code to calculate the last column highlighted in red.

In column three-1 represents unique value and 0 represents non-unique value. I want to sum up these 0s and 1s for each ID during a three year moving window. Example of values in the last column: first time period: 1901, second time period: 1901+1915, third time period: 1901+1915+1920, fourth time period: 1915+1920+1924, fifth time period: 1920+1924+1936. 

 

I have already calculated the third column-i.e. number of unique patents. Now, I want to sum the number of unique values by ID and by three year moving window.

 

So for example, from the above table, assume that 1901 is the first year in the dataset , then the sum of unique value should be 2 for A. In 1915, the unique value should once again be equal to 2 as there are no new unique values for A in this time period (1901 and 1915). In 1920, A has no unique values , so for this time period (1901, 1915 and 1920) the sum should still be 2. Now for 1924, A has 2 unique values, but the sum should be calculated for the previous two years window and the current year (i.e. sum of A for years 1924, 1920 and 1915). In 1924, A has 2, 1920 A has 0 and in 1915 A has 0. Hence the sum of A for 1924 will be 2. To calculate value of A for 1936, sum of 1936, 1924 and 1920 should be considered, hence A has a value of 0 in 1936, in 1924 it has a unique value of 2 and it has a value of 0 in 1920. Hence for 1936 A is 0+2+0 which is 2. I am looking at the third column when I count the number of unique values. The same is to be done for B and other IDs. 

 

I did try using proc SQL in SAS with a group by statement but it didn't produce the desired nuanced result. Please let me know if any of you can think of a solution. Thank you for your help. 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Here is a solution using proc expand:

 

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;

proc sql;
create table temp as
select id, year, sum(nuniquepatents) as n
from have
group by id, year;
quit;

proc expand data=temp out=sums(drop=TIME);
by id;
convert n=nsum / transformout=(movsum 3);
run;

proc sql;
create table want as
select a.*, b.nsum
from
	have as a inner join
	sums as b on a.id=b.id and a.year=b.year;
drop table temp, sums;
quit;
PG

View solution in original post

8 REPLIES 8
mkeintz
PROC Star

How is it that for 1915/A you have a three-year sum of 2, when you have no patents within 2 years of 1915.

 

Also, are you thinking of 3-year trailing windows, centered-windows, or leading-windows?

 

Oops forget it.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

I think you need a series of steps:

  1. Run a proc (proc summary here) that gives the totals of nuniquepatents for each year/id, call it variable yearly_count.
  2. sort it by id/year, and then generate a trailing three "year" window summing yearly_count into three_year_count.
  3. Use the output of #2 as a lookup table as you re-read and output the original data set.

 

Here's an untested version:

proc summary data=have noprint nway completetypes;
  class id year;
  var nuniquepatents;
  output out=need (drop=_type_ _freq_) sum=yearly_count;
run;

data need2 ;
  set need;
  by id ;
  if yearly_count=. then yearly_count=0;
  three_year_count= yearly_count + ifn(lag(id)=id,lag(yearly_count),0) + ifn(lag2(id)=id,lag2(yearly_count),0);
run;

data want;
  set have need2 (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'need2');
      h.definekey('id','year');
      h.definedata('yearly_count','three_year_count');
      h.definedone();
  end;
  h.find();
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
StevenR
Obsidian | Level 7

Hi @mkeintz 

Thanks for you help. I ran the body of codes you suggested and  the yearly count is calculated correctly by ID but there a couple of issues with three year count variable

 

Maybe I didn't explain it clearly. Sorry if that was the case. Considering the table below as an example:

Year

ID

Unique Class

NUniquePatents

Yearly Count

Three Year Count

1901

A

56

1

3

3

1901

A

56

1

3

3

1901


A

56

1

3

3

1915

A

56

1

2

5

1915

A

56

1

2

5

1924

A

56

1

4

9

1924

A

56

1

4

9

1924

A

56

1

4

9

1924

A

56

1

4

9

1936

A

56

0

0

6

1936

A

56

0

0

6


1. The first issue with the code is that after three years the code resets the sum, but it should be rolling. If we look at the table above for A, the count of nuniuniquepatents is 0 for 1936 but while the yearly count can be zero, the three year count should take into account sum of 1936+1924+1915, so the 3 year count should be 0+4+2, i.e. 6 as shown in the highlighted last row

2. The second issue is that the three year count window is always supposed to take into consideration the current year + the previous two years. Only for the very first and second observations in the dataset, as no previous years are available -the three year count can take into account only the current year for first observation and the current and previous year for second observation. But from n3 onwards, for no other observations, the count should follow this logic.

 

So the count should resemble something similar to this: n1, n1+n2, n1+n2+n3, n2+n3+n4, n3+n4+n5, n4+n5+n6...........n10000000+n10000001+n10000002

 

But thank you for your help and sorry if there was any miscommunication before.

 

 

 

PGStats
Opal | Level 21

Here is a solution using proc expand:

 

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;

proc sql;
create table temp as
select id, year, sum(nuniquepatents) as n
from have
group by id, year;
quit;

proc expand data=temp out=sums(drop=TIME);
by id;
convert n=nsum / transformout=(movsum 3);
run;

proc sql;
create table want as
select a.*, b.nsum
from
	have as a inner join
	sums as b on a.id=b.id and a.year=b.year;
drop table temp, sums;
quit;
PG
StevenR
Obsidian | Level 7
Hi, thank you so much for your help, the entire code executed perfectly!
novinosrin
Tourmaline | Level 20

Hi @StevenR  Here is my stab at it using SQL

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;
/*sequence number for years and get row number*/
data temp;
 set have ;
 by year;
 if first.year then n+1;
 rn+1;
run;

proc sql;
/*summarize by year,id*/
create table temp1 as
select year,id,n ,sum(NUniquePatents) as t
from temp
group by year,id,n; 
/*Merge the summary on the rolling logic year between n-2 and n*/
create table want(drop=rn n) as
select a.*,sum
from temp a left join 
(select a.year,a.id,sum( b.t) as sum
from temp1 a left join temp1 b
on a.id=b.id and a.n-2<=b.n<=a.n
group by a.year,a.id) b
on a.id=b.id and a.year=b.year
order by rn;
quit;

 

results.PNG

novinosrin
Tourmaline | Level 20

Hi @StevenR  Since Year values are simple integers rolling computations may find home in Hash too. I had some fun. The only requirement upfront is to sort by ID Year. Once that's done, it's a mere BY Group processing that uses a concept I deem as "Park and Pick".

data have;
input Year    ID$  NUniquePatents;
datalines;
1901    A   1   2
1901    A   1   2
1901    A   0   2
1915    B   1   1
1915    A   0   2
1920    B   1   2
1920    A   0   2
1924    A   1   2
1924    A   1   2
1924    B   0   2
1924    A   0   2
1936    A   0   2
1936    B   1   2
;

proc sort data=have out=temp;
by id year;
run;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("year") ;
   h.definedata ("_s") ;
   h.definedone () ;
 end;
 do until(last.id);
  set temp;
  by  id year;
  if first.year then _s=0;
   _s=sum(_s,	NUniquePatents);
  if last.year then h.add();
 end;
 array t(0:2) _temporary_;
 do until(last.id);
  set temp;
  by id year;
  if first.year then do;
   _n+1;
   _n1=mod(_n,3);
   h.find();
   t(_n1)=_s;
   sum=sum(of t(*));
  end;
  output;
 end;
 h.clear();
 call missing(of t(*),of _:);
 drop _:;
run;

proc print noobs;
run;

results.PNG

 

StevenR
Obsidian | Level 7

Sorry seeing this message now, thank you for the detailed reply 🙂

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2909 views
  • 4 likes
  • 4 in conversation