Hi,
Assume I have data set like the one below:
ID | Year | Var1 | Var2 |
123 | 2012 | 345 | 24 |
310 | 2012 | 4 | 324 |
400 | 2012 | 234 | 456 |
123 | 2013 | 675 | 46 |
200 | 2013 | 577 | 89 |
310 | 2013 | 46 | 45 |
400 | 2013 | 45 | 76 |
123 | 2014 | 123 | 234 |
200 | 2014 | 1234 | 35 |
400 | 2014 | 435 | 645 |
123 | 2015 | 231 | 236 |
200 | 2015 | 456 | 87 |
310 | 2015 | 435 | 534 |
400 | 2015 | 232 | 235 |
123 | 2016 | 123 | 679 |
200 | 2016 | 354 | 523 |
310 | 2016 | 233 | 754 |
400 | 2016 | 200 | 342 |
Now suppose, I want to create a new data set from this data set where for each year and ID I want to get the summation of the previous three years values for var 1 and var2...so that I get an output like this:
ID | Year | Var1 | Var2 |
123 | 2015 | 1143 | 304 |
200 | 2015 | 1811 | 124 |
310 | 2015 | 50 | 369 |
400 | 2015 | 714 | 1177 |
123 | 2016 | 1029 | 516 |
200 | 2016 | 2267 | 211 |
310 | 2016 | 481 | 579 |
400 | 2016 | 712 | 956 |
So for example, for ID 400 in 2016 I get the sum of the values for VAR 1 and VAR 2 in years 2015, 2014, 2013.
Notice that not in all years do all the IDs appear. Also suppose I have a large number of variables (VAR1 TO VARn) so I don't want to be typing in all their names.
How do I go about this?
Thank you!
You mention that not all years are present. How does that affect the sum? If the first year for an ID is 2012 and you have 2014 and 2015 but not 2013 do you want to sum 2012, 2014 and 2015 or only 2012 and 2014?
What if there are not 3 years for an ID? Especially what to do if there is only 1?
You may need to adjust your example output. You say "sum 3 years" but for ID 123 you have years 2012, 2013, 2014, 2015 and 2016. You show a desired sum of 1143 for VAR1 but that is the sum for 2012-2014 but you don't show any result for 2016 which from the description should be the sum of 2013-2015?
Hi,
So for example the ID = 200 is missing in 2012. So the output for ID=200 in 2015 should be the sum of only 2013 and 2014. So even if an ID is missing in the previous three years the sum should take only whatever the values that are available from the previous three years.
If there is only 1 year for an ID from the previous year, the sum would be only the value of that avaialble year.
There was a slight error in the example. I adjusted it.
Hope you can help me out now to get the given output file.
Thank you!!
If your familiar with formats look at multilabel formats.
Otherwise use a SQL self join with the condition being
a.year - b.year between 0 and 2
data have; infile cards truncover expandtabs; input ID Year Var1 Var2; cards; 123 2012 345 24 310 2012 4 324 400 2012 234 456 123 2013 675 46 200 2013 577 89 310 2013 46 45 400 2013 45 76 123 2014 123 234 200 2014 1234 35 400 2014 435 645 123 2015 231 236 200 2015 456 87 310 2015 435 534 400 2015 232 235 123 2016 123 679 200 2016 354 523 310 2016 233 754 400 2016 200 342 ; run; proc sql; create table temp as select a.*,var1,var2 from (select * from (select distinct id from have),(select distinct year from have)) as a left join have as b on a.year=b.year and a.id=b.id order by a.id,a.year; quit; data want; set temp; by id; array v1{0:2} _temporary_; array v2{0:2} _temporary_; if first.id then n=0; n+1; sum_v1=sum(of v1{*}); sum_v2=sum(of v2{*}); v1{mod(n,3)}=var1; v2{mod(n,3)}=var2; if n gt 3; keep id year sum_:; run;
Hey thank you so much!
Instead of var1 and var2, if I have a lot of variables that I want to do the same thing as above how would you code it? So that I don't have to keep on typing the variable names...
Make some macro variable to hold it . data have; infile cards truncover expandtabs; input ID Year Var1 Var2; cards; 123 2012 345 24 310 2012 4 324 400 2012 234 456 123 2013 675 46 200 2013 577 89 310 2013 46 45 400 2013 45 76 123 2014 123 234 200 2014 1234 35 400 2014 435 645 123 2015 231 236 200 2015 456 87 310 2015 435 534 400 2015 232 235 123 2016 123 679 200 2016 354 523 310 2016 233 754 400 2016 200 342 ; run; proc transpose data=have(obs=0 drop=id year) out=x; run; proc sql noprint; select _name_ into : names separated by ',' from x; select catt('array _',_name_,'{0:2} _temporary_;') into : arrays separated by ' ' from x; select catt('sum_',_name_,'=sum(of _',_name_,'{*});') into : sum separated by ' ' from x; select catt('_',_name_,'{mod(n,3)}=',_name_,';') into : v separated by ' ' from x; quit; proc sql; create table temp as select a.*,&names from (select * from (select distinct id from have),(select distinct year from have)) as a left join have as b on a.year=b.year and a.id=b.id order by a.id,a.year; quit; data want; set temp; by id; &arrays if first.id then n=0; n+1; ∑ &v if n gt 3; keep id year sum_:; run;
∑ should be & sum
Thank you! Sorry to trouble you, but could you tell give me the correct code for the final part:
data want; set temp; by id; &arrays if first.id then n=0; n+1; ∑ &v if n gt 3; keep id year sum_:; run;
I replaced
∑
but the resulting output has empty values now for the summed columns.
Sorry to bother you. Could you retype the final part:
data want; set temp; by id; &arrays if first.id then n=0; n+1; ∑ &v if n gt 3; keep id year sum_:; run;
This forum keep eating my code. Hope this time could work. data have; infile cards truncover expandtabs; input ID Year Var1 Var2; cards; 123 2012 345 24 310 2012 4 324 400 2012 234 456 123 2013 675 46 200 2013 577 89 310 2013 46 45 400 2013 45 76 123 2014 123 234 200 2014 1234 35 400 2014 435 645 123 2015 231 236 200 2015 456 87 310 2015 435 534 400 2015 232 235 123 2016 123 679 200 2016 354 523 310 2016 233 754 400 2016 200 342 ; run; proc transpose data=have(obs=0 drop=id year) out=x; run; proc sql noprint; select _name_ into : names separated by ',' from x; select catt('array _',_name_,'{0:2} _temporary_;') into : arrays separated by ' ' from x; select catt('sum_',_name_,'=sum(of _',_name_,'{*});') into : sums separated by ' ' from x; select catt('_',_name_,'{mod(n,3)}=',_name_,';') into : v separated by ' ' from x; quit; proc sql; create table temp as select a.*,&names from (select * from (select distinct id from have),(select distinct year from have)) as a left join have as b on a.year=b.year and a.id=b.id order by a.id,a.year; quit; data want; set temp; by id; &arrays if first.id then n=0; n+1; &sums &v if n gt 3; keep id year sum_:; run;
Thank you so much! it worked now. I will try it with my original data and see now.
I truly appreciate your help. 🙂
Hi, so I ran the code with my original data. And in the final steps I got the following error (see attachement).
Note: ID variable is termed as CUSIP.
BATEMAN_EICHLER_HILL_RICHARD was one of the variables in the original data.
I would greatly appreciate if you can give this a look.
The name is too long - maximum 32 chars in SAS. You'll need to truncate the prefix or your variable name somehow.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.